Site Reliability Engineer
Holistic Approach to Systems - Operations is a Software Problem - Team-Oriented Communication
At DigitalEd, a Site Reliability Engineer is responsible for ensuring our production systems meet our customers' uptime and service needs by using software engineering tools and capabilities rather than relentless toil. They are pragmatic, objective, and articulate, with strong communication and teamwork capabilities. They create effective tooling and automation that enable our teams to give our customers a compelling and seamless experience with the Mobius platform.
The SRE team designs, deploys, and manages DigitalEd’s internal Private Cloud Infrastructure as well as our customer-facing Google Public Cloud SaaS application infrastructure. We anticipate that the person in this role will ideally spend no more than 30 to 50% of their time on “ops”-related work, and the rest of their time on software development to improve the scalability, reliability, and availability of the Mobius application.
Outcomes & Key Responsibilities: What’s Expected of You
- System Design: Engage in and improve the whole lifecycle of our service — from inception and design, through deployment, operation and refinement
- System Support: Support our service through activities such as system design consulting, developing software platforms and frameworks, and capacity planning
- System Maintenance: Maintain our service by measuring and monitoring availability, latency and overall system health; support on-call rotations with operational duties that have not been addressed with automation
- Eliminate Toil: Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
- Incident Management: Practice sustainable incident response and blameless postmortems
Measures of Performance: How You Know You’re Doing Well
- Process Execution: Every project, automation task, and incident is executed well and completely. We ensure that all work in our system is done to the best of our ability given our knowledge, tooling, and experience.
- Customer Satisfaction: You ensure a high quality of service and the best customer experience by continually finding the next problem to solve, and solving it well.
- Effective Cooperation: Working continually with Customer Success and Development to ensure our customers' needs are met and exceeded.
Competencies & Experience: The Stuff that Makes You Great at This
- An understanding that system failure is normal, and the ability to embrace risk as part of the job
- Demonstrated success in working through blameless postmortem processes, using techniques such as “the infinite hows”
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Ability to debug and optimize code and automate routine tasks
- Ability to see the system as a whole and treat its interconnections with as much attention and respect as the components themselves
- Desire to automate this year’s toil away
The Technical Piece: The Knowledge and Exposure that this Role Can’t Operate Without
- Advanced expertise with at least one programming language, with a preference for Java and Python; polyglot preferred
- Extensive experience in an operational role, be it DevOps, SRE, or traditional network/server management
- Extensive experience in Linux
- Experience with cloud platforms, preference for GCP
- Experience with containers and orchestration
- Experience with database management
- Experience with incident management and response
- Experience with IaC (Terraform, Puppet, Git)
- Experience with general networking concepts and protocols, and storage fundamentals
We’re ultimately looking for a site reliability generalist with foundations across development, system operations, resiliency testing, security hardening, and performance engineering. We’re on the hunt for someone who’s comfortable with taking on new engineering challenges, defining potential solutions, and implementing designs in a team environment. That means drive, ownership and tenacity are the key traits of someone who will be super successful in this position and on this team. If you want to be an integral part of DigitalEd’s evolution towards contemporary application and infrastructure management practices, this could be a great role for you to leave your mark on.
The Culture Part
The spirit of our aspirational culture is rooted in the concept of ‘No Deposit, No Return’. If you don’t put anything into your professional experience, you won’t get anything out of it. To bring this to life, we believe in the pillars of our core values: Customer Orientation, Curiosity, Teamwork, Adaptability, Ownership and Coaching (for Leaders). If any of these words strike a chord, then we’ve got something in common.
Read through this posting and not sure if you’re qualified? Apply anyways. You never know where it could go, and we promise to read and review every application that comes through - with a magnifying glass we like to call the ‘Potential Detector’. Everyone has a great story, and we’d love to hear yours.
The DigitalEd People & Culture Team
PS - We know diverse teams make strong teams, so we welcome all individuals of diverse backgrounds, abilities, experiences, and perspectives to apply. If you require accommodation during the application process, simply let us know and we’ll work to ensure it’s a positive experience for you.