Job Description
- Define a roadmap for all engineering teams to adopt fully automated, self-service, highly scalable, cost-efficient, observable, auditable, and reliable infrastructure services as standard practice.
- Drive the implementation of this roadmap across the engineering organization, collaborating with SREs and senior engineers while also actively contributing to solving critical challenges.
- Provide expert technical guidance and ongoing engineering design review to teams involved in large migrations, service-oriented architecture, architectural shifts, and capacity growth.
- Cultivate a metrics-driven operational culture, establishing standards for SLO definition and review, as well as logging, monitoring, alerting, and on-call practices.
- Continuously improve blameless incident management processes, root cause analyses, outage prevention, and service recovery strategies across the engineering organization.
- Collaborate closely with Security, Quality, and Product teams to prioritize security, privacy, compliance, reliability, and business continuity objectives in our overall roadmap.
- Propose and lead major improvements to our production systems that materially benefit our business and engineering teams.
- Mentor and coach engineers, fostering curiosity and effective problem-solving skills.