The Outcomes You’ll Deliver:
In the first few months, you'll focus on building a clear understanding of our systems and establishing the foundation for stronger observability across our platforms. As you settle in, your scope will grow to include broader reliability and performance initiatives.
- Assess and improve visibility: Work with engineering teams to review our current dashboards, metrics, and logs, identify the biggest gaps, and make targeted improvements that help us better understand system health.
- Tighten monitoring and alerting: Refine alerts and dashboards for the most critical services so we can catch issues earlier and respond faster.
- Build observability into delivery: Add instrumentation and telemetry into existing build and deploy processes to make reliability checks part of our normal release workflow.
- Clarify what "reliable" means: Help define initial SLIs and SLOs for a few core user flows, aligning the team on what good performance and availability look like.
- Streamline incident response: Partner with the Event Commander/on-call rotation to improve how we communicate, coordinate, and follow up during incidents.
- Reduce manual effort: Automate routine checks and monitoring tasks to free up engineers for more impactful work.
Over time, you'll take on a larger role shaping how we measure, monitor, and improve reliability across all services: setting standards, mentoring others, and helping engineering teams make data-driven decisions about performance and stability.
In this role, you can expect to:
- Contribute to system observability by implementing and improving metrics, alerting, and dashboards for better insight and faster recovery.
- Develop automation, tooling, and monitoring solutions to support high service availability.
- Partner with application and quality engineering teams to implement best practices in reliability, release automation, and testing.
- Drive operational excellence through proactive incident prevention, blameless postmortems, and capacity planning.
- Participate in on-call rotations to support critical services and ensure rapid response to incidents.
To thrive in this role, you have:
- Solid experience in Python, especially for automation, tooling, and data-driven operational tasks.
- Proficiency in at least one of Java, C++, or Go.
- Strong understanding of Linux systems, cloud infrastructure (AWS, GCP, or Azure), and modern deployment practices (Docker, Kubernetes, Terraform).
- Experience with CI/CD pipelines, version control, and automated testing frameworks.
- Experience with observability tools (e.g., Prometheus, Grafana, ELK, Datadog) and log/metric analysis for diagnosing issues.
- Proven experience facilitating and documenting Critical User Journeys and translating them into actionable SLAs/SLOs for automation.
- Demonstrated ability to collaborate with cross-functional teams and communicate clearly in high-impact situations.
- A problem-solving mindset that treats reliability as a shared responsibility across engineering.
- Familiarity with AI-augmented development tools (Claude, Codex) as part of a modern engineering workflow.
Nice to Have
- Experience writing or maintaining end-to-end or integration tests for distributed systems.
- Background in performance testing, capacity planning, or chaos engineering.
- Contributions to internal developer tooling or reliability-focused frameworks.
- Exposure to security, compliance, or change management processes in production environments.
- Relevant certifications.