Senior Site Reliability Engineer

Technical Requirements

Python, Linux, AWS, Docker, Kubernetes, Terraform, CI/CD, Prometheus

The Outcomes You’ll Deliver:

In the first few months, you'll focus on building a clear understanding of our systems and establishing the foundation for stronger observability across our platforms. As you settle in, your scope will grow to include broader reliability and performance initiatives.

  • Assess and improve visibility: Work with engineering teams to review our current dashboards, metrics, and logs, identify the biggest gaps, and make targeted improvements that help us better understand system health.
  • Tighten monitoring and alerting: Refine alerts and dashboards for the most critical services so we can catch issues earlier and respond faster.
  • Build observability into delivery: Add instrumentation and telemetry into existing build and deploy processes to make reliability checks part of our normal release workflow.
  • Clarify what "reliable" means: Help define initial SLIs and SLOs for a few core user flows, aligning the team on what good performance and availability look like.
  • Streamline incident response: Partner with the Event Commander/on-call rotation to improve how we communicate, coordinate, and follow up during incidents.
  • Reduce manual effort: Automate routine checks and monitoring tasks to free up engineers for more impactful work.

Over time, you'll take on a larger role shaping how we measure, monitor, and improve reliability across all services: setting standards, mentoring others, and helping engineering teams make data-driven decisions about performance and stability.

In this role, you can expect to

  • Contribute to system observability by implementing and improving metrics, alerting, and dashboards for better insight and faster recovery.
  • Develop automation, tooling, and monitoring solutions to support high service availability.
  • Partner with application and quality engineering teams to implement best practices in reliability, release automation, and testing.
  • Drive operational excellence through proactive incident prevention, blameless postmortems, and capacity planning.
  • Participate in on-call rotations to support critical services and ensure rapid response to incidents.

To thrive in this role, you have

  • Solid experience in Python, especially for automation, tooling, and data-driven operational tasks.
  • Proficiency in at least one of Java, C++, or Go.
  • Strong understanding of Linux systems, cloud infrastructure (AWS, GCP, or Azure), and modern deployment practices (Docker, Kubernetes, Terraform).
  • Experience with CI/CD pipelines, version control, and automated testing frameworks.
  • Experience with observability tools (e.g., Prometheus, Grafana, ELK, Datadog) and log/metric analysis for diagnosing issues.
  • Proven experience facilitating and documenting Critical User Journeys and translating them into actionable SLAs/SLOs for automation.
  • Demonstrated ability to collaborate with cross-functional teams and communicate clearly in high-impact situations.
  • A problem-solving mindset that treats reliability as a shared responsibility across engineering.
  • Familiarity with AI-augmented development tools (Claude, Codex) as part of a modern engineering workflow.

Nice to Have

  • Experience writing or maintaining end-to-end or integration tests for distributed systems.
  • Background in performance testing, capacity planning, or chaos engineering.
  • Contributions to internal developer tooling or reliability-focused frameworks.
  • Exposure to security, compliance, or change management processes in production environments.
  • Relevant certifications.
PlayOn