Site Reliability Engineer

Technical Requirements

Postgres, AWS, Pulumi, Terraform, Kubernetes, OpenTelemetry, Grafana

About the Role

Supabase manages millions of Postgres instances and is growing. We are concentrating our reliability efforts into a dedicated SRE practice that ties the discipline together across the platform. You will be embedded within Service Operations, with the primary goal of making every engineering team more reliable by establishing the practices, frameworks, and feedback loops that allow them to own their reliability.

What You'll Own

  • Partner with service teams to define meaningful SLIs and SLOs grounded in customer experience, and build the error budget policies that turn them into engineering decisions.
  • Own and evolve the Operational Readiness Review (ORR) process, conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation.
  • Strengthen the incident-to-improvement pipeline by connecting postmortem findings to operational readiness gaps, identifying repeat failure patterns, and driving systemic fixes.
  • Act as the reliability expert teams pull in for architecture reviews, failure mode analysis, dependency mapping, and resilience design.
  • Identify and quantify operational toil across the organization, and build or advocate for automation that eliminates it.
  • Help teams design sustainable on-call practices, including alert quality, escalation paths, runbook coverage, and noise reduction.
  • Track and report on organization-wide operational maturity, surfacing systemic gaps and driving remediation.

You Might Be a Good Fit If You

  • Have 7+ years of experience in SRE, production engineering, or reliability-focused roles, including experience shaping SRE practices and driving adoption across engineering teams.
  • Have a software engineering mindset: you write code and build tools rather than only configuring them.
  • Have hands-on experience defining and operationalizing SLOs/SLIs at scale, including error budget policies that influenced engineering decisions.
  • Have deep experience with incident response, postmortem facilitation, and turning incident learnings into systemic improvements.
  • Have worked with large-scale multi-tenant systems (bonus: managed database platforms or Postgres).
  • Are proficient with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred, Terraform/CDK also acceptable).
  • Communicate clearly and persuasively, as this role requires influencing without authority across a distributed organization.