Senior Site Reliability Engineer

Technical Requirements

Kubernetes, Go, Python, Linux, Cloud Infrastructure, Distributed Systems, Observability, Terraform

About the Role

HavocAI is seeking a Senior Site Reliability Engineer with 7+ years of experience designing, operating, and scaling highly reliable distributed systems. In this role, you will serve as a key technical leader within the Cloud Platform team, responsible for ensuring the availability, performance, and resilience of mission-critical services supporting autonomy, simulation, and data-intensive workloads.

Key Responsibilities

Reliability Engineering & Architecture

  • Design and evolve reliability architecture for distributed and cloud-hosted systems
  • Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
  • Partner with platform and application teams to design systems for reliability, scalability, and operability
  • Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines

Operations & Incident Management

  • Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
  • Conduct root cause analysis for complex production incidents and drive long-term corrective actions
  • Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
  • Reduce operational toil through tooling, automation, and process improvements

Observability & Performance

  • Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
  • Ensure services and data pipelines are observable, debuggable, and performant in production
  • Drive performance analysis and tuning across infrastructure, application, and service layers

Automation & Platform Collaboration

  • Build automation to improve system reliability, deployment safety, and recovery processes
  • Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
  • Support and improve Kubernetes-based environments and containerized workloads

Security & Resilience

  • Collaborate with security teams to ensure secure and resilient system design
  • Participate in disaster recovery planning, backup strategy, and resilience testing

Requirements

  • 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
  • Strong experience operating large-scale distributed production systems
  • Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
  • Hands-on experience with Kubernetes and container orchestration
  • Programming or scripting experience in Go, Python, or similar languages
  • Experience designing and operating observability systems for production environments
  • Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required