About the Role

HavocAI is seeking a Senior Site Reliability Engineer with 7+ years of experience designing, operating, and scaling highly reliable distributed systems. In this role, you will serve as a key technical leader within the Cloud Platform team, responsible for ensuring the availability, performance, and resilience of mission-critical services supporting autonomy, simulation, and data-intensive workloads.

Key Responsibilities

Reliability Engineering & Architecture

Design and evolve reliability architecture for distributed and cloud-hosted systems
Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
Partner with platform and application teams to design systems for reliability, scalability, and operability
Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines

Operations & Incident Management

Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
Conduct root cause analysis for complex production incidents and drive long-term corrective actions
Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
Reduce operational toil through tooling, automation, and process improvements

Observability & Performance

Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
Ensure services and data pipelines are observable, debuggable, and performant in production
Drive performance analysis and tuning across infrastructure, application, and service layers

Automation & Platform Collaboration

Build automation to improve system reliability, deployment safety, and recovery processes
Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
Support and improve Kubernetes-based environments and containerized workloads

Security & Resilience

Collaborate with security teams to ensure secure and resilient system design
Participate in disaster recovery planning, backup strategy, and resilience testing

Requirements

7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
Strong experience operating large-scale distributed production systems
Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
Hands-on experience with Kubernetes and container orchestration
Programming or scripting experience in Go, Python, or similar languages
Experience designing and operating observability systems for production environments
Must be a U.S. citizen and eligible to obtain a U.S. Government security clearance if required
