Summary

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to join our team, reporting to the Sr. Engineering Manager. As a Senior Site Reliability Engineer, you will play a key role in designing, developing, and maintaining reliable, scalable, and highly available infrastructure for our API services. You will contribute heavily to the high-impact challenges of innovating, building, and maintaining Wikipedia’s data feeds for high-volume reusers. In this role, you will foster cross-department collaboration with the Wikimedia Foundation SRE teams. You will own reliability targets (SLOs) for critical APIs, balancing performance, cost, and availability through data-driven decisions.

You will be involved in designing and running the infrastructure and services that interact with the base of Wikimedia Foundation’s projects, including, but not limited to: Kubernetes clusters, application servers, code collaboration infrastructure, and other developer-facing services. You will participate in incident response and be on-call.

You are responsible for:

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
Reduce operational toil by implementing automation-first solutions

Skills and Experience:

Automation & Configuration Management: Experience with Infrastructure as Code (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems (e.g., AWS, GCP)
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD)
SRE Principles & Observability: Strong understanding of SLOs, SLIs, and observability tools (e.g., Prometheus, OpenTelemetry)
Incident Management: Experience with incident response, on-call practices, and leading postmortems

Senior Site Reliability Engineer, Wikimedia Enterprise