About the Role
As a Senior Site Reliability Engineer, you won’t just be "keeping the lights on." You will be an engineering force responsible for the architecture, scalability, and self-healing capabilities of our Brokerage-as-a-Service platform.
This role is centered on reducing toil through engineering. You will design and develop internal SRE platforms, automate complex workflows, and ensure our Kubernetes-based ecosystem can handle the demands of global financial markets. While this role includes critical on-call responsibilities to support our 24/7 global operations, your primary mission is to build and modernize systems that make manual intervention obsolete.
What You’ll Do
- Engineering & Automation: Design and develop internal tools and SRE platforms to eliminate repetitive tasks (toil) and improve developer velocity.
- Infrastructure as Code: Architect and maintain modular, reusable IaC using Terraform, and manage GitOps workflows via ArgoCD.
- Observability & Reliability: Implement OpenTelemetry standards and the Grafana stack (Alloy, Loki, Tempo, Mimir) to provide deep insight into system health. Define and manage SLIs, SLOs, and error budgets.
- Platform Governance: Review software architecture and Kubernetes metrics to ensure high availability, inform capacity planning, and optimize cost across AWS regions.
- Incident Engineering: Lead incident response, perform complex root-cause analysis (RCA), and champion a blameless post-mortem culture.
- Collaboration: Partner with engineering teams to foster the adoption of new tools, security standards, and reliability best practices.
What You’ll Need
- Linux & Networking Mastery: Proficiency in Linux administration, with a deep understanding of the TCP/IP stack, OSI model, DNS, and network troubleshooting.
- FinTech Background: Experience working in highly regulated financial environments or with FIX/API connectivity.
- Production Kubernetes: Hands-on experience managing production-grade clusters, including RBAC, autoscaling, Helm, and multi-cluster patterns.
- Cloud-Native Expertise (AWS): Strong grasp of AWS core services, security, and high-availability patterns. Proficiency with boto3 and the AWS CLI for automation.
- Modern CI/CD & GitOps: Experience building secure, automated delivery pipelines and operating GitOps workflows (ArgoCD).
- Code Proficiency: Strong scripting and development skills in Python or Golang, along with Bash and Ansible.
- Security Mindset: Experience with secrets management, vulnerability scanning, and securing the software supply chain.
- AI & Prompt Engineering: Familiarity with using LLMs, public MCP (Model Context Protocol) servers, or Amazon Bedrock AgentCore to enhance SRE workflows.
- Data & Middleware: Experience managing Kafka, MQ, SQS, or orchestration tools such as Airflow and Rundeck.