Responsibilities:
- Design, build, and operate scalable ML infrastructure on GCP (GKE), supporting both experimentation and production workloads for LLMs and NLP systems.
- Manage Kubernetes-based environments (GKE): deployment, scaling, upgrades, and reliability of training and inference workloads across GPU/TPU/CPU pools.
- Build and maintain CI/CD pipelines (GitHub Actions, Jenkins) to automate testing, training, and deployment of ML services and infrastructure.
- Implement infrastructure as code (Terraform, Ansible) to provision and manage cloud resources in a reproducible, secure, and cost-efficient way.
- Ensure observability of ML systems: monitoring, logging, and alerting for infrastructure, pipelines, and production inference workloads.
- Collaborate with ML Engineers and Data Engineers to design and support reliable training and inference pipelines.
- Optimize resource utilization and cost, improving efficiency of training and serving infrastructure.
- Troubleshoot and resolve issues across the ML platform, from data pipelines to distributed training and production deployments.
- Contribute to engineering best practices: code reviews, automation, and continuous improvement of platform reliability and developer experience.
Required Qualifications:
- Experience: 4+ years in DevOps, Platform Engineering, or ML Infrastructure roles, with a strong understanding of production systems and distributed workloads.
- Cloud & Infrastructure: Hands-on experience with GCP; experience with other major cloud platforms is a plus. Strong understanding of cloud-native architectures and experience designing scalable systems for compute- and data-intensive workloads.
- Kubernetes & Containers: Solid experience with Docker and Kubernetes (preferably GKE), including deploying, scaling, and operating production workloads. Familiarity with Helm and Kubernetes networking fundamentals.
- CI/CD & Automation: Experience building and maintaining CI/CD pipelines (GitHub Actions, Jenkins, or similar) to automate testing, deployment, and infrastructure changes.
- Workflow Orchestration: Experience with Airflow (or similar tools).
- Infrastructure as Code: Strong experience with Terraform (preferred) or similar tools for provisioning and managing infrastructure in a reproducible way.
- Programming: Strong hands-on experience with scripting languages (Bash and/or Python).
- Observability & Reliability: Experience with monitoring and logging systems (e.g., Prometheus, Grafana). Understanding of reliability, alerting, and debugging in distributed systems.
- ML Infrastructure Understanding: Familiarity with the ML lifecycle (training, evaluation, inference) and experience supporting ML workloads in production environments.
- Collaboration: Ability to work closely with ML Engineers and Data Engineers, translating ML requirements into reliable and scalable infrastructure solutions.
What we offer:
- Office or remote work, whichever you prefer
- Remote onboarding
- Performance bonuses
- Training and development: access to the company’s library, internal resources, and partner programs
- Health and life insurance
- Wellbeing program and corporate psychologist
- Reimbursement of Kyivstar mobile service expenses