DevOps/MLOps Engineer (ML / LLM Infrastructure)

Technical Requirements

GCP, Kubernetes, Docker, GitHub Actions, Jenkins, Airflow, Terraform, Python

Responsibilities:

  • Design, build, and operate scalable ML infrastructure on GCP (GKE), supporting both experimentation and production workloads for LLMs and NLP systems.
  • Manage Kubernetes-based environments (GKE): deployment, scaling, upgrades, and reliability of training and inference workloads across GPU/TPU/CPU pools.
  • Build and maintain CI/CD pipelines (GitHub Actions, Jenkins) to automate testing, training, and deployment of ML services and infrastructure.
  • Implement infrastructure as code (Terraform, Ansible) to provision and manage cloud resources in a reproducible, secure, and cost-efficient way.
  • Ensure observability of ML systems: monitoring, logging, and alerting for infrastructure, pipelines, and production inference workloads.
  • Collaborate with ML engineers and Data Engineers to design and support reliable training and inference pipelines.
  • Optimize resource utilization and cost, improving efficiency of training and serving infrastructure.
  • Troubleshoot and resolve issues across the ML platform - from data pipelines to distributed training and production deployments.
  • Contribute to engineering best practices: code reviews, automation, and continuous improvement of platform reliability and developer experience.

Required Qualifications:

  • Experience: 4+ years in DevOps, Platform Engineering, or ML Infrastructure roles, with a strong understanding of production systems and distributed workloads.
  • Cloud & Infrastructure: Hands-on experience with GCP; experience with other major cloud platforms is a plus. Strong understanding of cloud-native architectures and experience designing scalable systems for compute- and data-intensive workloads.
  • Kubernetes & Containers: Solid experience with Docker and Kubernetes (preferably GKE), including deploying, scaling, and operating production workloads. Familiarity with Helm and Kubernetes networking fundamentals.
  • CI/CD & Automation: Experience building and maintaining CI/CD pipelines (GitHub Actions, Jenkins, or similar) to automate testing, deployment, and infrastructure changes.
  • Workflow Orchestration: Experience with Airflow (or similar tools).
  • Infrastructure as Code: Strong experience with Terraform (preferred) or similar tools for provisioning and managing infrastructure in a reproducible way.
  • Programming: Strong hands-on experience with scripting languages (Bash and/or Python).
  • Observability & Reliability: Experience with monitoring and logging systems (e.g., Prometheus, Grafana). Understanding of reliability, alerting, and debugging in distributed systems.
  • ML Infrastructure Understanding: Familiarity with the ML lifecycle (training, evaluation, inference) and experience supporting ML workloads in production environments.
  • Collaboration: Ability to work closely with ML Engineers and Data Engineers, translating ML requirements into reliable and scalable infrastructure solutions.

What we offer:

  • Office or remote — it’s up to you.
  • Remote onboarding
  • Performance bonuses
  • Training and learning opportunities through the company’s library, internal resources, and partner programs
  • Health and life insurance
  • Wellbeing program and corporate psychologist
  • Reimbursement of expenses for Kyivstar mobile communication
Kyivstar.Tech