DevOps/MLOps Engineer (ML / LLM Infrastructure)

• Posted yesterday

Technical Requirements

GCP, Kubernetes, Docker, GitHub Actions, Jenkins, Airflow, Terraform, Python

Responsibilities:

Design, build, and operate scalable ML infrastructure on GCP (GKE), supporting both experimentation and production workloads for LLMs and NLP systems.
Manage Kubernetes-based environments (GKE): deployment, scaling, upgrades, and reliability of training and inference workloads across GPU/TPU/CPU pools.
Build and maintain CI/CD pipelines (GitHub Actions, Jenkins) to automate testing, training, and deployment of ML services and infrastructure.
Implement infrastructure as code (Terraform, Ansible) to provision and manage cloud resources in a reproducible, secure, and cost-efficient way.
Ensure observability of ML systems: monitoring, logging, and alerting for infrastructure, pipelines, and production inference workloads.
Collaborate with ML Engineers and Data Engineers to design and support reliable training and inference pipelines.
Optimize resource utilization and cost, improving efficiency of training and serving infrastructure.
Troubleshoot and resolve issues across the ML platform - from data pipelines to distributed training and production deployments.
Contribute to engineering best practices: code reviews, automation, and continuous improvement of platform reliability and developer experience.

Required Qualifications:

Experience: 4+ years in DevOps, Platform Engineering, or ML Infrastructure roles, with strong understanding of production systems and distributed workloads.
Cloud & Infrastructure: Hands-on experience with GCP; experience with other major cloud platforms is a plus. Strong understanding of cloud-native architectures and experience designing scalable systems for compute- and data-intensive workloads.
Kubernetes & Containers: Solid experience with Docker and Kubernetes (preferably GKE), including deploying, scaling, and operating production workloads. Familiarity with Helm and Kubernetes networking fundamentals.
CI/CD & Automation: Experience building and maintaining CI/CD pipelines (GitHub Actions, Jenkins, or similar) to automate testing, deployment, and infrastructure changes.
Workflow Orchestration: Experience with Airflow (or similar tools).
Infrastructure as Code: Strong experience with Terraform (preferred) or similar tools for provisioning and managing infrastructure in a reproducible way.
Programming: Strong hands-on experience with scripting languages (Bash and/or Python).
Observability & Reliability: Experience with monitoring and logging systems (e.g., Prometheus, Grafana). Understanding of reliability, alerting, and debugging in distributed systems.
ML Infrastructure Understanding: Familiarity with the ML lifecycle (training, evaluation, inference) and experience supporting ML workloads in production environments.
Collaboration: Ability to work closely with ML Engineers and Data Engineers, translating ML requirements into reliable and scalable infrastructure solutions.

What we offer:

Office or remote — it’s up to you.
Remote onboarding
Performance bonuses
Training and development: opportunities to learn through the company’s library, internal resources, and partner programs
Health and life insurance
Wellbeing program and corporate psychologist
Reimbursement of expenses for Kyivstar mobile communication
