Who We Are
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction.
Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.
What We’re Looking For
Lightning AI is seeking a Senior Network Engineer with hands-on Cumulus Linux expertise to build and scale the network backbone behind our AI infrastructure platform. You’ll play a critical role in designing highly reliable, automated data center networks that support some of the most demanding AI workloads in the world.
What You'll Do
- Design and deploy scalable spine-leaf network architectures for AI data centers
- Engineer high-performance Ethernet fabrics supporting GPU clusters and AI workloads
- Build and maintain EVPN/VXLAN, BGP, and high-speed routing environments
- Optimize east-west traffic flows for AI training and inference operations
- Support RoCE/RDMA networking and low-latency transport technologies
- Support backbone, DCI, WAN, and edge connectivity solutions
- Collaborate with compute, storage, AI platform, and operations teams to deliver integrated infrastructure solutions
- Develop automation and Infrastructure-as-Code (IaC) solutions for network provisioning and operations
- Troubleshoot complex network, performance, and congestion issues across distributed environments
- Improve network observability, telemetry, and operational visibility
Required Qualifications
- Hands-on experience with the Cumulus Linux network operating system (NOS)
- 5+ years of experience in large-scale data center networking
- Experience with spine-leaf architectures and L3 fabrics
- Experience with BGP, EVPN, and VXLAN
- Experience operating high-performance computing (HPC) or GPU-dense environments
- Experience designing networks for hyperscalers, neoclouds, or high-scale SaaS infrastructure
- Experience with network automation using Python, Ansible, or Terraform
- Experience with network observability tooling and telemetry pipelines
Ideal Experience
- Familiarity with NVIDIA networking (Spectrum, Quantum, BlueField, etc.)
- Familiarity with RDMA, RoCE, or InfiniBand fabrics
- Experience with multi-region backbone design
- Exposure to bare-metal provisioning systems