Senior MLOps Engineer - Training & Inference Optimization
About the Role
We are seeking a Senior MLOps Engineer to join our team remotely and lead our training and inference optimization work. In this role, you will architect the infrastructure that powers our next-generation AI models, ensuring they are state-of-the-art and production-ready. You will work with cutting-edge technologies and collaborate with a multicultural team dedicated to pushing the boundaries of quantum computing and artificial intelligence.
What You'll Do
- Architect and maintain scalable distributed training pipelines using NVIDIA NeMo, optimizing GPU utilization and implementing automated fault tolerance.
- Lead the deployment of large language models (LLMs) using vLLM, TensorRT-LLM, or SGLang, applying performance-tuning techniques to maximize throughput.
- Utilize SLURM, Flyte, Ray, or SkyPilot for workload orchestration across diverse cloud providers.
- Standardize model tracking and versioning using MLflow, ensuring reproducible training runs.
- Conduct deep-dive profiling and bottleneck analysis across the full stack, from CUDA kernels to Python-level orchestration.
- Monitor and optimize GPU expenditures through intelligent scaling policies.
- Drive the engineering roadmap, perform rigorous code reviews, and mentor junior engineers.
Requirements
- 5+ years of experience in MLOps, DevOps, or Software Engineering, with at least 2 years focused on LLM infrastructure.
- Expert-level proficiency in PyTorch and the NVIDIA stack (CUDA, NCCL, Triton).
- Hands-on experience with NVIDIA NeMo or Megatron-Bridge for distributed training.
- Proven experience with SLURM, Flyte, Ray, or SkyPilot for cluster management.
- Deep expertise in Kubernetes and Kubernetes operators.
- Mastery of Python and a functional understanding of C++ or Rust.
- Familiarity with high-performance networking and NVIDIA H200/B200 architectures.
Nice to Have
- Active contributions to relevant open-source projects.
- Experience with model compression techniques.
- Expertise in ML observability stacks like Prometheus and Grafana.
What We Offer
- Comprehensive relocation package to help you settle into your new role.
- Visa sponsorship for international candidates.
- Competitive salary with performance bonuses.
- Language courses to help you adapt to your new environment.
- A multicultural and inclusive workplace that values diversity.
- Opportunities for professional growth and development.
- Work alongside world-leading experts in AI and quantum computing.
This Senior MLOps Engineer position offers an exciting opportunity to work remotely with a leading quantum computing company. Enjoy competitive compensation and comprehensive relocation support.