Research Engineer - Large-Scale RL Training Infrastructure
About the Role
We are seeking a Research Engineer for Large-Scale RL Training Infrastructure to join our team at Prime Intellect. This remote position offers the opportunity to work on the AI infrastructure that supports the development of superintelligent systems. If you are passionate about optimizing performance and improving the efficiency of large-scale model training, we want to hear from you.
What You'll Work On
- Build and optimize the systems infrastructure behind large-scale RL and distributed training workloads.
- Improve end-to-end training efficiency across compute, memory, networking, and scheduling layers.
- Design and implement low-level performance optimizations, including kernels, communication paths, and runtime improvements.
- Work on distributed training systems spanning data, tensor, and pipeline parallel workloads.
- Help shape the architecture of our RL training stack, including async rollout and post-training systems.
- Contribute to open-source libraries and internal infrastructure used for frontier-scale model training.
- Collaborate closely with researchers and infrastructure engineers to translate bottlenecks into concrete systems improvements.
- Stay at the frontier of training systems, inference systems, compiler/runtime tooling, and hardware-aware optimization techniques.
Requirements
- Strong systems engineering experience in AI/ML infrastructure, especially around large-scale model training or inference.
- Deep familiarity with PyTorch and distributed training frameworks such as PyTorch Distributed, DeepSpeed, FSDP, Megatron, vLLM, Ray, or related tooling.
- Experience optimizing training performance across kernels, memory movement, communication overhead, or parallelization strategy.
- Hands-on experience with large-scale training techniques including data parallelism, tensor parallelism, and pipeline parallelism.
- Strong understanding of GPU architecture, profiling, and performance debugging.
- Ability to identify bottlenecks across the stack and drive improvements from first principles.
- Comfort working in a fast-moving environment with ambiguous problems and high ownership.
Nice to Have
- Experience writing or optimizing CUDA / Triton kernels.
- Experience with compiler or runtime optimization for ML systems.
- Experience working on RL training infrastructure, rollout systems, or asynchronous training pipelines.
- Experience with multi-node GPU clusters and high-performance networking.
- Contributions to open-source ML systems or infrastructure projects.
- Interest in publishing technical work or sharing insights through engineering blogs and technical writing.
What We Offer
- Competitive compensation, including equity.
- Flexible work arrangements, with the option to work remotely or in person from our San Francisco office.
- Visa sponsorship and relocation support for international candidates.
- Quarterly team offsites, hackathons, conferences, and learning opportunities.
- A deeply technical, high-agency team working on infrastructure for open superintelligence.
If you’re excited about building the systems foundation for frontier-scale RL and open superintelligence, we’d love to hear from you.