Prime Intellect · 18.04.26

Research Engineer - Large-Scale RL Training Infrastructure

$120K–$150K/year

About the Role

We are seeking a Research Engineer for Large-Scale RL Training Infrastructure to join our innovative team at Prime Intellect. This remote position offers the opportunity to work on cutting-edge AI infrastructure that supports the development of superintelligent systems. If you are passionate about optimizing performance and enhancing the efficiency of large-scale model training, we want to hear from you!

What You'll Work On

  • Build and optimize the systems infrastructure behind large-scale RL and distributed training workloads.
  • Improve end-to-end training efficiency across compute, memory, networking, and scheduling layers.
  • Design and implement low-level performance optimizations, including kernels, communication paths, and runtime improvements.
  • Work on distributed training systems spanning data, tensor, and pipeline parallel workloads.
  • Help shape the architecture of our RL training stack, including async rollout and post-training systems.
  • Contribute to open-source libraries and internal infrastructure used for frontier-scale model training.
  • Collaborate closely with researchers and infrastructure engineers to translate bottlenecks into concrete systems improvements.
  • Stay at the frontier of training systems, inference systems, compiler/runtime tooling, and hardware-aware optimization techniques.

Requirements

  • Strong systems engineering experience in AI/ML infrastructure, especially around large-scale model training or inference.
  • Deep familiarity with PyTorch and distributed training frameworks such as PyTorch Distributed, DeepSpeed, FSDP, Megatron, vLLM, Ray, or related tooling.
  • Experience optimizing training performance across kernels, memory movement, communication overhead, or parallelization strategy.
  • Hands-on experience with large-scale training techniques including data parallelism, tensor parallelism, and pipeline parallelism.
  • Strong understanding of GPU architecture, profiling, and performance debugging.
  • Ability to identify bottlenecks across the stack and drive improvements from first principles.
  • Comfort working in a fast-moving environment with ambiguous problems and high ownership.

Nice to Have

  • Experience writing or optimizing CUDA / Triton kernels.
  • Experience with compiler or runtime optimization for ML systems.
  • Experience working on RL training infrastructure, rollout systems, or asynchronous training pipelines.
  • Experience with multi-node GPU clusters and high-performance networking.
  • Contributions to open-source ML systems or infrastructure projects.
  • Interest in publishing technical work or sharing insights through engineering blogs and technical writing.

What We Offer

  • Competitive compensation, including equity.
  • Flexible work arrangements, with the option to work remotely or in person from our San Francisco office.
  • Visa sponsorship and relocation support for international candidates.
  • Quarterly team offsites, hackathons, conferences, and learning opportunities.
  • A deeply technical, high-agency team working on infrastructure for open superintelligence.

If you’re excited about building the systems foundation for frontier-scale RL and open superintelligence, we’d love to hear from you.
