Senior Engineer - Distributed Systems & ML Large-Scale Training
About the Role
We're hiring a Senior Engineer specializing in Distributed Systems and ML Large-Scale Training to join our innovative team at Pluralis Research. In this remote role, you will play a critical part in implementing a novel substrate for training distributed ML models that operate efficiently under consumer-grade internet conditions. Your expertise will help shape the future of community-trained models, ensuring they are robust and self-sustaining.
What You'll Do
- Design and implement large-scale distributed training systems optimized for heterogeneous hardware under low-bandwidth, high-latency conditions.
- Develop and optimize model-parallel training strategies using custom sharding techniques to minimize communication overhead.
- Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
- Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
- Build monitoring and metrics systems to track training progress and model quality and to identify system bottlenecks.
- Architect resilient training systems capable of handling node failures and network partitions.
- Design peer-to-peer topologies for decentralized coordination across non-co-located nodes.
- Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.
Requirements
- 5+ years of experience building and operating distributed systems in production.
- Hands-on expertise with distributed training frameworks such as FSDP, DeepSpeed, or Megatron.
- Deep understanding of parallelism techniques for large-scale training, including data, tensor, and pipeline parallelism.
- Expert-level proficiency in Python with production experience in concurrency, error handling, and clean architecture.
- Strong networking fundamentals including P2P systems, gRPC, routing, and NAT traversal.
- Experience optimizing GPU workloads, memory management, and large-scale compute efficiency.
Nice to Have
- Familiarity with cloud platforms and services.
- Experience with containerization technologies like Docker.
- Knowledge of machine learning frameworks such as TensorFlow or PyTorch.
What We Offer
- Equity-heavy compensation with meaningful ownership in a mission-driven company.
- Competitive base salary for senior engineering roles in the United States.
- Visa sponsorship available for exceptional candidates.
- Remote-first work environment with optional access to our Melbourne hub.
- Join a world-class team with members previously at Google, Amazon, Microsoft, and leading startups.
- Be part of a company backed by Union Square Ventures and other tier-1 investors.
This Senior Engineer role at Pluralis Research offers an opportunity to work on distributed systems and machine learning at scale. With a competitive base salary and equity-heavy compensation, it is an attractive position for experienced professionals.