Principal Software Engineer - AI Infrastructure Innovation (Remote)
About the Role
We are seeking a Principal Software Engineer - AI Infrastructure Innovation to join our team at Oracle. This remote position offers the chance to work on pioneering AI and HPC networking solutions for GPU superclusters at massive scale. You will play a critical role in designing and delivering state-of-the-art RDMA-based networking that enables high performance for AI training and inference.
What You'll Do
- Lead architecture, system design, and implementation for high-performance RDMA solutions across OCI’s AI/HPC platforms.
- Innovate on network and TCP performance, identifying necessary changes across the kernel, NIC, switch, transport, protocol, storage, and GPU communication layers.
- Develop production-grade, high-performance software features with a focus on reliability, observability, and security.
- Define performance goals and success metrics; design benchmarks and conduct large-scale experiments to validate throughput, latency, and tail behavior.
- Collaborate with GPU platform, storage, database, and control-plane teams to deliver end-to-end solutions and influence OCI-wide network architecture and standards.
- Mentor engineers, provide technical leadership and reviews, and contribute to the long-term roadmap and technical strategy.
Requirements
- Strong software engineering background with a deep understanding of data structures and algorithms.
- Demonstrated ability to optimize large-scale systems for low latency and high throughput.
- Experience in developing, shipping, and operating high-performance production code.
- Ability to lead technically, mentor others, and deliver results in complex problem spaces.
- BS/MS in Computer Science, Electrical/Computer Engineering, or equivalent practical experience.
Nice to Have
- Experience with RDMA networking (RoCE and/or InfiniBand).
- Familiarity with AI/HPC stacks and workloads: NCCL/RCCL/MPI, Slurm, and GPU communication patterns.
- Experience integrating GPUDirect and NVMe-oF access in production.
- Hands-on experience with observability and performance tooling (e.g., eBPF, perf, flame graphs).
What We Offer
- Comprehensive benefits package including medical, dental, and vision insurance.
- 401(k) Savings and Investment Plan with company match.
- Flexible paid time off and 11 paid holidays.
- Paid parental leave and adoption assistance.
- Employee Stock Purchase Plan and financial planning services.
This role offers a unique opportunity to lead innovative AI infrastructure projects at Oracle, with a competitive salary and comprehensive benefits.
Who Will Succeed Here
- An expert in RDMA and HPC technologies with a proven track record of optimizing performance and designing scalable networking solutions for AI workloads, particularly in GPU superclusters.
- A self-motivated individual who thrives in a remote work environment, manages time well, and independently drives complex projects to completion.
- Someone with a deep understanding of observability tools and performance tuning techniques who keeps learning and adapts to new AI technologies and infrastructure innovations.