Senior Site Reliability Engineer - Remote Opportunity
About the Role
Anduril Industries is seeking a passionate and experienced Senior Site Reliability Engineer to join our team remotely. In this role, you will be instrumental in building resilient, highly available systems that power our cutting-edge Lattice platform. As a Senior Site Reliability Engineer, you will work closely with platform engineering teams, product developers, and field operations to proactively identify reliability risks and implement strategies that enhance operational excellence.
What You’ll Do
- Design and implement comprehensive monitoring, observability, and alerting systems to ensure early detection of reliability issues across the Lattice platform.
- Drive incident response and conduct blameless postmortems to identify systemic improvements and prevent recurrence of production issues.
- Build and maintain infrastructure automation using tools like Terraform, Kubernetes, and custom tooling to manage large-scale distributed systems.
- Establish and track Service Level Objectives (SLOs) and error budgets to balance feature velocity with system reliability.
- Partner with software engineering teams to improve system architecture for reliability, implementing patterns like circuit breakers and chaos engineering.
- Develop capacity planning models and performance testing frameworks to ensure systems can handle growth and peak operational demands.
- Create runbooks, documentation, and training materials to enable teams to operate production systems effectively.
- Participate in on-call rotations and serve as an escalation point for critical production incidents.
Requirements
- 7+ years of engineering experience, with at least 3 years focused on SRE, production operations, or infrastructure engineering.
- Deep expertise with Kubernetes in production environments, including operational challenges at scale (100+ nodes).
- Strong programming skills in one or more languages such as Go, Python, Rust, or Java.
- Proven experience designing and implementing observability stacks using tools like Prometheus, Grafana, or ELK.
- Hands-on experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code practices.
- Demonstrated ability to debug complex distributed systems issues across multiple layers of the stack.
- Strong incident management and communication skills, with experience leading responses to critical outages.
- Must be a U.S. Person due to required access to U.S. export-controlled information or facilities.
Nice to Have
- Experience with defense, aerospace, or other mission-critical systems.
- Knowledge of chaos engineering principles and experience implementing resilience testing frameworks.
- Familiarity with CI/CD platforms and deployment automation.
What We Offer
- Comprehensive medical, dental, and vision plans at little to no cost.
- Highly competitive PTO plans with a holiday hiatus in December.
- Access to free mental health resources 24/7.
- Annual reimbursement for professional development.
- Relocation assistance available depending on role eligibility.
This role offers a unique opportunity to impact national security while working remotely. With a competitive salary and comprehensive benefits, it stands out in the tech industry.
Who Will Succeed Here
- Proficiency in managing and orchestrating containerized applications using Kubernetes, with hands-on experience in deploying and scaling applications in cloud environments like AWS or GCP.
- Strong automation mindset with extensive experience in Infrastructure as Code (IaC) tools like Terraform, enabling efficient and repeatable infrastructure deployment and management.
- Deep understanding of monitoring and observability tools such as Prometheus and Grafana, coupled with a proactive approach to identifying and mitigating reliability risks in complex systems.