AI SCORE 8.5 / 10

Senior Site Reliability Engineer - Remote Opportunity

$166K–$220K/year

Kubernetes • Terraform • AWS • Azure • GCP • Prometheus • Grafana • ELK • Go • Python • Rust • Java

About the Role

Anduril Industries is seeking a passionate and experienced Senior Site Reliability Engineer to join our team remotely. In this role, you will be instrumental in building resilient, highly available systems that power our cutting-edge Lattice platform. As a Senior Site Reliability Engineer, you will work closely with platform engineering teams, product developers, and field operations to proactively identify reliability risks and implement strategies that enhance operational excellence.

What You’ll Do

Design and implement comprehensive monitoring, observability, and alerting systems to ensure early detection of reliability issues across the Lattice platform.
Drive incident response and conduct blameless postmortems to identify systemic improvements and prevent recurrence of production issues.
Build and maintain infrastructure automation using tools like Terraform, Kubernetes, and custom tooling to manage large-scale distributed systems.
Establish and track Service Level Objectives (SLOs) and Error Budgets to balance feature velocity with system reliability.
Partner with software engineering teams to improve system architecture for reliability, implementing patterns like circuit breakers and chaos engineering.
Develop capacity planning models and performance testing frameworks to ensure systems can handle growth and peak operational demands.
Create runbooks, documentation, and training materials to enable teams to operate production systems effectively.
Participate in on-call rotations and serve as an escalation point for critical production incidents.

Requirements

7+ years of engineering experience, with at least 3 years focused on SRE, production operations, or infrastructure engineering.
Deep expertise with Kubernetes in production environments, including operational challenges at scale (100+ nodes).
Strong programming skills in one or more languages such as Go, Python, Rust, or Java.
Proven experience designing and implementing observability stacks using tools like Prometheus, Grafana, or ELK.
Hands-on experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code practices.
Demonstrated ability to debug complex distributed systems issues across multiple layers of the stack.
Strong incident management and communication skills, with experience leading responses to critical outages.
Must be a U.S. Person due to required access to U.S. export controlled information or facilities.

Nice to Have

Experience with defense, aerospace, or other mission-critical systems.
Knowledge of chaos engineering principles and experience implementing resilience testing frameworks.
Familiarity with CI/CD platforms and deployment automation.

What We Offer

Comprehensive medical, dental, and vision plans at little to no cost.
Highly competitive PTO plans with a holiday hiatus in December.
Access to free mental health resources 24/7.
Annual reimbursement for professional development.
Relocation assistance available depending on role eligibility.

Language Requirements

English: C1

Why This Job: 8.5 of 10

This role offers a unique opportunity to impact national security while working remotely. With a competitive salary and comprehensive benefits, it stands out in the tech industry.

About Anduril Industries

Industry: Tech
Location: Remote

Who Will Succeed Here

→ Proficiency in managing and orchestrating containerized applications using Kubernetes, with hands-on experience in deploying and scaling applications in cloud environments like AWS or GCP.

→ Strong automation mindset with extensive experience in Infrastructure as Code (IaC) tools like Terraform, enabling efficient and repeatable infrastructure deployment and management.

→ Deep understanding of monitoring and observability tools such as Prometheus and Grafana, coupled with a proactive approach to identifying and mitigating reliability risks in complex systems.

Learning Resources

→ Kubernetes Official Documentation (guide)

→ Terraform on AWS: Getting Started (course)

→ Monitoring Kubernetes with Prometheus and Grafana (article)

Career Path

Senior Site Reliability Engineer (Now) → Lead Site Reliability Engineer (1-2 years) → Site Reliability Engineering Manager (3-5 years)

Market Overview

Market Size 2024: $10.5B
Annual Growth: 35.7%
AI Adoption in SRE: 45%
Investment in Kubernetes Solutions: +150%
Labour Demand for SRE Roles: +25%
Avg Salary for Senior SRE: $130K

Skills & Requirements

Required

Kubernetes, Terraform, AWS

Growing in Demand

Service Mesh (e.g., Istio), Observability Tools (e.g., OpenTelemetry), Container Security (e.g., Aqua Security)

Declining

Traditional Virtualization (e.g., VMware), Bash Scripting

Domain Trends

Increased Adoption of GitOps

Organizations are shifting towards GitOps practices, with 60% of companies adopting these methodologies for managing Kubernetes deployments.

Rise of Multi-Cloud Environments

Over 70% of enterprises are leveraging multi-cloud strategies, necessitating skills in managing Kubernetes across AWS, Azure, and GCP.

Focus on Automation and CI/CD

80% of companies are investing in CI/CD pipelines integrated with Kubernetes, highlighting the need for automation skills in SRE roles.

All job postings are automatically gathered by algorithms. We do not review or verify listings, so be careful when applying, and do not sign in with iCloud or Google services.