Principal DevOps/SRE Engineer - Remote Opportunity

$120K–$150K/year

AWS • Kubernetes • Terraform • Prometheus • Grafana • GitLab CI • GitHub Actions • AIOps

About the Role

We are seeking a Principal DevOps/SRE Engineer to join our team in a remote capacity. This role is pivotal in building and owning our reliability practice end-to-end. As a Principal DevOps/SRE Engineer, you will not only respond to incidents but will also formalize effective processes, automate repetitive tasks, and lay the groundwork for enterprise-grade SRE as ELSA expands its B2B footprint.

What You'll Do

Own the SRE practice: define severity tiers (P1–P4), formalize on-call rotation, build SLA tracking dashboards, and establish incident management workflows across a team of 4 DevOps engineers.
Build runbooks for the top recurring operational issues — pod scaling, deploy rollbacks, access management, EKS upgrades, CI/CD pipeline failures — and automate L1/L2 responses using tools like Shoreline.io, Rundeck, or PagerDuty automation.
Introduce and operationalize AI-assisted DevOps tooling: AIOps for alert correlation, CastAI/Kubecost for cost optimization, GitHub Copilot for IaC acceleration. Train the existing team on these tools.
Drive infrastructure modernization: EKS upgrades, Karpenter migration, observability (SigNoz/Prometheus), secrets management (ArgoCD/SOPS), and Terraform-based IaC maturity.
Collaborate with AI Engineering, Mobile, and B2B teams to ensure infrastructure supports real-time speech processing, GPU workloads, and multi-region enterprise deployments.
Design and plan a round-the-clock SRE coverage model as B2B enterprise SLA commitments grow, including evaluating vendor partnerships or strategic hires for Americas timezone coverage.

Requirements

2+ years in DevOps/SRE, with at least 2 years in a principal or staff-level role owning reliability practices for a production SaaS product.
Deep hands-on experience with AWS (EKS, EC2, DynamoDB, S3, IAM, Secrets Manager), Kubernetes (HPA, KEDA, Karpenter, pod scheduling, GPU workloads), and IaC (Terraform, Helm, ArgoCD).
Track record of building runbooks, on-call rotations, and incident management frameworks — not just participating in them.
Experience with observability stacks (Prometheus, Grafana, SigNoz or Datadog), CI/CD (GitLab CI, GitHub Actions), and alerting (PagerDuty, Opsgenie).
Comfort working across timezones with distributed teams (India, Vietnam, Portugal).
Strong written communication skills — you'll be writing runbooks, RCAs, and proposals as much as Terraform.

Nice to Have

Experience with AI/ML infrastructure (GPU scheduling, model serving, real-time audio/speech workloads).
Familiarity with compliance frameworks (ISO 27001, SOC 2, Vanta) in a DevOps context.
Hands-on experience with AIOps tooling, automated remediation platforms (Shoreline, Rundeck), or FinOps tools (CastAI, Kubecost).

What We Offer

Flexible work setup: Remote-first for Singapore, India, Indonesia, Malaysia; hybrid model for Vietnam.
Comprehensive employee well-being benefits.
Free ELSA Premium courses to polish your language skills.
Collaborative, international team culture.
Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.
