ELSA, Corp
17.04.26

Principal DevOps/SRE Engineer - Remote Opportunity

$120K–$150K/year

About the Role

We are seeking a Principal DevOps/SRE Engineer to join our team in a remote capacity. This role is pivotal in building and owning our reliability practice end-to-end. As a Principal DevOps/SRE Engineer, you will not only respond to incidents but will also formalize effective processes, automate repetitive tasks, and lay the groundwork for enterprise-grade SRE as ELSA expands its B2B footprint.

What You'll Do

  • Own the SRE practice: define severity tiers (P1–P4), formalize on-call rotation, build SLA tracking dashboards, and establish incident management workflows across a team of 4 DevOps engineers.
  • Build runbooks for the top recurring operational issues — pod scaling, deploy rollbacks, access management, EKS upgrades, CI/CD pipeline failures — and automate L1/L2 responses using tools like Shoreline.io, Rundeck, or PagerDuty automation.
  • Introduce and operationalize AI-assisted DevOps tooling: AIOps for alert correlation, CastAI/Kubecost for cost optimization, GitHub Copilot for IaC acceleration. Train the existing team on these tools.
  • Drive infrastructure modernization: EKS upgrades, Karpenter migration, observability (SigNoz/Prometheus), secrets management (ArgoCD/SOPS), and Terraform-based IaC maturity.
  • Collaborate with AI Engineering, Mobile, and B2B teams to ensure infrastructure supports real-time speech processing, GPU workloads, and multi-region enterprise deployments.
  • Design and plan a round-the-clock SRE coverage model as B2B enterprise SLA commitments grow, evaluating vendor partnerships or strategic hires for Americas timezone coverage.

Requirements

  • 2+ years in DevOps/SRE, with at least 2 years in a principal or staff-level role owning reliability practices for a production SaaS product.
  • Deep hands-on experience with AWS (EKS, EC2, DynamoDB, S3, IAM, Secrets Manager), Kubernetes (HPA, KEDA, Karpenter, pod scheduling, GPU workloads), and IaC (Terraform, Helm, ArgoCD).
  • Track record of building runbooks, on-call rotations, and incident management frameworks — not just participating in them.
  • Experience with observability stacks (Prometheus, Grafana, SigNoz or Datadog), CI/CD (GitLab CI, GitHub Actions), and alerting (PagerDuty, Opsgenie).
  • Comfort working across timezones with distributed teams (India, Vietnam, Portugal).
  • Strong written communication skills — you'll be writing runbooks, RCAs, and proposals as much as Terraform.

Nice to Have

  • Experience with AI/ML infrastructure (GPU scheduling, model serving, real-time audio/speech workloads).
  • Familiarity with compliance frameworks (ISO 27001, SOC 2, Vanta) in a DevOps context.
  • Hands-on experience with AIOps tooling, automated remediation platforms (Shoreline, Rundeck), or FinOps tools (CastAI, Kubecost).

What We Offer

  • Flexible work setup: Remote-first for Singapore, India, Indonesia, Malaysia; hybrid model for Vietnam.
  • Comprehensive employee well-being benefits.
  • Free ELSA Premium courses to polish your language skills.
  • Collaborative, international team culture.
  • Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.
