Principal DevOps/SRE Engineer - Remote Opportunity
About the Role
We are seeking a Principal DevOps/SRE Engineer to join our team in a remote capacity. This role is pivotal in building and owning our reliability practice end-to-end. As a Principal DevOps/SRE Engineer, you will not only respond to incidents but will also formalize effective processes, automate repetitive tasks, and lay the groundwork for enterprise-grade SRE as ELSA expands its B2B footprint.
What You'll Do
- Own the SRE practice: define severity tiers (P1–P4), formalize on-call rotation, build SLA tracking dashboards, and establish incident management workflows across a team of 4 DevOps engineers.
- Build runbooks for the top recurring operational issues — pod scaling, deploy rollbacks, access management, EKS upgrades, CI/CD pipeline failures — and automate L1/L2 responses using tools like Shoreline.io, Rundeck, or PagerDuty automation.
- Introduce and operationalize AI-assisted DevOps tooling: AIOps for alert correlation, CastAI/Kubecost for cost optimization, GitHub Copilot for IaC acceleration. Train the existing team on these tools.
- Drive infrastructure modernization: EKS upgrades, Karpenter migration, observability (SigNoz/Prometheus), secrets management (ArgoCD/SOPS), and Terraform-based IaC maturity.
- Collaborate with AI Engineering, Mobile, and B2B teams to ensure infrastructure supports real-time speech processing, GPU workloads, and multi-region enterprise deployments.
- Design and plan a round-the-clock SRE coverage model as B2B enterprise SLA commitments grow, including evaluating vendor partnerships or strategic hires for Americas timezone coverage.
Requirements
- 2+ years in DevOps/SRE, with at least 2 years in a principal or staff-level role owning reliability practices for a production SaaS product.
- Deep hands-on experience with AWS (EKS, EC2, DynamoDB, S3, IAM, Secrets Manager), Kubernetes (HPA, KEDA, Karpenter, pod scheduling, GPU workloads), and IaC (Terraform, Helm, ArgoCD).
- Track record of building runbooks, on-call rotations, and incident management frameworks — not just participating in them.
- Experience with observability stacks (Prometheus, Grafana, SigNoz or Datadog), CI/CD (GitLab CI, GitHub Actions), and alerting (PagerDuty, Opsgenie).
- Comfort working across timezones with distributed teams (India, Vietnam, Portugal).
- Strong written communication skills — you'll be writing runbooks, RCAs, and proposals as much as Terraform.
Nice to Have
- Experience with AI/ML infrastructure (GPU scheduling, model serving, real-time audio/speech workloads).
- Familiarity with compliance frameworks (ISO 27001, SOC 2) and compliance automation tooling such as Vanta in a DevOps context.
- Hands-on experience with AIOps tooling, automated remediation platforms (Shoreline, Rundeck), or FinOps tools (CastAI, Kubecost).
What We Offer
- Flexible work setup: Remote-first for Singapore, India, Indonesia, Malaysia; hybrid model for Vietnam.
- Comprehensive employee well-being benefits.
- Free ELSA Premium courses to polish your language skills.
- Collaborative, international team culture.
- Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.
This Principal DevOps/SRE Engineer role at ELSA offers a unique opportunity to lead reliability practices in a remote setting, contributing to innovative AI-driven language learning solutions.