Senior Site Reliability Engineer - Remote Opportunity
About the Role
Join Mistral AI as a Senior Site Reliability Engineer in a remote role. You will play a crucial part in shaping the reliability, scalability, and performance of our platform and customer-facing applications, working closely with our software engineers and research teams to ensure our systems meet and exceed the expectations of our internal and external customers.
What You'll Do
- Design, build, and maintain scalable, highly available, and fault-tolerant infrastructures to support our web services and machine learning workloads.
- Ensure our platform, inference, and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters.
- Operate systems and troubleshoot issues in production environments, including on-call responses and infrastructure scaling.
- Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
- Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes and Terraform.
- Collaborate with AI/ML researchers to develop solutions that enable safe and reproducible model-training experiments.
- Document processes and procedures to ensure consistency and knowledge sharing across the team.
- Contribute to open-source projects, research publications, and blog articles.
Requirements
- Master’s degree in Computer Science, Engineering, or a related field.
- 7+ years of experience in a DevOps/SRE role.
- Strong experience with cloud computing and highly available distributed systems.
- Hands-on experience with CI/CD, containerization, and orchestration tools (Docker, Kubernetes).
- Proficiency in scripting languages (Python, Go, Bash) and knowledge of software development best practices.
- Excellent problem-solving and communication skills.
- Self-motivated and able to work well in a fast-paced startup environment.
Nice to Have
- Experience in an AI/ML environment.
- Experience with high-performance computing (HPC) systems and workload managers.
- Experience working with modern AI-oriented solutions.
What We Offer
- Competitive salary and equity options.
- Health insurance coverage.
- Transportation and sport allowances.
- Meal vouchers and private pension plan.
- Generous parental leave policy.
- Visa sponsorship available.
This role offers a unique opportunity to work with cutting-edge AI technology in a fully remote setting. Mistral AI is committed to innovation and collaboration.
Who Will Succeed Here
- Proficiency with Docker and Kubernetes for deploying and managing containerized applications, ensuring seamless scalability and reliability in cloud environments.
- Strong experience with Terraform for infrastructure as code (IaC) to automate and manage cloud resources, demonstrating a proactive approach to system configuration and deployment.
- A problem-solving mindset with a focus on monitoring and incident response, using tools like Prometheus or Grafana to analyze system performance and ensure uptime in a fully remote work environment.