Remote Software Reliability Engineer for AI
About the Role
We are looking for a Remote Software Reliability Engineer for AI to join our team at MixMode. In this role, you will improve the reliability and performance of our production AI systems and keep them running dependably as workloads and environments change.
What You'll Do
- Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
- Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
- Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
- Design and build monitoring, alerting, and debugging tools for high-availability services.
- Partner with researchers and ML engineers to productionize models at scale.
- Establish best practices for testing, deployment, capacity management, and incident response.
- Collaborate with cross-functional teams to ensure alignment on project goals and deliverables.
Requirements
- 5+ years of experience as a Software Reliability Engineer or in a similar role.
- Strong understanding of distributed systems, microservices architecture, and cloud technologies.
- Experience with Kubernetes, Docker, and CI/CD pipelines.
- Proficiency in programming languages such as Python, Go, or Java.
- Familiarity with machine learning concepts and frameworks.
- Excellent problem-solving skills and the ability to work in a fast-paced environment.
- Strong communication skills and the ability to work collaboratively.
Nice to Have
- Experience in AI or cybersecurity domains.
- Knowledge of observability tools such as Prometheus or Grafana.
- Familiarity with agile methodologies.
What We Offer
- Competitive salary ranging from $140,000 to $180,000 per year.
- Fully remote work environment with flexible hours.
- Opportunities for professional growth and development.
- Health, dental, and vision insurance.
- Generous paid time off and holiday policy.
- Collaborative and innovative work culture.
- Access to cutting-edge technology and tools.
This role offers a unique opportunity to work at the intersection of AI and cybersecurity, with a focus on enhancing system reliability and performance in a fully remote environment.
Who Will Succeed Here
- Proficiency in Python and experience with frameworks such as Flask or FastAPI for developing and maintaining reliable AI services.
- Strong understanding of Kubernetes and Docker for container orchestration and deployment of microservices in a cloud environment.
- Experience with CI/CD pipelines and distributed systems, with a proactive approach to automating testing and deployment while working remotely.