Senior Software Reliability Engineer for AI - Remote Position
About the Role
We are seeking a Senior Software Reliability Engineer for AI to join our team at MixMode. In this remote position, you will improve the reliability and performance of our production AI systems and keep them running smoothly as workloads and environments change. Organizations rely on AI for their cybersecurity solutions, so this work has a direct impact on the industry.
What You'll Do
- Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
- Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
- Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
- Design and build monitoring, alerting, and debugging tools for high-availability services.
- Partner with researchers and ML engineers to productionize models at scale.
- Establish best practices for testing, deployment, capacity planning, and incident response.
Requirements
- 5+ years of experience as a Software Reliability Engineer or in a similar role.
- Strong understanding of distributed systems, Kubernetes, and cloud environments.
- Experience with AI/ML systems and their operational challenges.
- Proficiency in programming languages such as Python, Go, or Java.
- Familiarity with monitoring tools and practices (e.g., Prometheus, Grafana).
- Excellent problem-solving skills and ability to work collaboratively with cross-functional teams.
Nice to Have
- Experience in cybersecurity or related fields.
- Knowledge of data pipeline technologies (e.g., Apache Kafka, Spark).
- Familiarity with CI/CD practices and tools.
What We Offer
- Competitive salary ranging from $140,000 to $180,000 annually.
- Flexible remote work environment.
- Opportunities for professional development and growth.
- Collaborative and innovative team culture.
- Health and wellness benefits.
This role offers a unique opportunity to work at the intersection of AI and cybersecurity, focusing on enhancing system reliability in a fully remote environment.
Who Will Succeed Here
- Deep expertise in Kubernetes orchestration and management, with a proven track record of deploying and maintaining containerized applications in production environments.
- Strong programming skills in Python and Go, with the ability to write efficient, scalable code for monitoring and reliability tooling in distributed systems.
- A proactive mindset for continuous improvement, with experience implementing monitoring tools and practices that enhance system reliability and performance in a remote work setting.