Remote Senior Site Reliability Engineer - Infrastructure
About the Role
We are looking for a Remote Senior Site Reliability Engineer to join our team at Underdog Sports. In this role, you will play a critical part in ensuring the reliability and scalability of our infrastructure as we continue to grow. As a founding member of the SRE team, you will help define our approach to operational excellence and reliability. This position offers a unique opportunity to make a significant impact from day one.
What You'll Do
- Own and maintain the incident response process, defining procedures, tools, and best practices.
- Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems.
- Lead capacity planning initiatives, focusing on scalability and performance during peak traffic and game-day spikes.
- Collaborate closely with platform, infrastructure, and product teams to enhance system reliability and developer experience.
- Identify high-leverage reliability challenges and shape our incident response strategy.
Requirements
- 5+ years of experience in Site Reliability Engineering or a related field.
- Strong understanding of incident response processes and best practices.
- Experience with monitoring and alerting tools.
- Proficiency in cloud infrastructure (AWS, GCP, or Azure).
- Excellent problem-solving skills and a proactive approach to challenges.
Nice to Have
- Familiarity with containers and container orchestration (Docker, Kubernetes).
- Experience in capacity planning and performance tuning.
- Knowledge of programming/scripting languages (Python, Go, etc.).
What We Offer
- Competitive salary and performance-based bonuses.
- Flexible remote work environment.
- Health, dental, and vision insurance.
- Generous paid time off and holiday schedule.
- Opportunities for professional development and growth.
This Remote Senior Site Reliability Engineer role at Underdog Sports offers a unique opportunity to shape the company's reliability practices while enjoying competitive pay and flexible work arrangements.
Who Will Succeed Here
You are proficient in managing cloud infrastructure on AWS, GCP, or Azure, ideally with hands-on experience deploying and maintaining scalable applications on Docker and Kubernetes.
You bring a strong analytical mindset and a proven track record in incident response: you diagnose and resolve complex system outages quickly and put effective SLO monitoring in place.
You are self-motivated and comfortable working fully remotely, with the time management skills to balance multiple priorities and deliver operational excellence without direct supervision.