Groupon is a marketplace where customers discover new experiences and services every day, helping local businesses thrive. To date, we have worked with over a million merchant partners worldwide, connecting more than 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms committed to supporting local businesses on a performance basis.
Groupon is on a transformative journey, relentlessly pursuing results. Despite having thousands of employees across multiple continents, we maintain a culture that inspires innovation, rewards risk-taking, and celebrates success. Our scale allows for immediate impact, and our culture fosters autonomy and meaningful contributions at every level.
Principal Site Reliability Engineer
Role Overview:
Are you ready to elevate your expertise and impact the reliability and scalability of mission-critical systems? As a Principal Site Reliability Engineer (SRE Level V/VI), you will ensure the performance, availability, and resilience of our platforms. This role involves leading initiatives that redefine operational excellence, collaborating with diverse teams to implement advanced technologies and best practices, fostering a culture of reliability, and mentoring engineers. It’s an exceptional opportunity for those passionate about solving complex challenges and shaping platform reliability in a high-impact position.
Key Responsibilities:
1. Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher.
2. Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools.
3. Create and optimize CI/CD pipelines for reliable, secure, and efficient software delivery.
4. Build and enhance observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack.
5. Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs.
6. Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues.
7. Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads.
8. Proactively identify and resolve bottlenecks to improve system performance and developer efficiency.
9. Mentor junior engineers, fostering a collaborative and growth-oriented environment.
10. Guide architectural decisions that drive innovation and enhance system reliability.
Qualifications:
1. 10+ years in systems engineering, including at least 5 years in SRE or DevOps roles.
2. Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker).
3. Proficiency in programming and scripting languages like Python, Go, and Bash.
4. Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible.
5. Deep understanding of networking, DNS, load balancing, and security principles.
6. Proven track record of managing high-availability systems in demanding environments.
7. Exceptional analytical and problem-solving skills.
Preferred Qualifications:
1. Certifications in cloud or container technologies (e.g., AWS, GCP, or Azure certifications; Kubernetes CKA).
2. Experience in industries like eCommerce, FinTech, or SaaS.
3. Familiarity with Agile development processes and frameworks.
What We Offer:
1. The opportunity to work with cutting-edge technologies in a transformative environment.
2. A collaborative and innovative work culture that values your expertise and contributions.
3. Professional growth and leadership development pathways tailored to your aspirations.
4. A chance to leave a lasting impact by shaping the future of reliable and scalable systems.
Join us to push the boundaries of platform reliability and drive meaningful change in a fast-evolving digital world!