Senior Site Reliability Engineer (SRE)
At Roche, you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted, and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop, and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.
The Position
This role includes on-call duty. You will respond promptly to urgent issues and emergencies outside regular working hours so that critical situations are addressed quickly and effectively.
Who We Are
At Roche, we are passionate about transforming patients’ lives, and we are bold in both decision and action - we believe that good business means a better world. That is why we come to work every single day. We commit ourselves to scientific rigor, unassailable ethics, and access to medical innovations for all. We do this today to build a better tomorrow.
Roche is strongly committed to a diverse and inclusive workplace. We strive to build teams that represent a range of backgrounds, perspectives, and skills. Embracing diversity enables us to create a great place to work and to innovate for patients.
Roche is building a global site reliability engineering (SRE) team that will support commercial and internal solutions. This team will have the mindset of building and creating engineering solutions to solve a broad spectrum of problems.
Step into the Future of IT Infrastructure with Roche!
As a seasoned Site Reliability Engineer (SRE) at Roche, you'll leverage your deep software engineering expertise to propel our IT infrastructure to new heights of robustness, scalability, and reliability. This isn't just a role—it's an invitation to shape the backbone of critical infrastructures and drive our technological innovations forward.
Your Mission
* Design and maintain cutting-edge tools, scripts, and frameworks that automate repetitive tasks, streamline software deployment, and manage expansive systems with unparalleled efficiency.
* Partner closely with development teams to architect and implement high-performance solutions that elevate system efficiency, optimize resource utilization, and enhance deployment processes for superior uptime and user satisfaction.
Your Impact
* Lead incident management and response. Detect system anomalies, troubleshoot swiftly, and conduct root cause analyses to prevent recurring issues.
* Refine monitoring and alerting mechanisms, conduct post-incident reviews, and embed best practices in software lifecycle management to ensure reliability and performance.
By joining our team, you will play a pivotal role in delivering seamless experiences to end-users, exceeding business and customer demands, and solidifying Roche's reputation as a leader in IT innovation.
Your Core Responsibilities
* Proactively monitor and maintain system reliability using tools like DataDog, VictorOps, ELK, Grafana, and Prometheus. Ensure system stability and performance.
* Ensure optimal uptime and performance by swiftly identifying issues and responding to alerts.
* Troubleshoot complex technical issues and collaborate with engineering teams to resolve them.
* Define and track SLIs, and meet the agreed SLOs and SLAs.
* Develop automation scripts (Python or similar) to streamline operations.
* Manage cloud infrastructure across AWS and Azure, implement best practices, and optimize costs.
* Collaborate with engineering, DevOps, security, and operations teams.
* Handle incidents via JIRA and ServiceNow, documenting procedures and lessons learned.
* Work on-call outside normal hours and contribute to team growth and resilience.
Who You Are
* Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience. Advanced degrees are a plus.
* Relevant certifications (AWS/Azure).
* Approximately 5 years of experience in SRE, IT operations, DevOps, or related fields.
* Solid experience with AWS/Azure, Kubernetes, Terraform, monitoring tools, scripting, incident response, troubleshooting, and teamwork.
* Excellent communication skills in English.
Why Join Us?
Join our dynamic environment where your contributions impact service reliability. Opportunities for growth and collaboration with industry leaders await you. Let's drive IT stability together for an exceptional customer experience.
Ready to make a difference? Apply now to be our next Senior Site Reliability Engineer and help us build a more reliable future!
Who we are
A healthier future drives us to innovate. Over 100,000 employees worldwide are dedicated to advancing science, ensuring access to healthcare, and delivering life-changing solutions. Join us in building a healthier future, together.
Roche is an Equal Opportunity Employer.