About Avaya
Avaya is an enterprise software leader that helps the world's largest organizations and government agencies forge unbreakable connections.
The Avaya Infinity platform unifies fragmented customer experiences, connecting the channels, insights, technologies, and workflows that together create enduring customer and employee relationships.
We believe success is built through strong connections – with each other, with our work, and with our mission. At Avaya, you'll find a community that values your contributions and supports your growth every step of the way.
Learn more at
Description
Role Overview
We are seeking a Site Reliability Engineer (SRE) who will drive stability, reliability, and performance across our Azure-based platforms. This role blends operational excellence, proactive incident management, and strong collaboration with DevOps, Cloud, and Security teams.
The ideal candidate will have hands-on experience with Azure, IaC (Terraform/Ansible), CI/CD (Jenkins/GitHub Actions), and monitoring systems, while also contributing to governance, cost optimization, and automation strategies that reduce toil and prevent issues before they occur.
This position includes 24x7 support coverage (rotational) and requires strong ownership in managing major incidents, RCA processes, and continuous service improvements.
Key Responsibilities
Reliability & Incident Management
- Serve as a key member of the 24x7 on-call rotation, responding to and managing incidents across production and pre-production environments.
- Lead incident bridges, coordinate root cause analysis (RCA), and ensure post-incident reviews drive systemic improvements.
- Maintain clear communication with cross-functional teams and leadership during major incidents.
Monitoring, Alerts & Prevention
- Build, tune, and maintain observability dashboards (Azure Monitor, Prometheus, Grafana, Datadog, Log Analytics).
- Define SLOs, SLIs, and error budgets to proactively identify and mitigate risks before customer impact.
- Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery.
- Analyze trends to prevent recurring issues and support teams in resilience engineering.
Governance & Cost Management
- Support cloud governance frameworks—ensuring resource tagging, naming conventions, policy compliance, and operational guardrails.
- Work with FinOps and DevOps teams to track, optimize, and report cost efficiency across Azure subscriptions.
- Participate in change, release, and CAB reviews to ensure reliability, compliance, and readiness.
Infrastructure Awareness & Automation
- Understand IaC designs (Terraform, Ansible) and deployment workflows, leveraging automation for operational efficiency.
- Collaborate with DevOps and platform teams to reduce manual effort and "toil" in infrastructure operations.
- Implement self-healing mechanisms and automate recurring operational tasks where feasible.
- Ensure consistency and compliance through integration with CI/CD and policy-as-code systems.
Security & Compliance
- Embed DevSecOps practices in daily operations—monitor vulnerabilities, patch non-compliant resources, and validate certificate rotations.
- Support FIPS, FedRAMP, PCI, and CIS control implementations in cloud and containerized environments.
Collaboration & Agile Practices
- Partner with engineering, QA, and product teams to align reliability goals with delivery outcomes.
- Participate in agile ceremonies and advocate for SRE principles—"measure everything, automate where possible, and reduce toil."
- Document runbooks, playbooks, and operational processes to improve team efficiency and knowledge sharing.
Requirements
Required Skills & Experience
- 5+ years in Site Reliability, DevOps, or Cloud Operations roles.
- Proven expertise in Azure cloud operations and distributed system reliability.
- Strong understanding of Terraform, Ansible, and CI/CD pipelines (Jenkins, GitHub Actions).
- Experience with observability tools (Azure Monitor, Grafana, Prometheus, Datadog, or similar).
- Solid grasp of incident management frameworks (P1–P3 handling, RCA, PIRs, on-call rotations).
- Familiarity with governance, cost management, and security best practices in multi-cloud environments.
- Knowledge of containerized deployments (AKS/Kubernetes) and networking fundamentals.
- Excellent analytical, troubleshooting, and communication skills.
Desired Behaviours
- Proactive Prevention: Identifies risks before they escalate into incidents.
- Accountability: Owns service reliability and communicates with clarity.
- Collaboration: Works seamlessly with platform, DevOps, and product teams.
- Efficiency: Focuses on automation to reduce manual effort and improve MTTR.
- Continuous Improvement: Learns from failures, iterates processes, and improves documentation.
- Security & Governance Mindset: Balances agility with control and compliance.
Experience
3 years of experience at the Engineer Two level, or 5–8 years of total experience
Education
Bachelor's degree or equivalent experience; Master's degree or equivalent experience
Footer
Applicants must be currently authorized to work in the United States without the need for visa sponsorship now or in the future.
Avaya is an Equal Opportunity employer and a U.S. Federal Contractor. Our commitment to equality is a core value of Avaya. All qualified applicants and employees receive equal treatment without consideration for race, religion, sex, age, sexual orientation, gender identity, national origin, disability, status as a protected veteran or any other protected characteristic. In general, positions at Avaya require the ability to communicate and use office technology effectively. Physical requirements may vary by assigned work location. This job brief/description is subject to change. Nothing in this job description restricts Avaya's right to alter the duties and responsibilities of this position at any time for any reason.