Description
Under the general direction of the I&O SRE & Technical Process Manager, the Edge Site
Reliability Engineer is responsible for driving continuous improvement in uptime, availability
and reliability, for large-scale automation, and for evolving systems to improve the customer
experience. He/She works in close collaboration with peers in Digital Engineering,
Applications, Operations, Security, Enterprise Architecture, Process Management and
Global SRE to drive service evolution. The outcome is robust operational capabilities combined
with product evolution that delivers results for the business.
The Edge SRE is also responsible for the availability of the digital applications in which
he/she is involved, whether in the cloud or on premises.
He/She might spend ~50% of his/her time in hands-on Operations within the Product Teams
for SLO attainment and the remaining time:
1. Engaging in healthy design debate to achieve a suitable balance between cost
optimization and reliability.
2. Providing expert consulting services within Product Teams to drive the minimization
of manual tasks through mass automation and self-healing capabilities.
3. Leading the postmortem process and outcomes whilst encouraging blameless review
of defects / service impacts and identifying ways to improve.
4. Influencing the adjustment of the end-to-end operations, release processes and
technologies to drive attainment of SLO targets and increase product reliability.
Responsibilities
9. Guarantee overall system uptime, focusing on availability to comply with the defined SLAs, SLOs and SLIs.
10. Define metrics. As applications evolve over time, the Edge SRE is responsible for adapting the SLIs and SLOs accordingly and for identifying significant projects that result in substantial cost savings or revenue.
11. Spend no more than 50% of his/her time on hands-on operational run activities (toil). The remaining time should be focused on reliability, performance and efficiency improvements for Products.
12. Support the Problem Management process and Root Cause Analysis following P1
incidents by promoting:
o Error budget control.
o A postmortem culture of learning from errors.
o Response to security breaches and an incident protocol.
13. Maintain a strong relationship with the security and operations teams to support continuous
improvement of security assessments regarding:
o Patching.
o Vulnerabilities.
o Secrets/Keys/Certificates.
o Compliance (agents/clients installed).
14. Define the release strategy: identify the parties involved and create guidelines for version
control, naming conventions, recommended testing phases and releases.
15. Contribute to new demand assessment by providing technical validation of the
demand and taking charge of the reliability engineering component of the demand.
16. Perform continuous improvement functions such as eliminating toil, learning through chaos
engineering testing, and creating and collaborating on improvement plans. Liaise with
business continuity: help with assessments where required, lead or participate
in DR design, review the runbooks and help prepare chaos engineering tests.
17. Participate in communication strategies, presenting zone technical trends and reports on
his/her function, and help prepare the training path for a new Edge SRE with
recommended readings, practices and training if required.
Maintain and review the technology solutions catalog.
Provide early engagement consulting to discuss specific architectures and design
choices in detail, and to help validate assumptions with the help of targeted
prototypes.
Assist in ensuring that the Infrastructure & Operations practices & processes are
aligned with:
o LafargeHolcim business objectives and priorities (Health & Safety, Communication, Distribution Model, Innovation, ...).
o LafargeHolcim IT infrastructure strategy.
o LafargeHolcim Identity Management Systems.
o LafargeHolcim Business Systems.
o LafargeHolcim IT Security Policies and Directives.
o LafargeHolcim Demand, Project Portfolio and Finance Management Policies
and standards.
Position Requirements
Education/Qualifications
18. Bachelor’s Degree in Information Technology or related discipline.
19. Distinctive qualifications relating to his/her area of expertise.
20. AWS Solutions Architect certification and/or certifications from other public cloud service providers preferred.
Experience
21. Experience working in a DevOps team.
22. Previous experience in this role is desirable.
23. Proven experience collaborating on technical designs oriented to availability and
reliability.
24. Experience in using and integrating cloud solutions.
25. Experience designing for scalability, capacity planning and resource
management.
26. At least 3 years of experience in Cloud and DevOps teams.
27. At least 7 years of experience in Applications, Infrastructure, Storage and Platforms.
Knowledge and Skills
28. Well versed and proficient in automation tools for IaaS / PaaS services such
as:
o Infrastructure as Code (e.g. CloudFormation, Terraform, Azure RM, etc.).
o Markup languages most used in the cloud (YAML, JSON).
o Configuration management tools (e.g. AWS Systems Manager, Ansible,
Chef, Puppet, etc.).
o Scripting for Operations (e.g. Bash, PowerShell, Python, etc.).
o Source control management (Git, Bitbucket, GitLab, GitHub).
o CI/CD orchestration tools (e.g. Bitbucket Pipelines, Jenkins, CircleCI,
GitHub Actions, AWS CodeDeploy, Azure DevOps, etc.).
29. Proficient in operating IaaS services on at least one cloud service provider (AWS preferred, Azure or GCP) in the following domains:
o Compute
o Network
o Storage
o IAM
30. General awareness of containerization technologies (AWS EKS/ECS/ECR, Kubernetes, Docker, etc.)
31. General awareness of microservices/serverless technologies (Lambda, Azure Functions)
32. Knowledge of infrastructure monitoring tools (e.g. CloudWatch, New Relic, Datadog, ELK, Telegraf, Checkmk, Alaloop, AppDynamics, etc.).
33. Vertical knowledge of technology solutions and the ability to learn, understand, and
work quickly with new and emerging technologies, methodologies, and solutions in the Cloud/IT technology space. He/She might participate in proofs of concept and help with or perform technical design if required.
Language Requirements
English and Spanish required
Travel Requirements
Occasional international travel
Other Information
35. Preferred technical/functional skills:
o Knowledge of network and security technologies: SD-WAN (Meraki, VeloCloud), MPLS, Cisco and Nexus switches, Cisco routers, Cisco firewalls (ASA, FPR) and load balancers (e.g. F5, NetScaler).
o Understanding of converged infrastructure (e.g. VMware + Cisco UCS + EMC Storage) and hyperconverged infrastructure (e.g. Nutanix).
o Knowledge of Citrix solutions (namely XenApp).
o Understanding of VoIP and call centre technologies / architectures / operations.
o Identity Access Management: understanding of identity lifecycle, access management, identity federation, provisioning, certification, governance, Active/Google Directory, MFA, anti-virus and security, SAP GRC, SailPoint, Okta, Ping, etc.
o General understanding of distributed systems (e.g. DBaaS, Hadoop-based systems, Kafka, etc.).
o Knowledge of relational databases (Oracle, MS SQL Server, PostgreSQL, MySQL) and non-relational databases (MongoDB, Redshift, Couchbase, etc.).
o Virtualization and containerization technologies (e.g. Kubernetes, Docker, Tanzu, VMware on AWS, etc.).
o SAP systems (BASIS administrators).
o Disaster recovery tools (Druva, CPM, etc.).
o End-to-end monitoring tools (AppDynamics, Dynatrace).