Senior machine learning operations engineer

Xebia

Posted 7 hours ago

Description

About Xebia

With over 20 years of experience, our global network of passionate technologists and pioneering craftsmen delivers cutting-edge technology and game-changing consulting to companies on the brink of transformation. Since 2001, we have grown from a Java company into a full-service digital consulting company with 5,500+ professionals working on a worldwide ambition.

We are organized in complementary service lines – teams with a tremendous amount of knowledge and experience within a particular field, such as Agile, DevOps, Data and AI, Cloud, Software Technology, Functional Programming, Intelligent Automation, and Microsoft.

We help the world’s top 250+ companies and category leaders overcome digital challenges, embrace innovation, adopt new technology, and implement new business models. In addition to high-quality consulting, we also provide offshoring and nearshoring services.

For more details, please visit

About the role

As an MLOps Platform Engineer at Xebia, you will serve as a Senior MLOps consultant embedded with our enterprise clients to transform their ML infrastructure operations. This role combines deep platform engineering expertise with client-focused delivery, working closely with other Xebians and client engineering teams to build and optimize foundational ML infrastructure. You'll operate within client environments that require exceptional technical independence and strong client relationship management skills.

Responsibilities:

* Architect large-scale build systems supporting the client's complex ML model training and deployment pipelines.
* Design and implement HPC integration solutions utilizing the Slurm REST API for the client's cloud-native ML workloads on Google Cloud Platform (GCP).
* Optimize performance for the client's containerized workloads, including large containers and long-running ML processes.
* Implement client infrastructure using Terraform on GCP, integrating with existing Vertex AI and GKE environments.
* Lead client infrastructure migration initiatives from Azure DevOps to GitHub Actions, working within client timelines and constraints.
* Collaborate closely with client engineering teams to ensure platform operations meet their specific business requirements and onboarding needs.

Requirements:

Basics:

* 5+ years of platform engineering experience with a proven client delivery track record in consulting environments.
* Strong expertise in GCP infrastructure and Vertex AI platform deployment.
* Proficiency in Python backend services development and optimization techniques.
* Hands-on experience with Kubernetes orchestration and GKE in enterprise environments.
* Proven experience with Infrastructure as Code using Terraform for client implementations.
* Experience delivering large-scale build systems and CI/CD pipeline optimization.
* Knowledge of containerization strategies for complex ML workloads and performance optimization.
* Experience with Git-driven development workflows and GitHub Actions migration projects.
* Strong client communication and stakeholder management skills for technical consulting engagements.
* Demonstrated ability to work independently within client environments.

Recommended:

* Understanding of HPC environments and job scheduling systems (Slurm REST API experience preferred).
* Experience with PyTorch deployment patterns and optimization in enterprise ML environments.
* Background in ML platform components implementation (Feature Stores, Model Templates, Job Services).
* Client-facing experience with database performance optimization for ML workloads.
* Understanding of LLM API integrations and GenAI infrastructure in regulated environments.
