Description
As a Senior Observability Engineer at Kyndryl’s AI Innovation Hub, you’ll be at the core of operational excellence for next-generation intelligent and agentic systems.
Your mission will be to design, implement, and maintain advanced observability and monitoring capabilities that ensure the reliability, traceability, and performance of AI agents and models in production.
You’ll help build the observability architecture for agentic intelligence — integrating tracing, logging, monitoring, and governance tools that provide a deep understanding of how agents perceive, reason, and act in complex environments.
Your work will enable early detection of anomalies, data drift, performance degradation, bias, or undesired agent behavior, ensuring compliance with the EU AI Act and Responsible AI principles.
If you’re passionate about bridging AI systems with operational intelligence, and about creating frameworks that make AI transparent, accountable, and trustworthy, this role offers a unique opportunity to shape the future of intelligent observability.
Your Mission
- Design and implement the observability architecture for AI and Agentic systems, enabling end-to-end visibility across models, agents, and data pipelines.
- Develop instrumentation frameworks to collect and analyze technical, behavioral, and cognitive metrics for deployed AI systems.
- Integrate and configure monitoring, tracing, and logging tools (Prometheus, Grafana, OpenTelemetry, ELK Stack, Datadog, etc.) to ensure full operational insight.
- Build dashboards and alerting mechanisms to detect data drift, performance issues, hallucinations, or reasoning inconsistencies in LLMs and agents.
- Collaborate with MLOps, Data, and Architecture teams to establish model lineage, drift detection, and governance pipelines.
- Design and maintain custom metrics for model and agent reliability: precision, latency, cost, reasoning depth, autonomy, and consistency.
- Contribute to the Responsible AI framework, ensuring transparency, fairness, and auditability in AI decision-making.
- Continuously research and experiment with new observability tools and practices (AgentOps, LLMOps, RAG Observability).
Who You Are
Essential Qualifications
- 4+ years of professional experience, including at least 2 years in AI, MLOps, or distributed systems projects.
- Proven experience designing and implementing monitoring, logging, and performance metrics for production systems.
- Hands-on expertise with observability tools such as Prometheus, Grafana, OpenTelemetry, ELK Stack, Loki, Jaeger, or Datadog.
- Experience instrumenting AI and ML pipelines, tracking inference latency, throughput, and cost metrics.
- Familiarity with MLOps and LLMOps frameworks, including model traceability, drift detection, and prompt or reasoning tracing.
- Knowledge of agentic frameworks (LangGraph, AutoGen, CrewAI, OpenDevin, Google ADK) and their monitoring needs.
- Experience designing custom metrics for precision, reliability, error rate, and cognitive consistency.
- Strong understanding of cloud-native architectures, containers, and IaC tools (Kubernetes, Docker, Helm, Terraform).
- Awareness of AI compliance and governance requirements (EU AI Act, Responsible AI, decision traceability).
Education & Certifications
- Bachelor’s degree in Computer Engineering, Software Engineering, Data Science, or related field.
- Postgraduate or specialized training in MLOps, DevOps, Observability, or Artificial Intelligence is highly valued.
- Certifications in Cloud Architecture, Monitoring, or AI Governance are a plus.
- Continuous learning mindset and commitment to staying current with emerging AI observability frameworks.
Preferred Skills
- Experience with model observability and data lineage systems.
- Understanding of cognitive observability, including reasoning-chain or decision-path tracing in agents.
- Familiarity with event-driven architectures and telemetry for real-time AI services.
- Knowledge of FinOps metrics and cost optimization for AI workloads.
- Experience developing custom dashboards or visualization plugins for monitoring complex systems.
- Comfort working in hybrid or multi-cloud environments (Azure, AWS, GCP).
- Strong interest in AI reliability engineering and the convergence of AI and DevOps practices.
Soft Skills
- Analytical and systemic thinker, understanding the interplay between data, systems, and agent behavior.
- Clear communicator, able to convey complex insights and performance findings to both technical and business audiences.
- Quality- and reliability-driven, with a preventive mindset focused on operational resilience.
- Collaborative and cross-functional, working seamlessly with AI, data, and compliance teams.
- Curious and proactive, exploring emerging technologies and methods in AI observability and AgentOps.
- Ethical and responsible, aware of the implications and accountability of automated decisions in production AI.
Being You
Diversity is a whole lot more than what we look like or where we come from; it’s how we think and who we are.
We welcome people of all cultures, backgrounds, and experiences.
But we’re not doing it single-handedly: our Kyndryl Inclusion Networks are only one of many ways we create a workplace where all Kyndryls can find and provide support and advice.
This dedication to welcoming everyone into our company means that Kyndryl gives you – and everyone next to you – the ability to bring your whole self to work, individually and collectively, and support the activation of our equitable culture.
That’s the Kyndryl Way.
What You Can Expect
With state-of-the-art resources and Fortune 100 clients, every day is an opportunity to innovate, build new capabilities, new relationships, new processes, and new value.
Kyndryl cares about your well-being and prides itself on offering benefits that give you choice, reflect the diversity of our employees and support you and your family through the moments that matter – wherever you are in your life journey.
Our employee learning programs give you access to the best learning in the industry to receive certifications, including Microsoft, Google, Amazon, Skillsoft, and many more.
Through our company-wide volunteering and giving platform, you can donate, start fundraisers, volunteer, and search over 2 million non-profit organizations.
At Kyndryl, we invest heavily in you. We want you to succeed so that together, we will all succeed.
Required Skill Profession
Computer Occupations