AI Evaluation Data Scientist - AI/ML/LLM - (Hybrid) - Madrid

Madrid

European Tech Recruit

€75,000 per year

Posted 15 hours ago

Description

AI Evaluation Data Scientist

A fantastic opportunity for a driven AI Data Scientist to join a leading Quantum AI company that works on cutting-edge solutions to make AI faster, greener, and more accessible. You’ll work alongside world-leading experts in quantum computing and AI, with the opportunity to take on challenging projects and shape the future of Generative AI systems.

This is initially a 9-month fixed-term contract, with scope to extend.

* Hybrid working from sites in Madrid or Barcelona.

Responsibilities:

* Design and lead the evaluation strategy for our Agentic AI and RAG systems, turning customer workflows and business needs into measurable metrics and clear success criteria.
* Contribute to the end-to-end design of Agentic AI and RAG systems, injecting a data-and-evaluation perspective into retrieval strategies, orchestration policies, tool usage, and memory to solve complex, real-world problems across industries.
* Develop task-based, multi-step evaluations that reflect how the different components of our systems (retrieval, planning, tool use, memory) perform in real-world scenarios across cloud and edge deployments.
* Develop and refine rigorous evaluation frameworks that reflect real-world performance, going beyond model benchmarks to assess task success, reasoning capabilities, factual consistency, reliability, and user success metrics across diverse problem domains.
* Build and maintain a reproducible evaluation pipeline, including datasets, scenarios, configs, test suites, versioned assets, and automated runs to track regressions and improvements over time.
* Curate and generate high-quality datasets for evaluation, including synthetic and adversarial data, to strengthen coverage and robustness.
* Implement and calibrate LLM-as-a-judge evaluations, aligning automated scoring with human feedback and ensuring fairness, robustness, and representativeness.
* Perform deep error analyses and ablations to uncover failure patterns, maintain a taxonomy of failure modes (reasoning, grounding, hallucinations, tool failures), and provide actionable insights to engineers to improve model and system performance.
* Partner with ML specialists to create a data flywheel, where evaluation continuously informs new dataset creation and improvements to prompts, tool usage, model training, and system refinements, quantifying improvements over time.
* Define and monitor operational metrics (latency, cost, reliability) to ensure evaluations align with production and customer expectations.
* Maintain high engineering standards, including clear documentation, reproducible experiments, robust version control, and well-structured ML pipelines.
* Contribute to team learning and mentorship, guiding junior engineers and sharing expertise in LLM development, evaluation, and deployment best practices.
* Participate in code reviews, offering thoughtful, constructive feedback to maintain code quality, readability, and consistency.

Required Minimum Qualifications:

* Master's or Ph.D. in Computer Science, Machine Learning, Data Science, Physics, Engineering, or related technical fields, with relevant industry experience.
* Solid hands-on experience (3+ years for mid-level, 5+ years for senior) working as a Data Scientist, ML Engineer, or Research Scientist in applied AI / ML projects deployed in production environments.
* Strong background in evaluation of machine learning systems, ideally with experience in LLMs, RAG pipelines, or multi-agent systems.
* Proven ability to design and implement evaluation methodologies that go beyond static benchmarks, capturing real-world task success, reasoning, and robustness.
* Hands-on experience with dataset creation and curation (including synthetic data generation) for training and evaluation.
* Proven experience with agent-based architectures (task decomposition, tool use, reasoning workflows), RAG architectures (retrievers, vector databases, rerankers), and orchestration frameworks (LangGraph, LlamaIndex).
* Strong problem-solving skills, with the ability to navigate ambiguity and design practical solutions to open-ended user or business needs.
* Strong software engineering skills, with proficiency in Python, Docker, Git, and experience building robust, modular, and scalable ML codebases.
* Familiarity with common ML and data libraries and frameworks (e.g., PyTorch, HuggingFace, LangGraph, LlamaIndex, Pandas).
* Experience with cloud platforms (ideally AWS).
* Fluent in English.

By applying to this role, you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice (https://eu-recruit.com/wp-content/uploads/2020/04/Privacy-Notice.pdf).
