AI Evaluation Data Scientist
A fantastic opportunity for a driven AI Data Scientist to join a leading Quantum AI company that works on cutting-edge solutions to make AI faster, greener, and more accessible. You’ll work alongside world-leading experts in quantum computing and AI, with the chance to take on challenging projects and shape the future of Generative AI systems.
This is initially a 9-month fixed-term contract, with scope to extend.
 * Hybrid working from sites in Madrid or Barcelona.
Responsibilities:
 * Design and lead the evaluation strategy for our Agentic AI and RAG systems, turning customer workflows and business needs into measurable metrics and clear success criteria.
 * Contribute to the end-to-end design of Agentic AI and RAG systems, injecting a data-and-evaluation perspective into retrieval strategies, orchestration policies, tool usage, and memory to solve complex, real-world problems across industries.
 * Develop task-based, multi-step evaluations that reflect how the different components of our systems (retrieval, planning, tool use, memory) perform in real-world scenarios across cloud and edge deployments.
 * Develop and refine rigorous evaluation frameworks that reflect real-world performance, going beyond model benchmarks to assess task success, reasoning capabilities, factual consistency, reliability, and user success metrics across diverse problem domains.
 * Build and maintain a reproducible evaluation pipeline, including datasets, scenarios, configs, test suites, versioned assets, and automated runs to track regressions and improvements over time.
 * Curate and generate high-quality datasets for evaluation, including synthetic and adversarial data, to strengthen coverage and robustness.
 * Implement and calibrate LLM-as-a-judge evaluations, aligning automated scoring with human feedback and ensuring fairness, robustness, and representativeness.
 * Perform deep error analyses and ablations to uncover failure patterns, maintain a taxonomy of failure modes (reasoning, grounding, hallucinations, tool failures), and provide actionable insights to engineers to improve model and system performance.
 * Partner with ML specialists to create a data flywheel, where evaluation continuously informs new dataset creation, improvements to prompts, tool usage, model training, and system refinements, quantifying improvements over time.
 * Define and monitor operational metrics (latency, cost, reliability) to ensure evaluations align with production and customer expectations.
 * Maintain high engineering standards, including clear documentation, reproducible experiments, robust version control, and well-structured ML pipelines.
 * Contribute to team learning and mentorship, guiding junior engineers and sharing expertise in LLM development, evaluation, and deployment best practices.
 * Participate in code reviews, offering thoughtful, constructive feedback to maintain code quality, readability, and consistency.
Required Minimum Qualifications:
 * Master's or Ph.D. in Computer Science, Machine Learning, Data Science, Physics, Engineering, or related technical fields, with relevant industry experience.
 * Solid hands-on experience (3+ years for mid-level, 5+ years for senior) working as a Data Scientist, ML Engineer, or Research Scientist in applied AI / ML projects deployed in production environments.
 * Strong background in evaluation of machine learning systems, ideally with experience in LLMs, RAG pipelines, or multi-agent systems.
 * Proven ability to design and implement evaluation methodologies that go beyond static benchmarks, capturing real-world task success, reasoning, and robustness.
 * Hands-on experience with dataset creation and curation (including synthetic data generation) for training and evaluation.
 * Proven experience with agent-based architectures (task decomposition, tool use, reasoning workflows), RAG architectures (retrievers, vector databases, rerankers), and orchestration frameworks (LangGraph, LlamaIndex).
 * Strong problem-solving skills, with the ability to navigate ambiguity and design practical solutions to open-ended user or business needs.
 * Strong software engineering skills, with proficiency in Python, Docker, Git, and experience building robust, modular, and scalable ML codebases.
 * Familiarity with common ML and data libraries and frameworks (e.g., PyTorch, HuggingFace, LangGraph, LlamaIndex, Pandas).
 * Experience with cloud platforms (ideally AWS).
 * Fluent in English.
By applying to this role, you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice (https://eu-recruit.com/wp-content/uploads/2020/04/Privacy-Notice.pdf).