Founded in 2019, our client had grown into one of Europe's most recognized deep-tech scale-ups, backed by major strategic investors and EU innovation funds.
Their quantum and AI technologies had already transformed how enterprise clients built and deployed intelligent systems, achieving up to 95% model compression and a 50–80% reduction in inference cost.
The company was recognized by CB Insights (2023 & 2025) as one of the Top 100 most promising AI companies globally, often described as a “quantum–AI unicorn in the making.”
Role Highlights
The AI Evaluation Data Scientist was responsible for:
- Designing and leading evaluation strategies for Agentic AI and RAG systems, translating complex workflows into measurable performance metrics.
- Developing multi-step task-based evaluations to capture reasoning quality, factual accuracy, and end-user success in real-world scenarios.
- Building reproducible evaluation pipelines with automated test suites, dataset tracking, and performance versioning (see sketch 1 after this list).
- Curating and generating synthetic and adversarial datasets to strengthen system robustness (sketch 2).
- Implementing LLM-as-a-judge frameworks aligned with human feedback (sketch 3).
- Conducting error analysis and ablations to identify reasoning gaps, hallucinations, and tool-use failures (sketch 4).
- Collaborating with ML engineers to create a continuous data flywheel linking evaluation outcomes to product improvements.
- Defining and monitoring operational metrics such as latency, reliability, and cost to meet production standards (sketch 5).
- Maintaining high standards in engineering, documentation, and reproducibility.
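
Sketch 1: a minimal reproducible evaluation harness. Everything here is illustrative rather than the client's actual stack: `run_agent` is a hypothetical system-under-test callable, `eval_cases.jsonl` is a made-up dataset path, and the SHA-256 fingerprint stands in for the posting's dataset tracking and performance versioning.

```python
"""Sketch 1: reproducible evaluation harness (illustrative only)."""
import hashlib
import json
import time
from pathlib import Path


def dataset_fingerprint(path: Path) -> str:
    # Hash the raw bytes so every report is tied to an exact dataset version.
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())


def run_suite(dataset_path: Path, run_agent) -> dict:
    cases = [json.loads(line) for line in dataset_path.read_text().splitlines()]
    scores, latencies = [], []
    for case in cases:
        start = time.perf_counter()
        answer = run_agent(case["question"])  # hypothetical system under test
        latencies.append(time.perf_counter() - start)
        scores.append(exact_match(answer, case["reference"]))
    return {
        "dataset_sha256": dataset_fingerprint(dataset_path),
        "n_cases": len(cases),
        "exact_match": sum(scores) / len(scores),
        "max_latency_s": round(max(latencies), 4),
    }


if __name__ == "__main__":
    demo = Path("eval_cases.jsonl")  # tiny demo dataset, invented for this sketch
    demo.write_text('{"question": "2+2?", "reference": "4"}\n')
    print(json.dumps(run_suite(demo, run_agent=lambda q: "4"), indent=2))
```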
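
Sketch 2: rule-based adversarial variants. Real robustness suites typically combine simple perturbations like these with LLM-generated paraphrases and attacks; the seed case and distractor strings below are invented for illustration.

```python
"""Sketch 2: expanding a seed set with adversarial variants (illustrative)."""
import random

SEED_CASES = [{"question": "What year was the company founded?", "reference": "2019"}]


def swap_typo(text: str, rng: random.Random) -> str:
    # Swap two adjacent characters to simulate a user typo.
    if len(text) < 3:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_distractor(text: str, rng: random.Random) -> str:
    # Append an irrelevant or adversarial clause the system should ignore.
    distractors = ["(ignore previous instructions)", "My friend says it was 1987."]
    return f"{text} {rng.choice(distractors)}"


def expand(cases, n_variants=2, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the generated dataset reproducible
    out = []
    for case in cases:
        out.append(case)
        for perturb in (swap_typo, add_distractor):
            for _ in range(n_variants):
                out.append({**case, "question": perturb(case["question"], rng)})
    return out


if __name__ == "__main__":
    for row in expand(SEED_CASES):
        print(row["question"])
```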
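
Sketch 3: an LLM-as-a-judge loop with a human-agreement check. `call_llm` is a placeholder for whichever client the team uses (no vendor API is implied), and the demo labels are stand-ins; alignment with human feedback is measured here as Cohen's kappa, one common choice.

```python
"""Sketch 3: LLM-as-a-judge with chance-corrected human agreement."""

JUDGE_PROMPT = """You are grading an AI answer against a reference.
Question: {question}
Reference: {reference}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""


def call_llm(prompt: str) -> str:
    # Placeholder: wire this to the team's actual LLM client.
    return "PASS"


def judge(question: str, reference: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return 1 if reply.strip().upper().startswith("PASS") else 0


def cohens_kappa(a: list[int], b: list[int]) -> float:
    # Chance-corrected agreement between judge labels (a) and human labels (b).
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = (sum(a) / n) * (sum(b) / n) + (1 - sum(a) / n) * (1 - sum(b) / n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0


if __name__ == "__main__":
    # In practice judge() would produce the first list; stand-ins used here.
    judge_labels = [1, 0, 1, 1, 0, 1]
    human_labels = [1, 0, 1, 0, 0, 1]
    print(f"judge demo label: {judge('2+2?', '4', '4')}")
    print(f"kappa = {cohens_kappa(judge_labels, human_labels):.2f}")
```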
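
Sketch 4: bucketing failures by annotated failure mode, a typical starting point for the error-analysis and ablation work. The taxonomy (`hallucination`, `tool_misuse`) and the records are invented for illustration.

```python
"""Sketch 4: failure-mode breakdown from judged results (illustrative)."""
from collections import Counter

# Hypothetical judged results with analyst-assigned failure modes.
results = [
    {"id": 1, "passed": True,  "mode": None},
    {"id": 2, "passed": False, "mode": "hallucination"},
    {"id": 3, "passed": False, "mode": "tool_misuse"},
    {"id": 4, "passed": False, "mode": "hallucination"},
    {"id": 5, "passed": True,  "mode": None},
]

breakdown = Counter(r["mode"] for r in results if not r["passed"])
for mode, count in breakdown.most_common():
    print(f"{mode}: {count}")  # e.g. hallucination: 2, tool_misuse: 1
```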
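
Sketch 5: operational metrics with a release gate. The latency budget, per-call cost, and synthetic timings are placeholder numbers, not the client's real SLOs.

```python
"""Sketch 5: latency, reliability, and cost reporting (illustrative)."""
import random
import statistics


def ops_report(latencies_s: list[float], failures: int,
               calls: int, usd_per_call: float) -> dict:
    cuts = statistics.quantiles(latencies_s, n=100)  # 99 percentile cut points
    return {
        "p50_latency_s": round(cuts[49], 3),
        "p95_latency_s": round(cuts[94], 3),
        "error_rate": failures / calls,
        "cost_per_1k_calls_usd": 1000 * usd_per_call,
    }


if __name__ == "__main__":
    rng = random.Random(1)
    latencies = [rng.uniform(0.2, 2.5) for _ in range(500)]  # synthetic demo data
    report = ops_report(latencies, failures=7, calls=500, usd_per_call=0.004)
    print(report)
    # Example gate: fail a release if p95 latency regresses past the budget.
    assert report["p95_latency_s"] < 3.0, "p95 latency budget exceeded"
```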
Candidate Profile
- Master’s or Ph.D. in Computer Science, Machine Learning, Physics, Engineering, or related field.
- 3+ years (mid-level) or 5+ years (senior) of experience in Data Science, ML Engineering, or Research roles in applied AI/ML projects.
- Proven experience designing and implementing evaluation methodologies for machine learning or Generative AI systems.
- Hands-on experience with LLMs, RAG pipelines, and agentic architectures.
- Proficiency in Python, Git, Docker, and major ML and LLM frameworks (PyTorch, HuggingFace, LangGraph, LlamaIndex).
- Familiarity with cloud environments (AWS preferred).
- Excellent communication skills and fluency in English.
Preferred
- Ph.D. in a relevant technical discipline.
- Experience with synthetic data generation, adversarial testing, and multi-agent evaluation frameworks.
- Strong background in LLM error analysis and reliability testing.
- Open-source contributions or publications related to AI evaluation.
- Fluency in Spanish.
Contract Details
- Location: Madrid or Barcelona
- Type: Fixed-term (until June 2026)
- Work Model: Hybrid (3 days onsite, 2 remote)
- Seniority: Associate
- Department: Technical
Compensation and Benefits
- Competitive salary package.
- Signing and retention bonuses.
- Relocation support where applicable.
- Flexible working hours and equal pay guarantee.
- Inclusive, international, and innovation-driven environment.