At Hays, we are collaborating with a global manufacturer of metal components for the automotive industry, present in 22 countries with more than 40,000 employees. This industrial environment generates large volumes of data across heterogeneous sources (SAP ERP, SCADA systems, MES platforms, IoT sensors, relational databases, and document stores) that must be reliably extracted, normalised, and made available for analytics, reporting, and AI workloads.
We are currently looking for a Data Extraction Engineer who specialises in moving data out of complex source systems into an Azure-based lakehouse reliably and at scale. You will own the extraction and ingestion layer end to end, building robust pipelines that connect their operational systems to the Bronze layer of their Medallion architecture on Databricks.
What are the requirements?
~ 4+ years of hands-on experience building data extraction and ingestion pipelines in production environments.
~ Python proficiency for pipeline scripting, custom Airflow operators, and data validation logic.
~ Strong SQL skills across multiple engines: SQL Server, PostgreSQL, Azure SQL, and SAP HANA (CDS Views a strong plus).
~ Practical experience extracting from and integrating MongoDB: change streams, oplog-based CDC, schema flexibility handling.
~ Proficiency with Apache Airflow: authoring complex DAGs, managing dependencies, handling retries, and monitoring pipeline health.
~ Solid understanding of CDC patterns: log-based vs. query-based, Debezium, Kafka connectors, and Azure Event Hubs integration.
~ Experience landing data into cloud lakehouse architectures (Azure Data Lake + Delta Lake/Databricks).
~ Advanced English communication skills (spoken and written).
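To give a flavour of the work: the query-based CDC pattern listed above boils down to tracking a high-water mark per source table and pulling only rows changed since the last run. A minimal sketch in Python (table and column names are hypothetical, for illustration only):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Watermark:
    """Last successfully extracted change timestamp for one source table."""
    table: str
    last_seen: datetime

def build_incremental_query(table: str, cursor_column: str, wm: Watermark) -> str:
    """Query-based CDC: select only rows modified after the stored watermark."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {cursor_column} > '{wm.last_seen.isoformat()}' "
        f"ORDER BY {cursor_column}"
    )

# Hypothetical SAP-sourced table; in production the watermark would be
# persisted (e.g. in a control table) and advanced after each batch commit.
wm = Watermark("sap.material_movements", datetime(2024, 5, 1, tzinfo=timezone.utc))
print(build_incremental_query("sap.material_movements", "changed_at", wm))
```

Log-based CDC (Debezium, oplog tailing) replaces the polling query with a stream of change events, but the watermark bookkeeping above is still what makes reruns safe.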
Nice to have
Experience extracting data from SAP (ABAP SDK, CDS Views, BAPI/RFC, OData) — highly valued.
Familiarity with industrial data sources: SCADA historians, MES systems, OPC-UA, MQTT.
Knowledge of dbt for downstream transformation layers.
Experience in multi-plant or multi-country enterprise environments.
Background in the automotive or discrete manufacturing sector.
What will your responsibilities be?
Data Extraction & Ingestion (core focus)
Design and implement data extraction pipelines from SAP HANA (CDS Views, ABAP SDK), relational databases (SQL Server, PostgreSQL, Azure SQL), document stores (MongoDB), and SCADA/MES/IoT platforms handling high‑frequency time‑series data.
Build and maintain CDC pipelines from SAP and operational systems into Azure Data Lake Storage Gen2 using Apache Kafka / Azure Event Hubs.
Develop API-based connectors for SaaS platforms (HR systems, quality tools, third‑party suppliers) according to business needs.
Ensure accurate and complete ingestion into the Bronze layer (Delta format) with full metadata, lineage, audit trails, and extraction logging.
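As a rough illustration of the Bronze-layer requirement above, each raw record is typically wrapped with audit and lineage metadata before landing in Delta. A minimal sketch (field names `_source_system`, `_ingested_at`, `_extraction_id` are hypothetical conventions, not the client's actual schema):

```python
import uuid
from datetime import datetime, timezone

def to_bronze_record(raw: dict, source_system: str) -> dict:
    """Attach audit/lineage columns to a raw row before writing to Bronze.

    The original payload is kept untouched; metadata columns are prefixed
    with '_' so downstream Silver transformations can strip them easily.
    """
    return {
        **raw,
        "_source_system": source_system,   # lineage: where the row came from
        "_ingested_at": datetime.now(timezone.utc).isoformat(),  # audit trail
        "_extraction_id": str(uuid.uuid4()),  # ties the row to one pipeline run
    }

rec = to_bronze_record({"order_id": 42, "plant": "MAD"}, source_system="sap_hana")
print(sorted(k for k in rec if k.startswith("_")))
```

In practice the same metadata would be written as Delta table columns and surfaced in extraction logs for reconciliation.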
Pipeline Orchestration & Reliability
Orchestrate ingestion workflows using Apache Airflow (MWAA or AKS), ensuring pipelines are idempotent, observable, and operationally robust.
Implement monitoring, alerting, and SLA tracking to ensure data freshness and pipeline reliability.
Manage incremental versus full-load strategies tailored to source system load, data volume, and latency requirements.
Collaborate with Data Architecture teams to align the Bronze layer with downstream Silver and Gold processing on Databricks.
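The reliability expectations above (idempotent tasks, retries, observability) are usually configured in Airflow itself, but the underlying pattern can be shown without it. A minimal retry-with-backoff sketch in plain Python (names and delays are illustrative, not a production setting):

```python
import time

def with_retries(max_attempts: int = 3, base_delay: float = 0.01):
    """Retry a flaky extraction task with exponential backoff.

    Safe only if the wrapped task is idempotent: rerunning it must not
    duplicate data (e.g. overwrite-by-partition or merge-by-key writes).
    """
    def decorate(task):
        def run(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return task(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # surface the failure to alerting/SLA tracking
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return run
    return decorate

calls = {"n": 0}

@with_retries(max_attempts=3)
def extract_batch():
    """Simulated source read that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return "batch-ok"

print(extract_batch(), "after", calls["n"], "attempts")
```

In Airflow the same behaviour would come from task-level `retries` and `retry_exponential_backoff` settings rather than a hand-rolled decorator.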
Data Quality at Source
Apply data quality validations at ingestion, including completeness checks, schema and type validation, referential integrity, and duplicate detection.
Document source system schemas, data models, data dictionaries, and known data quality issues.
Work closely with IT teams and system owners to manage source system constraints, maintenance windows, and schema changes.
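Two of the ingestion-time checks named above, completeness and duplicate detection, can be sketched in a few lines of Python (row shape and field names are hypothetical):

```python
def validate_batch(rows: list[dict], required: list[str], key: str) -> list[tuple[int, str]]:
    """Ingestion-time quality checks on one extracted batch.

    Returns (row_index, problem) pairs for rows with missing required
    fields (completeness) or a repeated business key (duplicate detection).
    """
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        missing = [f for f in required if row.get(f) in (None, "")]
        if missing:
            issues.append((i, f"missing fields: {missing}"))
        k = row.get(key)
        if k in seen:
            issues.append((i, f"duplicate key: {k}"))
        seen.add(k)
    return issues

rows = [
    {"order_id": 1, "plant": "MAD"},
    {"order_id": 2, "plant": ""},   # incomplete row
    {"order_id": 1, "plant": "BCN"},  # duplicate business key
]
for idx, problem in validate_batch(rows, required=["order_id", "plant"], key="order_id"):
    print(idx, problem)
```

Schema/type validation and referential integrity checks would extend the same pattern, with failing rows quarantined rather than silently dropped.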
Collaboration & Documentation
Partner with AI/ML and AI Agents teams to deliver curated datasets for analytics, RAG, and machine‑learning workloads.
Produce and maintain technical documentation, including extraction specifications, data contracts, pipeline runbooks, and ADRs.
Participate in architecture reviews and contribute to the data platform roadmap and continuous improvement initiatives.
What do we offer?
Remote work model, with the possibility of an on-site/hybrid arrangement.
Location: Madrid (Headquarters).
Full‑time freelance/permanent contract with Hays.
We are looking for professionals like you: passionate about technology and eager to take on a new challenge. If this sounds like you, apply for the position and we will share more details with you!