● Design, deploy, and maintain Kubeflow (or equivalent) for pipeline orchestration, model training, evaluation, and serving on large image datasets; ensure reliability, security, and cost efficiency.
● Manage and tune Kubernetes clusters (EKS/GKE/AKS), set up namespaces, RBAC, autoscaling, network policies, and service meshes where appropriate; keep upgrades and operations predictable.
● Define infrastructure-as-code with Terraform; implement repeatable environment provisioning, configuration management, and golden paths for teams.
● Establish CI/CD workflows (GitHub Actions/Jenkins/GitLab CI), build/test standards, and progressive delivery patterns that keep releases fast and low-risk.
● Implement logging, metrics, and tracing (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) with actionable SLOs, alerts, and runbooks; embed security and compliance by design.
● Collaborate closely with product and science teams to remove bottlenecks, eliminate manual steps, and evolve service and data interfaces that make operating image pipelines simple and reliable.
● Contribute to future-state architectures that improve scalability, resiliency, and operational efficiency; lead targeted refactors and platform improvements.
● Manage core automation and tooling, and educate teams on platform capabilities, CI/CD, configuration management, and infrastructure automation best practices.
Required (Must-have):
● M.Sc. in Computer Science/Engineering (or equivalent) or comparable industry experience.
● Practical, production experience operating Kubeflow Pipelines for reproducible ML workflows at scale.
● Proven experience deploying and operating workloads on Kubernetes (EKS/GKE/AKS), including upgrades, autoscaling, RBAC, networking, and reliability; strong Unix/Linux fundamentals.
● Hands-on experience with AWS services (EKS, EC2, S3, IAM, CloudWatch; RDS a plus) and the ability to design secure, cost-aware architectures.
● Strong Terraform skills and Git-based workflows for repeatable infrastructure provisioning and configuration management.
● Practical experience with CI/CD platforms (GitHub Actions/Jenkins/GitLab CI), including artifact management, environment promotion, and progressive delivery.
● Solid Python and/or shell scripting for platform automation and toil reduction.
● Experience implementing logging, metrics, and tracing with SLOs, alerts, and runbooks (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) and a security-first mindset.
● Ability to lead technical initiatives, communicate trade-offs clearly, and collaborate effectively with engineering and science teams.
Desirable (Nice to have):
● Experience with MLflow, Feast, Argo, Airflow, Ray, and model versioning/monitoring.
● Familiarity with S3/object storage, artifact registries, and handling large image datasets; basic SQL/NoSQL exposure.
● Experience with digital pathology or large-scale image processing (e.g., whole-slide images) and tools like OpenSlide, scikit-image, or OpenCV.
● Experience tuning high-throughput pipelines, concurrency, memory usage, and integrating GPUs/accelerators.
● Experience with VPC design, ingress/egress, service meshes, secrets management, IAM, and policy as code.
● Experience in regulated environments (e.g., GxP), including data governance, privacy, and building software under regulated processes.
● Experience with Jira/Zendesk and with JavaScript/TypeScript for internal tools or dashboards.