For a general IT Infrastructure Services client, we are looking for an HPC Infrastructure & Scheduler Integration Engineer to design, build, and operate a PBS-based high-performance computing platform.
This is a 3-month contract with the possibility of extension.
You will work remotely with occasional travel to Stockholm, Sweden (must be eligible to travel).
This role focuses on integrating compute, storage, and orchestration layers with the scheduler, ensuring reliable job execution, efficient scaling, and seamless integration with modern platforms such as cloud, Kubernetes, and MLOps tools.
The Role
Developing and maintaining scheduler integrations, including hooks, prolog/epilog scripts, and custom automation
Automating the full job lifecycle from submission through execution to teardown
Designing and managing HPC environments across bare metal, virtualized, and hybrid cloud setups
Integrating the scheduler with storage systems (e.g. Lustre), networking (InfiniBand/Ethernet), and identity services (LDAP/Kerberos)
Bridging HPC workloads with modern platforms such as Kubernetes, MLOps frameworks, and cloud bursting solutions
Optimizing scheduling performance, resource allocation, and cluster utilization
Implementing observability (logging, metrics, dashboards) and supporting incident response and root cause analysis
Skills Required
Core Skills
Experience with HPC schedulers (PBS Pro/OpenPBS preferred; Slurm/Torque acceptable)
Proficiency in scripting and automation (Python and Bash required; Go or Rust a plus)
Solid understanding of distributed systems and cluster operations
HPC Expertise
Experience with MPI workloads (Open MPI, MPICH)
Familiarity with GPU scheduling (NVIDIA stack, MIG/MPS)
Knowledge of parallel file systems (Lustre strongly preferred)
Understanding of scheduling concepts (queues, priorities, backfill, fairshare, reservations)
Infrastructure & Integration
Experience with configuration management (Ansible, Puppet, etc.)
Exposure to CI/CD for infrastructure and API-driven integrations
Familiarity with cloud platforms and hybrid HPC architectures
Preferred Experience
Building custom PBS hooks or scheduler extensions in production
Designing hybrid HPC + Kubernetes or cloud bursting solutions
Operating at scale (10k+ cores, multi-petabyte storage)
Experience with security/compliance frameworks (e.g. NIST, STIGs)