Senior systems engineer (amer)

Amer (Provincia de La Rioja)

Nscale

Posted 21 hours ago

Description

Location:
United States (Travel Required)
Team:
Infrastructure
Reports to:
Head of Infrastructure
About Us
We are building next-generation AI infrastructure from the ground up. Our mission is to deliver highly performant, reliable, and scalable GPU clusters purpose-built for large-scale AI training and inference.
As a startup, we operate with urgency, ownership, and a bias toward action. We are assembling the foundational infrastructure that will power frontier AI workloads—and we’re looking for engineers who want to build it from zero to scale.
The Role
We are hiring a Senior Deployment Engineer to lead hands‑on bring‑up of GPU clusters across our data center environments. You will own the execution of node, rack, and network deployment, ensuring clusters are validated, performant, and production‑ready.
This role is deeply technical and execution‑focused. You will be in the details—cabling racks, validating firmware, tuning fabrics, debugging performance—and helping us build repeatable processes as we scale.
What You’ll Do
Execute end‑to‑end bring‑up of GPU nodes and racks from installation to production readiness.
Validate BIOS/BMC/firmware configurations and GPU health.
Perform rack‑level integration including power, cabling, and airflow validation.
Bring up and validate high‑speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet).
Network & Performance Validation
Configure and validate leaf/spine network connectivity.
Run cluster‑wide burn‑in and stress testing.
Validate GPU‑to‑GPU and node‑to‑node performance (NCCL, RDMA, GPUDirect).
Troubleshoot hardware, firmware, and fabric‑level issues.
Automation & Process
Contribute to automation for provisioning and cluster validation.
Improve deployment playbooks and documentation.
Identify reliability issues early and drive corrective actions.
Help turn ad‑hoc deployments into repeatable systems.
Cross‑Functional Collaboration
Work closely with networking, systems software, and data center teams.
Coordinate with hardware vendors to resolve bring‑up issues.
Support rapid capacity expansion as we scale.
What We’re Looking For
Required
5–8+ years in infrastructure engineering, hardware deployment, or data center operations.
Hands‑on experience deploying GPU servers (HGX/DGX or similar platforms).
Experience with high‑speed networking (InfiniBand, RoCE, Ethernet fabrics).
Experience troubleshooting distributed systems performance issues.
Comfortable working onsite in data center environments as needed.
Strongly Preferred
Experience in AI/ML infrastructure or HPC environments.
Familiarity with NCCL, CUDA, RDMA.
Automation experience (Python, Ansible, Terraform, Bash).
Experience in high‑density power and cooling environments.
What Success Looks Like
Clusters are brought online quickly and correctly.
Performance baselines meet or exceed expectations.
Deployment processes become faster and more reliable over time.
You help build the foundation for scaled infrastructure growth.