Senior SRE Specialist

Santander

Intellias

Posted 10 hours ago

Description

Location: Remote from Spain (permanent Spanish employment contract)

Project Intro

We are a publicly traded FinTech company that runs mobile, web and desktop platforms that help our clients trade stocks and shares, leveraged products, futures and options, and crypto.

We are ambitious. Over 340,000 people already use our platforms. We’re global, with offices in 18 countries and products in 16 regions. We’re hungry to move faster, ship better products for our customers and grow our user base. We believe in high autonomy, and we want people who are looking to do things differently in order to create better experiences for our customers.

We work in cross-functional teams and are laser-focused on increasing the number of active clients we serve to drive sustainable growth.

Your team

The SRE Team comprises highly skilled software engineers dedicated to embedding performance and reliability into our trading platform. You'll work with cutting-edge distributed systems handling high-throughput, low-latency trading operations that demand zero downtime.

As a Site Reliability Engineer, you'll champion reliability patterns, improve observability, establish 24/7 operations, and drive operational excellence across our crypto trading platform infrastructure and associated applications.

Requirements
* Java development experience - Must be able to read, write, and instrument Java code. Deep understanding of JVM internals and experience with complex distributed Java applications
* Observability & Instrumentation - Hands‑on experience with OpenTelemetry, distributed tracing concepts (spans, trace context propagation), and observability platforms such as Honeycomb, Datadog, Dynatrace, Splunk or Grafana. Strong understanding of OpenTelemetry Collector pipelines, including data transformation, enrichment, and labeling, use of processors (attributes, resource, transform, span, tail sampling), and propagation of custom business identifiers (e.g., customer/tenant/transaction IDs) across services to enable end‑to‑end trace correlation between heterogeneous systems, applications, and environments.
* SLO/SLI Expertise - Proven experience defining SLOs based on SLIs, establishing error budgets, and working with development teams on reliability measurement
* Reliability Patterns - Solid understanding of circuit breakers, retry logic, bulkheads, and other fault tolerance patterns
* Cloud (AWS & Kubernetes Platform Engineering) - Strong hands‑on experience with AWS as the primary cloud provider, including production workloads on Amazon EKS. Proven expertise in Kubernetes networking, covering ingress and egress controllers (e.g., ALB / NGINX / Envoy), service configuration and fine‑tuning (requests/limits, HPA/VPA, pod disruption budgets, network policies), and traffic management. Demonstrated ability to investigate and optimize performance and reliability using metrics, logs, and traces, complemented by chaos engineering practices (fault injection, node/pod failures, network latency, dependency outages) to validate system resilience and high availability under real‑world failure scenarios.
* Message Brokers - Production experience with ActiveMQ, Kafka, or similar messaging systems
* Containerization - Hands‑on experience with container orchestration (Nomad experience is advantageous, Kubernetes acceptable)
* CI/CD - Experience building and maintaining deployment pipelines, preferably with GitLab
Experience Requirements
* Track record in high‑throughput, production environments (financial services, trading platforms, or similar mission‑critical systems preferred)
* Demonstrated ability to improve system reliability and performance at scale
* Experience working collaboratively with development teams to implement observability and reliability improvements
* Strong troubleshooting skills in distributed systems environments
Core Competencies
* Systems thinking approach to problem‑solving
* Excellent communication skills for cross‑functional collaboration and technical enablement
* Ability to balance hands‑on development work with operational responsibilities
* Strong bias toward automation and eliminating manual toil
* Comfortable working in a fast‑paced environment with evolving requirements
Responsibilities

System Reliability & 24/7 Operations
* This role excludes on-call support.
* Implement comprehensive monitoring and observability using OpenTelemetry and distributed tracing
* Establish and maintain 24/7 operational readiness including automated deployments, blue/green releases, and zero‑downtime patching strategies
* Define and track Service Level Objectives (SLOs) and Error Budgets for critical crypto trading services
* Identify and eliminate single points of failure in distributed systems
Application Instrumentation & Observability
* Instrument Java applications with OpenTelemetry traces (spans) and metrics
* Work hands‑on with development teams to add observability to their code
* Guide teams on implementing meaningful SLIs that reflect user experience
Technical Leadership & Enablement
* Partner with development teams on system design, capacity planning, and architectural reviews
* Provide technical guidance and hands‑on support for teams transitioning from traditional deployments to containerized infrastructure
* Mentor developers on reliability patterns including circuit breakers, retry logic, and fault tolerance
* Lead by example: write production code that demonstrates SRE best practices
Software Development & Automation
* Write clean, maintainable code in Java and Python following industry best practices
* Build automation tools and CI/CD pipelines that embed reliability practices
* Contribute to application codebases to implement instrumentation and reliability patterns
* Apply software engineering discipline including version control, code reviews, and testing
Why this position
* Crypto trading operates 24/7/365 with no maintenance windows. Your work will directly impact our ability to provide continuous, reliable service to clients trading in volatile markets where every second of downtime has significant business impact.