Senior AI/LLM Engineer - Remote LATAM

Córdoba

Braintrust

Posted 2 hours ago

Description

Company

Braintrust is a global talent network that connects top independent professionals with leading companies for high-quality, flexible work. We help organizations hire skilled talent faster while giving professionals access to vetted opportunities with innovative teams.

Job description

About this role

We are building Altitude Intelligence: the AI engine that turns Reunion Marketing's strategy, live client performance data, and market data into one queryable system of record for our automotive dealer clients. Over the next 12 months it powers three named outputs:

- AI Client Summaries: client-ready performance narratives drafted from live data across KeyLift (SEO), Endeavor (LLM/answer-engine visibility), LocalEyes (Google Business Profile), Adapt (Paid Advertising), and Market Data.
- Proactive Performance Visibility: anomalies and at-risk accounts routed to the SEO, Paid, GEO, GA4, and Client Success teams before the client sees them.
- Market Intelligence: dealer-level reports that frame each market (Backyard / Battleground / Rival), quantify share, inventory, and visibility gaps, and land on a single strategic recommendation.

This role owns the LLM application layer of that engine: the system that ingests strategy and data, reasons over it, and produces the right artifact for the right audience through the right surface. You are the person contributing (and defending) the architectural calls: what to retrieve and how, when to use prompting vs. tool-use vs. graph-augmented retrieval, where the human belongs in the loop, what "ready for clients" means, and what we measure to prove it.

This is a senior, hands-on role. It is meant for someone who has built and operated production LLM systems before and is comfortable owning the quality bar end-to-end.

Why this is a rare opportunity

Reunion has the kind of AI surface area most engineers want but rarely get:

- Real data. Five product datasets, 15+ third-party APIs, Redshift, HubSpot, unstructured strategy documents, market data, and client history.
- Real users. Client Success, SEO, Paid, GEO, Sales, Marketing, and external clients will consume the outputs through Altitude, HubSpot email, Slack, and Claude.ai.
- Real consequences. If the system is wrong, the failure does not stay inside a demo. It can show up in a client conversation.
- A clear 12-month target. The business outcomes are defined. The product direction is clear. The hard work is the engineering architecture, evaluation, reliability, and quality system that makes it production-grade.
- Open architecture with serious constraints. The high-level shape exists: Knowledge + Context → MCP → reasoning engine → review → distribution. The direction is clear, but many of the important implementation decisions are still open, and this hire will help make them.

What you'll own

1. LLM application architecture

You will design and own the full LLM pipeline, from raw data and strategy documents to client-ready summaries, alerts, and recommendations. This includes:

- Versioned input and output schemas per product
- Prompt management as governed, reviewable artifacts
- Dynamic payload generation from live APIs, warehouse tables, and client context
- Structured output generation
- Product-specific context assembly
- Client-specific history and strategy context
- Output templates for different teams and audiences

2. Retrieval and knowledge architecture

You will design the retrieval system that grounds the model in Reunion’s strategy library, product data, client history, and market context. This is a hybrid retrieval problem. The answer will not be “just vector search.” You will decide when to use:

- SQL
- Vector search
- Lexical search
- Reranking
- Graph retrieval
- Tool calls
- Composed retrieval pipelines

You will also help define the knowledge graph layer for relationship-heavy problems such as:

- Dealer ↔ market ↔ competitor
- Inventory ↔ campaign ↔ keyword
- Citation ↔ answer-engine visibility
- Client ↔ product ↔ performance trend
- Market ↔ ZIP ↔ segment ↔ opportunity

The goal is not to use a graph because it sounds interesting. The goal is to know exactly when graph-augmented retrieval is the right tool and when it is not.

3. Agentic workflows and orchestration

You will design and ship the multi-step reasoning workflows behind:

- Monthly and quarterly AI Client Summaries
- Opportunity identification
- Market re-evaluation
- Inventory and sales alignment
- At-risk account flagging
- Anomaly explanation and routing

These workflows need to be durable, observable, retryable, replayable, and cancellable. They cannot be fragile request handlers with hidden state and unclear failure modes. The orchestration substrate is still open. You will have a strong voice in that decision.

4. MCP tool surfaces

You will stand up and operate the MCP servers that expose Reunion’s knowledge and client context to the reasoning engine. This includes:

- Knowledge MCP: methodology, frameworks, strategy library, product context
- Context MCP: client data from HubSpot, Altitude, Redshift, and connected product systems

You will own the production characteristics:

- Typed inputs and outputs
- Auth and tenant scoping
- Rate limiting
- Structured responses
- Observability
- Graceful degradation
- Clear failure handling

5. Evaluation, quality, and feedback loops

You will build the evaluation system end-to-end. That includes:

- Offline experiments
- Regression suites in CI
- Online scoring of production traces
- Factuality checks
- Structured output validation
- Voice and tone fidelity
- Cost and latency tracking
- Safety and client-readiness gates

You will also close the loop between human feedback and system improvement. Reviewer edits, Client Success feedback, Gong tags, and Pendo engagement should become structured signal that improves prompts, retrieval, datasets, and evaluations.

6. Human-in-the-loop review

You will help design how strategists review, edit, approve, regenerate, or reject AI-generated outputs. The review experience needs to answer practical questions:

- What should a strategist see before approving?
- What should be editable?
- What should trigger regeneration?
- What should block distribution?
- How do edits become labeled data?
- How does the approval route to Altitude, HubSpot email, Slack, or other surfaces?

The editor and review stack are still open decisions.

7. Production reliability

You will own the hard parts of production LLM systems:

- Hallucinations
- Retrieval misses
- Tool-use failures
- Structured-output drift
- Prompt drift
- Schema changes
- Cost spikes
- Latency regressions
- Prompt-cache invalidation
- Model deprecations
- Bad outputs that need rollback

This role requires someone who instruments before guessing, writes decisions down, and understands that production AI reliability is an engineering discipline.

8. Architectural decision-making

You will be expected to make clear, evidence-backed calls on:

- Prompting vs. RAG
- RAG vs. GraphRAG
- Tool-use vs. direct generation
- Fine-tuning vs. better retrieval and evaluation
- Human review vs. automated release
- Fast iteration vs. client-safe release controls

Strong opinions are welcome. Unsupported opinions are not.

The stack (what's decided, what's open)

Decided / strongly anchored:

- Anthropic Claude as the primary reasoning engine.
- MCP (Model Context Protocol) as the tool/context surface between knowledge, per-client data, and the reasoning engine.
- HubSpot, Slack, and Altitude as distribution surfaces; Claude.ai as an internal power-user surface.

Open (you will have a voice):

- Application language and framework (TypeScript / Node, Python, or both).
- LLM application SDK / framework choice.
- Durable workflow orchestration substrate (Inngest is a leading candidate; alternatives are on the table).
- Vector store choice (pgvector, Turbopuffer, Pinecone, Weaviate, or similar).
- Knowledge-graph database (Neo4j is the leading candidate; alternatives are on the table).
- Evaluation platform (commercial vs. open-source vs. homegrown).
- Human-in-the-loop editor / review surface.

If you have strong, evidence-backed opinions on any of these, that is a feature of your candidacy, not a bug.

What we're looking for

- 6+ years of software engineering experience, with 2+ years shipping production LLM systems that are customer-facing, with real consequences when they break. Internal demos and proofs of concept do not count.
- Strong in a modern application language used for production LLM systems: TypeScript/Node, Python, or both. You've built non-trivial, well-structured production systems, not glue scripts.
- Deep, hands-on familiarity with the LLM stack:
  - Prompt design and structured output generation
  - Retrieval architecture: vector, lexical, hybrid, reranking, freshness, multi-tenant scoping
  - Tool-use and agent design: when to use one, when not to
  - Evaluation as an engineering discipline (offline + online + regression gating)
  - A defensible point of view on fine-tuning, including when not to do it
- Production experience with knowledge graphs (Neo4j or comparable) for at least one of: entity resolution, multi-hop reasoning, GraphRAG / graph-augmented retrieval, or relationship-rich domain modeling. You can articulate when a graph is the right answer and when it isn't.
- Has designed and shipped an evaluation suite that caught a real regression in production, and can walk through what it caught, why it caught it, and what shipped (or didn't) because of it.
- Production debugging instincts for LLM failures: you have specific failure modes you've fixed and a reflex for instrumenting before guessing.
- Durable workflow orchestration in production: Inngest, Temporal, Step Functions, Airflow, or similar. You understand why multi-step LLM workflows belong in a durable runtime.
- MCP or equivalent tool-use surface in production. You've either built an MCP server or shipped a comparable tool-use layer and are ready to consolidate on MCP.
- Track record of owning the quality bar on a multi-team AI product, including the unglamorous parts: incident response, rolling back a prompt change, defending a "not yet" decision to stakeholders who wanted to ship.
- Strong written and architectural communication. You can write a one-page design proposal that a non-AI engineer, a strategist, and a CTO can all read and respond to.

Nice to have

- Production experience with Anthropic Claude specifically: tool-use, citations, prompt caching, extended thinking, vision.
- GraphRAG or similar graph-augmented retrieval in production, including hybrid pipelines combining graph + vector + relational retrieval.
- Entity resolution / alias matching at scale: the kind of problem where precision and recall both matter and the ground truth is ambiguous.
- Schema-as-code workflows: typed schemas registered, versioned, and governed in source control.
- Multi-tenant client data, especially marketing/CRM data (HubSpot, Gong, Pendo), and the access-control patterns that come with it.
- Domain experience in marketing analytics, SEO/SEM, paid media, local search, or LLM answer-engine visibility. You will not be learning the domain from scratch.
- Experience building or operating a system of record that an organization actually runs on, not just a feature inside one.

First 6-12 months — what success looks like

This is the shape, not a contract; exact ordering will be set with the team in the first 30 days.

Q3 Foundations. Architecture decisions made and written down: orchestration substrate, retrieval modalities (vector + graph + relational), MCP server boundaries, evaluation strategy, review surface. Knowledge MCP server and Context MCP server are live. First AI Client Summary running end-to-end on real data for one product, gated by an evaluation suite in CI.

Breadth. AI Client Summaries shipping across all five products (KeyLift, Endeavor, LocalEyes, Adapt, Market Data) with structured output packs (Wins / Opportunities / Risks / Next Steps). Knowledge-graph layer in production for entity resolution and relationship-aware context assembly. Online evaluation running on every production trace. Review and approval flow live for Client Success.

Q4 Visibility. Proactive Performance Visibility live: anomaly explanations grounded in the data, routed to the correct owning team (SEO / Paid / GEO / GA4 / CS). Feedback from human reviewers measurably improving downstream output. LLM answer-engine visibility (the Endeavor surface) producing actionable signal end-to-end.

Q1 '27 Intelligence. Market Intelligence flows running against live market data with benchmarks alongside. Continuous online evaluation, automated regression gating in CI, and a documented feedback loop with at least one feature whose measured quality has improved beyond its launch baseline. The system is operable, instrumented, and improving on its own cadence.

How we'll work together

- Decisions are written. Architectural choices come with a short written proposal, the alternatives considered, and the evidence. We optimize for being able to revisit decisions later without re-litigating them from memory.
- Evals are first-class. No feature ships without an eval. No regression is acceptable just because it's old.
- The human is in the loop on purpose, not by accident. Where strategists review, edit, or override the AI, that is by design, and the signal it produces is treated as a primary asset.
- We move fast where the cost of being wrong is low, and slow where it lands in front of a client. You will be expected to know the difference.

How we'll evaluate you

We will ask you to walk through one production LLM system you owned end-to-end, with the following specifics:

- The evaluation suite: what it caught, what it missed, and what changed because of it.
- A failure mode you debugged in production: what your instrumentation told you, what your instinct told you, and which was right.
- A "no, not yet" call you made and defended to stakeholders who wanted to ship.
- An architectural decision where you chose among prompting, retrieval, tool-use, graph-augmented retrieval, and fine-tuning, and why.

If those four stories are real and you can defend them, we will move fast.
