
Building a RAG Pipeline That Actually Works

Bolting a vector database onto an LLM gives you a demo. Getting it to answer real questions over real documents is an engineering problem — chunking, retrieval, reranking, and knowing when not to retrieve at all. Here’s the pipeline that survives production.

Dhileep Kumar · 8 min read

The first RAG demo you build feels like magic. You drop some documents into a vector database, embed the user’s question, pull back the closest chunks, hand them to the model, and out comes a grounded answer with citations. Twenty lines of code and your LLM suddenly knows about your data. Then you point it at real documents and real questions, and the magic curdles: it cites the wrong passage, misses the obvious answer, or confidently makes something up anyway.

Retrieval-augmented generation is one of the most common production patterns in AI right now, and one of the most commonly botched. The demo version, “naive RAG”, is genuinely easy; the production version is genuinely hard; and the two look almost identical in a diagram. The gap between them is entirely engineering: how you split documents, how you retrieve, how you rank, and how you decide whether to retrieve at all. Close that gap and RAG is one of the highest-leverage things you can build.

What RAG actually does

Strip away the acronym and RAG is a single idea: don’t expect the model to know everything; look up the relevant facts first and put them in front of it. The model’s job shifts from “recall” to “read and synthesize” — and reading a passage you handed it is something language models are very good at. The whole pipeline exists to get the right passages into the context window at the right moment.

  • Ingest and chunk. Split your documents into passages small enough to be precise but large enough to stand alone. This unglamorous step decides more about quality than any model choice.
  • Embed and index. Turn each chunk into a vector with an embedding model and store it in a vector database, so you can search by meaning rather than just keywords.
  • Retrieve. Embed the user’s question the same way, find the nearest chunks, and — the part most people skip — rerank them so the best passage is actually first.
  • Generate. Hand the top chunks to the model with the question and an instruction to answer only from the provided context, ideally with citations.
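
To make the first three stages concrete, here is a minimal, dependency-free sketch. Everything in it is illustrative: `embed` is a toy bag-of-words counter standing in for a real embedding model, `InMemoryStore` is a hypothetical stand-in for a vector database, and `ingest` simply splits on blank lines. The generation call is left out because it depends on whichever model client you use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        # Embed and index: store the chunk next to its vector.
        self.items.append((chunk, embed(chunk)))

    def search(self, query_vec: Counter, top_k: int = 3) -> list[str]:
        # Retrieve: rank every stored chunk by similarity to the query.
        ranked = sorted(self.items, key=lambda it: cosine(query_vec, it[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


def ingest(document: str) -> list[str]:
    """Ingest and chunk: split on blank lines so each paragraph stands alone."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


doc = """Refunds are issued within 14 days of purchase.

To cancel a subscription, open Settings and choose Cancel plan.

Support is available by email on weekdays."""

store = InMemoryStore()
for chunk in ingest(doc):
    store.add(chunk)

# The question is embedded with the same function used for the chunks.
hits = store.search(embed("How do I cancel my subscription?"), top_k=1)
print(hits[0])
```

A word-count vector only matches on shared vocabulary, so this toy would miss a passage titled “ending your subscription”. That gap is exactly what a real embedding model is there to close.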

Where naive RAG breaks

Every failed RAG system fails in one of a handful of predictable ways, and almost none of them are the model’s fault. They’re retrieval problems wearing a generation costume — the model can only be as good as the passages you give it, and naive pipelines give it bad ones.

  • Chunking that splits meaning. Cut a document every 500 characters and you’ll slice tables in half and orphan the sentence that answered the question. Chunk on structure — sections, paragraphs — not on a fixed character count.
  • Embeddings that miss the match. The question “how do I cancel?” and the passage titled “ending your subscription” are semantically close but lexically distant. A weak embedding model won’t connect them, and retrieval quietly fails.
  • No reranking. Vector similarity gets you to the neighborhood, not to the door. The single most relevant chunk is often ranked third or fifth; without a reranker, it never reaches the model.
  • Retrieving similarity, not relevance. The closest vectors aren’t always the most useful — a chunk can be on-topic and useless. Top-k by distance alone pulls in near-duplicates and misses complementary context.
  • A stale or unfiltered index. Serving last quarter’s docs, or every tenant’s data to every user, turns retrieval into a liability. Freshness and metadata filtering aren’t optional in production.
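
The chunking failure is the easiest one to see in code. The sketch below contrasts a fixed-size splitter with one that packs whole paragraphs up to a size budget. The function names and the `max_chars` parameter are illustrative, and splitting on blank lines is an assumption about how your documents mark paragraphs.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` characters, wherever that lands."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs into chunks of up to `max_chars` characters.

    A paragraph is never split. One that is longer than `max_chars`
    becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Plans renew monthly.\n\n"
    "To end your subscription, open Settings and select Cancel plan.\n\n"
    "Refunds take 5 days."
)
answer_sentence = "To end your subscription, open Settings and select Cancel plan."

fixed = fixed_chunks(text, 40)
structured = paragraph_chunks(text, 90)

# The fixed splitter cuts the answering sentence in two; the structured one keeps it whole.
print(any(answer_sentence in c for c in fixed))
print(any(answer_sentence in c for c in structured))
```

Real documents usually need more than blank-line splitting (headings, tables, lists), but the principle is the same: let the document's structure pick the boundaries, and use the size limit only as a budget.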

A retrieval step in code

Here’s the core of a retrieval step in Python — embed the query, search the vector store, and assemble a grounded prompt. It’s deliberately small, because the interesting work isn’t the plumbing; it’s everything around it.

python
# retrieve.py - the heart of a RAG retrieval step.
def answer(question: str, store, embed, rerank, model) -> str:
    # 1. Embed the question with the SAME model used for the chunks.
    q_vec = embed(question)

    # 2. Pull more than you need, then rerank down to the best few.
    candidates = store.search(q_vec, top_k=20)
    top = rerank(question, candidates)[:4]

    # 3. Ground the model: answer only from the retrieved context.
    #    Assumes each retrieved chunk is a plain string.
    context = "\n\n".join(top)            # blank line between chunks
    prompt = (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return model(prompt)

The shape is the whole lesson. You retrieve more candidates than you need (twenty) and rerank down to a handful (four), because retrieval is about recall first and precision second. You force the model to answer from the context and to admit when the answer isn’t there — the single most effective guard against confident hallucination. Everything else is making each of those steps better.
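
For completeness, here is one hedged way the reranking step could look. The `rerank` function takes a pluggable scoring function; `overlap_score` is a toy stand-in invented for this sketch, and in production you would more likely plug in a cross-encoder that scores each (question, passage) pair. The sketch also assumes candidates are plain strings.

```python
import re
from typing import Callable


def overlap_score(question: str, passage: str) -> float:
    """Toy relevance score: fraction of question terms that appear in the passage.

    This is only a placeholder for a learned scorer such as a cross-encoder.
    """
    q_terms = set(re.findall(r"[a-z0-9]+", question.lower()))
    p_terms = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float] = overlap_score,
) -> list[str]:
    """Re-score every candidate against the question and return them best-first."""
    return sorted(candidates, key=lambda c: score(question, c), reverse=True)


# Candidates in the order a vector search might return them:
# the passage that actually answers the question is ranked third.
candidates = [
    "Billing runs on the first of each month.",
    "Plans can be upgraded at any time.",
    "You can cancel your plan from the Settings page.",
]

top = rerank("How can I cancel my plan?", candidates)[:2]
print(top[0])
```

The point of the extra pass is that the scorer sees the question and the passage together, which a nearest-neighbor lookup over precomputed vectors never does.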

RAG isn’t a model feature you switch on; it’s a retrieval system you build. The quality of your answers is decided long before the model sees a single token — in how you chunk, search, and rank.

Making it production-grade

  • Hybrid search. Combine semantic (vector) search with keyword (BM25) search. Vectors catch meaning; keywords catch the exact terms, names, and codes that embeddings blur. Together they beat either alone.
  • A reranker. Add a cross-encoder reranking step between retrieval and generation. It’s the highest-ROI upgrade in most pipelines — it fixes the “right answer ranked fifth” problem directly.
  • Smarter chunking. Chunk on document structure, keep a little overlap so context isn’t severed at the seams, and attach metadata — source, date, section — to every chunk for filtering and citation.
  • Metadata filtering. Narrow the search before it runs — by tenant, recency, document type. Retrieving over the right subset beats reranking the wrong one.
  • Evals. Measure retrieval and answer quality on a golden set, and run it on every change. A RAG pipeline you can’t measure is one you can’t safely improve.
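
Hybrid search needs some rule for merging two ranked lists. The article does not prescribe one; a common choice is reciprocal rank fusion (RRF), sketched below. The document IDs are made up, and `k=60` is the constant conventionally used with RRF, not something tuned for any particular corpus.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic (embedding) search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # keyword (BM25) search

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF works on ranks rather than raw scores, which is convenient here because vector distances and BM25 scores live on different scales and cannot be added directly.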

When RAG isn’t the answer

RAG is for knowledge that changes or is too big to fit in the context window: your docs, your tickets, today’s data. It is not the tool for teaching the model a new skill, tone, or format; that’s what fine-tuning is for, and confusing the two is a common, expensive mistake. As context windows grow, some short-document cases that once needed RAG can now simply be stuffed in whole. At the frontier, agentic retrieval, where the model decides when and what to look up, is starting to replace the fixed pipeline. Reach for the simplest thing that grounds the answer.

The teams who win with RAG aren’t the ones with the fanciest model; they’re the ones who treat retrieval as a real engineering problem and measure it like one. Get the chunks right, rank them well, force the model to stay grounded, and put a number on it. Do that and RAG stops being a flaky demo and becomes the most reliable way to make a model speak fluently about things it was never trained on.
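
Putting a number on retrieval can start very small. The sketch below computes two standard retrieval metrics, recall@k and mean reciprocal rank, over a golden set. The chunk IDs and the shape of `golden_set` are invented for illustration; in practice each entry would pair a real question with the chunks a human marked as relevant.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical golden set: what the pipeline returned vs. what it should have found.
golden_set = [
    {"retrieved": ["c1", "c7", "c3"], "relevant": {"c3"}},
    {"retrieved": ["c2", "c5", "c9"], "relevant": {"c2", "c4"}},
]

mean_recall = sum(
    recall_at_k(g["retrieved"], g["relevant"], k=3) for g in golden_set
) / len(golden_set)
mrr = sum(
    reciprocal_rank(g["retrieved"], g["relevant"]) for g in golden_set
) / len(golden_set)

print(f"recall@3 = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

Recall@k tells you whether the right chunk made it into the candidate pool at all; reciprocal rank tells you how close to the top it landed, which is the number a reranker should move.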
