Retrieval Is the Hard Part of RAG, Not Generation

When a RAG app returns a wrong answer, the first instinct is to blame the model. Teams swap one foundation model for another, widen the context window, and rewrite the prompt for the third time. Accuracy moves two or three points and then stalls. The real failure is upstream: the model never saw the right passage, because retrieval never surfaced it. You cannot generate a correct answer from context that does not contain the fact.

I see this on almost every AI-related call. The generation step is rarely the bottleneck; retrieval is.

A hallucination is usually a retrieval miss

On a recent engagement, a clinical knowledge base of roughly 40k PDFs had a wrong-answer rate of ~30%. The engineering team was convinced they needed a bigger model to fix this. We instrumented the pipeline and measured recall@5: how often the correct passage appeared in the top 5 retrieved chunks. From what I recall, it was roughly 55-60%.

That single number reframed the project. The rest of the time, the generator was being handed the wrong evidence, and it still produced a defensible-sounding answer most of the time. No model swap fixes that sort of recall ceiling. Before you touch the LLM, build an evaluation set of 100 to 200 real questions with known-correct source passages, then measure recall@k and MRR (mean reciprocal rank). If you are not measuring retrieval in isolation, you are tuning blind.
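
Both metrics are only a few lines of code. The sketch below is one minimal way to compute them, assuming each question maps to a ranked list of retrieved chunk ids and a set of known-correct chunk ids. The function names and the toy data are illustrative, not taken from the engagement.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any known-correct chunk id appears in the top k, else 0.0."""
    return 1.0 if any(cid in relevant for cid in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first correct chunk, or 0.0 if it never appears."""
    for position, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 5) -> Dict[str, float]:
    """Average recall@k and MRR over every question in the evaluation set."""
    n = len(gold)
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()) / n
    return {f"recall@{k}": recall, "mrr": mrr}


# Hypothetical data: question id -> ranked chunk ids returned by the retriever.
runs = {
    "q1": ["c9", "c2", "c7"],
    "q2": ["c4", "c1", "c8"],
    "q3": ["c3", "c6", "c5"],
}
# Hypothetical labels: question id -> chunk ids known to contain the answer.
gold = {"q1": {"c2"}, "q2": {"c4"}, "q3": {"c0"}}

metrics = evaluate(runs, gold, k=5)
print(metrics)
```

Run this against the retriever alone, with no LLM in the loop, so that a change in the numbers can only come from retrieval.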

Chunking sets the ceiling everything else lives under

The most common mistake is fixed-size chunking: split every document into 512-token windows and embed each one. It is simple and it quietly destroys recall. A fixed window cuts tables in half, separates a clause from the heading that gives it meaning, and strands the answer across two chunks so neither one is retrievable on its own.
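
A toy illustration of the failure, assuming whitespace-split words as a stand-in for real tokens and a deliberately tiny window so the effect is visible. The document text and the fixed_size_chunks helper are made up for the example.

```python
from typing import List


def fixed_size_chunks(tokens: List[str], size: int) -> List[List[str]]:
    """Split a token list into consecutive windows of `size` tokens, ignoring structure."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Toy document; whitespace tokens stand in for real tokenizer output.
text = "Section 4 Paediatric Administration The recommended dosage for children is 5mg daily"
chunks = fixed_size_chunks(text.split(), size=5)

for chunk in chunks:
    print(" ".join(chunk))
```

The heading lands in the first window, "dosage" in the second, and "5mg" in the third, so no single chunk can answer a question about the paediatric dose.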

It sounds obvious in retrospect, but here is what helped:

Structure-aware splitting. We chunked on document structure (headings, sections, list boundaries) instead of raw token counts. For PDFs this meant parsing layout first, not treating the file as a flat string.
Chunk overlap of about 10 to 15%. Enough to keep a sentence that straddles a boundary retrievable from both sides.
Contextual headers. We prepended the section title and document title to each chunk before embedding. A chunk that reads "dosage: 5mg" is useless without "Section 4: Paediatric Administration" attached.

Structure-aware chunking alone took recall@5 up to ~74%, with no change to the model or the vector store.
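
The three changes above can be sketched together. This is a minimal version under stated assumptions: a layout-aware parser has already split the document into titled sections (the Section type here is hypothetical), and windows are counted in whitespace-split words rather than model tokens. It is not the pipeline we shipped, only the shape of it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    title: str
    body: str  # section text, as extracted by a layout-aware parser (assumed)


def chunk_section(doc_title: str, section: Section,
                  max_words: int = 200, overlap: float = 0.125) -> List[str]:
    """Split one section into overlapping word windows, each prefixed with its headers."""
    words = section.body.split()
    # Advance by less than a full window so roughly `overlap` of each window repeats.
    step = max(1, int(max_words * (1 - overlap)))
    header = f"{doc_title} > {section.title}\n"
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(header + " ".join(window))
        if start + max_words >= len(words):
            break  # this window already reached the end of the section
    return chunks


def chunk_document(doc_title: str, sections: List[Section], **kwargs) -> List[str]:
    """Chunk on section boundaries so no chunk ever spans two sections."""
    return [c for s in sections for c in chunk_section(doc_title, s, **kwargs)]


# Hypothetical input, reusing the example from the list above.
sections = [Section("Section 4: Paediatric Administration", "dosage: 5mg")]
chunks = chunk_document("Clinical Formulary", sections)
print(chunks[0])
```

Because the header is part of the embedded text, a query about paediatric dosing can now match a chunk whose body never mentions the word "paediatric".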

Dense vectors miss exact terms. Add lexical search.

Embedding similarity is strong on meaning and weak on specifics. It will happily rank a semantically related passage above the one containing the exact drug name, error code, or contract clause the user typed. Dense retrieval does not know that CVE-2026-1184 is a string that must match exactly.

The fix is hybrid search. Run dense vector search and lexical search (BM25 or similar) in parallel, then fuse the rankings. Postgres with pgvector for embeddings and a tsvector GIN index for full text covers both in one database, with no extra infrastructure. Its built-in ts_rank is not BM25 proper, but it does the same job of rewarding exact term matches. Reciprocal Rank Fusion combines the two result sets without tuning a weight:

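-- Reciprocal Rank Fusion over dense + lexical rankings.
-- :query_vec is the query embedding, :q is the raw query text.
WITH dense AS (
  -- Top 50 chunks by cosine distance (pgvector's <=> operator)
  SELECT id, row_number() OVER (ORDER BY embedding <=> :query_vec) AS rank
  FROM chunks ORDER BY embedding <=> :query_vec LIMIT 50
),
lexical AS (
  -- Top 50 full-text matches; ORDER BY makes the LIMIT keep the best 50, not an arbitrary 50
  SELECT id, row_number() OVER (ORDER BY ts_rank(tsv, websearch_to_tsquery(:q)) DESC) AS rank
  FROM chunks WHERE tsv @@ websearch_to_tsquery(:q)
  ORDER BY rank LIMIT 50
)
-- 60 is the conventional RRF smoothing constant
SELECT id, sum(1.0 / (60 + rank)) AS score
FROM (SELECT * FROM dense UNION ALL SELECT * FROM lexical) r
GROUP BY id ORDER BY score DESC LIMIT 20;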
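
The same fusion can be done in application code, which is handy when the dense and lexical rankings come from different stores. A minimal Python sketch, assuming each ranking is a best-first list of chunk ids. The function name and sample ids are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_n: int = 20) -> List[str]:
    """Fuse several best-first rankings by summing 1 / (k + rank) per chunk id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


dense = ["a", "b", "c"]    # hypothetical dense ranking
lexical = ["c", "a", "d"]  # hypothetical lexical ranking
fused = reciprocal_rank_fusion([dense, lexical])
print(fused)  # ['a', 'c', 'b', 'd']
```

Chunks that appear in both lists ("a" and "c") outrank chunks that appear in only one, which is the behaviour hybrid search relies on.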

Then rerank: take the top 20 fused candidates and run them through a cross-encoder reranker such as a self-hosted bge-reranker-v2, or a managed rerank API. A cross-encoder reads the query and chunk together rather than comparing two precomputed vectors, so it is far more precise. You only run it on 20 candidates, so the latency cost is small (about 40ms added to retrieval in our case).
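
The rerank step itself is a score-and-sort over (query, chunk) pairs. The sketch below shows that shape with the scorer passed in as a function, so a self-hosted cross-encoder or a managed API can be plugged in behind the same interface. The toy_overlap_score function is a placeholder so the example runs on its own; it is not a cross-encoder and should not be used as one.

```python
from typing import Callable, List, Tuple


def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair jointly and keep the top_n highest-scoring chunks."""
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, text: str) -> float:
    """Placeholder scorer: fraction of query words present in the chunk. NOT a cross-encoder."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Hypothetical fused candidates.
candidates = [
    "adult dosage guidance for the same drug",
    "paediatric dosage is 5mg daily",
    "storage and handling instructions",
]
top = rerank("paediatric dosage", candidates, toy_overlap_score, top_n=2)
print(top)
```

In a real pipeline, score_pair would wrap a single batched call to the reranker model for all 20 candidates rather than one call per pair.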

Results

Stacking these changes:

recall@5 went from ~55-60% to ~90%.
End-to-end wrong-answer rate dropped from ~30% to 9%.
Median retrieval latency rose from 70ms to 110ms, well inside budget.
Zero change to the foundation model or the prompt.

The lesson holds across every RAG system I have worked on, including our own internal tools: retrieval quality is the lever, and it is measurable. Fix the evidence you feed the model before you spend a cent on a bigger one.

If your RAG pipeline is plateauing and you are not sure whether the problem is retrieval or generation, book a strategy call and we will instrument it and find out.