How Retrieval-Augmented Generation (RAG) Works — and When It Fails

A practical retrieval-success playbook for production RAG systems.

Most “RAG problems” aren’t actually LLM problems. They’re retrieval problems masquerading as LLM problems.

The mental model that keeps you honest

RAG is a two-stage pipeline: retrieve evidence, then generate an answer from that evidence.

RAG accuracy ≈ retrieval accuracy × generation accuracy
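
To make that concrete: if retrieval surfaces the right evidence for 80% of questions, and the generator answers correctly 90% of the time when it has that evidence, end-to-end accuracy tops out around 0.8 × 0.9 = 0.72.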

If retrieval returns the wrong truth, the LLM cannot “think” its way back to the right answer. It will produce something that sounds plausible—and that’s exactly why RAG is dangerous when the retrieval layer is weak.


What Actually Happens Under the Hood

Let’s walk through the mechanics. Not a marketing diagram—an engineering diagram.

1) Ingestion: chunk → embed → store

Your documents are split into chunks, embedded into vectors, and stored in a vector database with metadata. The chunk is the unit of retrievability. If the answer spans multiple chunks, you’ve already increased your failure rate.

Chunk:    "Database columns must be snake_case starting with tbl_ prefix"
Embedding: [0.023, -0.91, ...]
Metadata:  { doc_type: "architecture_standard", version: "v3" }
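
A minimal sketch of that pipeline, assuming the sentence-transformers library as a stand-in embedding model and a plain Python list as the "vector database":

# ingestion sketch: chunk -> embed -> store (an in-memory list stands in for a vector DB)
from sentence_transformers import SentenceTransformer  # assumption: any embedding model works here

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text, max_chars=800):
    # naive fixed-size splitting; real pipelines usually split on headings/sections
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ingest(doc_text, metadata, index):
    pieces = chunk(doc_text)
    vectors = model.encode(pieces)  # one embedding vector per chunk
    index.extend(zip(vectors, pieces, [metadata] * len(pieces)))

index = []
ingest("Database columns must be snake_case starting with tbl_ prefix",
       {"doc_type": "architecture_standard", "version": "v3"}, index)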

2) Query → embedding

Your user’s question becomes an embedding vector. The retriever isn’t reasoning about rules or intent; it’s matching semantic proximity.

3) Vector search: “sounds like” not “contains”

A vector DB returns nearest neighbors (cosine/dot-product). That answers: “Which chunks sound like this question?” It does not guarantee: “Which chunks contain the answer?”
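
In code, "nearest neighbors" is just a similarity sort. Continuing the in-memory sketch above, with cosine similarity via numpy:

# vector search sketch: rank stored chunks by cosine similarity to the query
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def vector_search(query, index, k=5):
    q = model.encode([query])[0]
    scored = [(cosine(q, vec), text, meta) for vec, text, meta in index]
    # highest similarity means "sounds most like the question", not "contains the answer"
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]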

4) Context assembly: top-K, rerank, budget

You concatenate the top-K chunks, often rerank them, and then pack the final context within a token budget. This is where good evidence gets dropped, duplicates crowd out signal, or the “right” chunk is buried behind more similar—but less useful—chunks.
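
A sketch of the assembly step over the (score, text, metadata) tuples from the search above: dedup, keep relevance order, and stop at a budget (approximated in characters here, an assumption for brevity):

# context assembly sketch: dedup and pack top-ranked chunks into a fixed budget
def assemble_context(ranked_chunks, budget_chars=6000):
    seen, packed, used = set(), [], 0
    for score, text, meta in ranked_chunks:      # already sorted by relevance
        if text in seen:                         # duplicates crowd out signal, so drop them
            continue
        if used + len(text) > budget_chars:      # budget exhausted: everything after this is lost
            break
        seen.add(text)
        packed.append(text)
        used += len(text)
    return "\n\n".join(packed)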

5) Generation: answer from evidence (or hallucinate)

The LLM is asked to answer using the provided evidence. Garbage in → confidently phrased garbage out.
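
The generation prompt is where you insist on grounding. A sketch; the wording and the llm() helper are illustrative assumptions, not a canonical template:

# grounded generation sketch: the model must answer from the evidence or say it can't
GROUNDED_PROMPT = """Answer the question using ONLY the evidence below.
Quote and cite the chunk(s) you relied on.
If the evidence does not contain the answer, say so instead of guessing.

Evidence:
{context}

Question: {question}
"""

def generate(llm, question, context):
    # llm() is a hypothetical call into whatever chat/completions API you use
    return llm(GROUNDED_PROMPT.format(context=context, question=question))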


Queries That Retrieve the Right Context

Some query shapes align naturally with semantic similarity. If your RAG workload is mostly these, you can ship a simple system and look like a hero.

Explicit fact-seeking queries (best case)

  • “What is the database column naming standard?”
  • “How do we name S3 buckets?”
  • “What is the retry policy for the payments API?”

These work because the query overlaps with how documentation is written (headings, noun phrases, explicit terms).

Definitions / description queries

Embeddings are strong at descriptive similarity. “Explain the authentication flow” often maps well to the section where your docs explain the authentication flow.

Single-hop questions

If the answer lives in one chunk (“Which team owns ingestion?”), RAG is in its comfort zone.

Queries that mirror author language

If your docs say “snake_case,” queries that use that term will retrieve cleanly. If your users say “underscore naming,” you’ll need help (more on that later).


Queries That Fail Retrieval (The Production Reality)

Most RAG systems break here.

Vague or implicit queries

“How should we store this?” has no anchor. The embedding has nothing to latch onto. The retriever returns “storage-ish” chunks and hopes for the best.

Fix: query rewriting, conversational memory, or asking a clarification question.

Multi-hop reasoning queries

“If we rename database columns, what downstream systems break?” requires multiple documents: standards, ETL, BI, CDC, ownership, deployment. No single chunk contains the answer.

Fix: multi-hop / agentic RAG (we’ll walk through it).

Procedural “How do I…” queries

Operational docs are fragmented. Retrieval finds something “operationally similar,” but not the full sequence.

Fix: query decomposition + reranking + guardrails (and sometimes a checklist-style corpus).

Polarity inversions (negation and constraints)

Docs: “Do not use camelCase.” Query: “Can we use camelCase?” Semantic similarity pulls the right chunk—but the meaning flips if the model mishandles negation.

Fix: make constraints explicit in your chunks; consider a reranker; prompt the generator to quote and cite the constraint.

Overloaded queries

“What’s our naming standard and how does it affect Kafka schemas and Snowflake tables?” Retrieval returns evidence for one clause, not all of them.

Fix: query decomposition and structured answering (answer each sub-question with citations).

Versioned / time-sensitive queries without metadata

“What is the current standard?” is a metadata question. Embeddings can’t infer temporal validity.

Fix: metadata filtering (status=approved, version=v3, effective_date, etc.).

Rare terminology, internal code names, and synonyms

If users say “Project Falcon” and docs say “Payments Platform initiative,” you’ll miss. Embeddings are good at paraphrase, but they can’t retrieve a term that simply doesn’t appear.

Fix: hybrid search (keyword/BM25) plus a synonym/alias map applied during query rewriting.


A Quick Retrieval Success Matrix

Query Type              Retrieval Success         Typical Fix
Explicit factual        High                      Baseline RAG
Definitions             High                      Baseline RAG
Single-hop              High                      Chunking + reranking
Vague                   Low                       Query rewrite / clarify
Multi-hop               Low                       Agentic / multi-hop RAG
Procedural              Medium–Low                Decompose + guardrails
Versioned / temporal    Low (without metadata)    Metadata filters
Negative constraints    Risky                     Explicit constraints + cite

Why Embeddings Fail (Root Causes)

Embeddings are powerful, but they’re not magic. Out of the box, they are:

  • Context-agnostic: they don’t know what else is true in your system.
  • Logic-agnostic: they don’t encode dependencies or “if/then” rules.
  • Time-agnostic: they don’t understand what’s current vs deprecated.

Chunking is destiny. Chunking defines what can be retrieved together. If you split constraints away from the rule, you’re manufacturing wrong answers.

Bad chunking:

Chunk 1: "Columns must use snake_case"
Chunk 2: "Except for legacy tables"

If retrieval only returns Chunk 1, your answer is wrong.
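
One mitigation at ingestion time is to keep exception clauses glued to the rule they modify instead of splitting purely on sentence boundaries. A sketch with a deliberately simple heuristic (the marker list is an assumption; many pipelines solve this by chunking whole sections under their headings):

# constraint-aware chunking sketch: attach "Except..."/"Unless..." sentences to the preceding rule
import re

CONSTRAINT_MARKERS = ("except", "unless", "however", "does not apply")

def chunk_with_constraints(text):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for sentence in sentences:
        if chunks and sentence.lower().startswith(CONSTRAINT_MARKERS):
            chunks[-1] += " " + sentence          # the constraint travels with its rule
        else:
            chunks.append(sentence)
    return chunks

print(chunk_with_constraints("Columns must use snake_case. Except for legacy tables."))
# -> ['Columns must use snake_case. Except for legacy tables.']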

Patterns That Improve Retrieval Success

Query rewriting

Transform vague queries into anchored, domain-specific questions. You can do this with a cheap model (or a rule-based layer) before you ever hit the vector DB.

Before: "How should we store this?"
After:  "What is the approved storage approach for customer PII data?"
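
A sketch of that rewrite layer; the prompt wording and the llm() helper are illustrative assumptions:

# query rewriting sketch: a cheap model anchors vague questions before retrieval
REWRITE_PROMPT = """Rewrite the user's question so it can be answered from internal
engineering documentation. Resolve pronouns using the conversation context and use
the organization's own terminology.

Conversation context: {history}
User question: {question}
Rewritten question:"""

def rewrite_query(llm, question, history=""):
    # llm() is a hypothetical call to a small, cheap model (or swap in a rule-based layer)
    return llm(REWRITE_PROMPT.format(history=history, question=question)).strip()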

Hybrid search (vector + keyword/BM25)

Keyword rescues proper nouns, IDs, versions, error codes. Vector rescues paraphrase. Together beats either alone.
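
One common way to merge the two result lists is reciprocal rank fusion. The sketch below assumes you already have a keyword ranking (e.g., from BM25) and a vector ranking of chunk ids:

# hybrid search sketch: merge keyword and vector rankings with reciprocal rank fusion (RRF)
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of chunk ids per retriever, best first
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# usage: fused_ids = reciprocal_rank_fusion([bm25_ranking, vector_ranking])[:5]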

Metadata filtering

Filter before retrieval. Don’t ask an embedding model to solve governance or recency.

{
  "doc_type": "architecture_standard",
  "status": "approved",
  "version": "v3"
}
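
In code, this just means shrinking the candidate set before any similarity is computed; a sketch over the in-memory index from earlier:

# metadata filtering sketch: restrict the search space, then run vector similarity
def filtered_search(query, index, filters, k=5):
    allowed = [
        (vec, text, meta) for vec, text, meta in index
        if all(meta.get(key) == value for key, value in filters.items())
    ]
    return vector_search(query, allowed, k=k)   # similarity only runs over approved, current docs

results = filtered_search(
    "What is the database column naming standard?",
    index,
    {"doc_type": "architecture_standard", "version": "v3"},
)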

Reranking

If vector search says “these are similar,” a reranker helps answer “which is most relevant.” This is often the highest ROI quality improvement after hybrid + metadata.
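
A sketch using a cross-encoder from sentence-transformers (the model name is an assumption; any pairwise reranker plugs in the same way):

# reranking sketch: score (query, chunk) pairs jointly, then keep only the best few
from sentence_transformers import CrossEncoder   # assumed cross-encoder reranker

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_texts, k=5):
    pairs = [(query, text) for text in candidate_texts]
    scores = reranker.predict(pairs)              # relevance to the question, not just similarity
    ranked = sorted(zip(scores, candidate_texts), reverse=True)
    return [text for _, text in ranked[:k]]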

Query decomposition

If the question contains multiple clauses, break it apart and answer each piece with citations. You’ll stop the system from “averaging” evidence.
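
A sketch of decomposition plus structured answering, reusing the grounded generate() from earlier; the prompt and llm() helper are illustrative assumptions:

# query decomposition sketch: split a multi-clause question, answer each clause with citations
def decompose(llm, question):
    prompt = "Split this question into independent sub-questions, one per line:\n" + question
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def answer_structured(llm, question, retriever):
    sections = []
    for sub_question in decompose(llm, question):
        evidence = retriever(sub_question)                       # retrieve per sub-question
        sections.append(generate(llm, sub_question, evidence))   # grounded, cited answer per clause
    return "\n\n".join(sections)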


A Production-Grade RAG Flow (Single-Hop)

Here’s what “real” RAG looks like when you treat it like a system, not a demo notebook.

┌────────────┐
│   User     │
│  Question  │
└─────┬──────┘
      │
      ▼
┌─────────────────────┐
│ Query Understanding │
│  - Rewrite          │
│  - Decompose        │
│  - Intent classify  │
└─────┬───────────────┘
      │
      ▼
┌─────────────────────┐
│ Retrieval Planner   │
│  - Which corpus?    │
│  - Filters?         │
│  - Single vs multi? │
└─────┬───────────────┘
      │
      ▼
┌──────────────────────────────┐
│ Hybrid Retrieval Layer       │
│  - Vector search             │
│  - Keyword / BM25            │
│  - Metadata filtering        │
└─────┬────────────────────────┘
      │
      ▼
┌─────────────────────┐
│ Reranker            │
│  - Cross-encoder    │
│  - Top-K pruning    │
└─────┬───────────────┘
      │
      ▼
┌─────────────────────┐
│ Context Assembler   │
│  - Dedup            │
│  - Order            │
│  - Token budget     │
└─────┬───────────────┘
      │
      ▼
┌─────────────────────┐
│ LLM Generator       │
│  - Grounded prompt  │
│  - Citations        │
└─────┬───────────────┘
      │
      ▼
┌─────────────────────┐
│ Answer + Confidence │
└─────────────────────┘
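
Strung together, the boxes above become a short function. This composes the earlier sketches in this post; it is one possible stack under those assumptions, not a prescription:

# single-hop pipeline sketch, composing the earlier sketches
def answer(llm, question, index, filters):
    q = rewrite_query(llm, question)                        # query understanding
    candidates = filtered_search(q, index, filters, k=20)   # metadata-filtered retrieval
    best = rerank(q, [text for _, text, _ in candidates])   # cross-encoder reranking
    ranked = [(0.0, text, {}) for text in best]             # adapt to the assembler's tuple shape
    context = assemble_context(ranked)                      # dedup + token budget
    return generate(llm, q, context)                        # grounded generation with citations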

When You Need Multi-Hop (Agentic) RAG

Single-hop RAG is a lookup system. Multi-hop RAG treats retrieval as a reasoning process: ask a sub-question, retrieve evidence, update state, decide the next hop.

┌────────────┐
│   User     │
│  Question  │
└─────┬──────┘
      │
      ▼
┌──────────────────────────┐
│ Question Decomposition   │
│  - Identify dependencies │
│  - Generate sub-queries  │
└─────┬────────────────────┘
      │
      ▼
┌──────────────────────────┐
│ Hop Controller / Agent   │
│  - Track state           │
│  - Decide next hop       │
└─────┬──────────┬─────────┘
      │          │
      ▼          ▼
┌──────────────┐ ┌──────────────┐
│ Retrieval A  │ │ Retrieval B  │
│ (Corpus 1)   │ │ (Corpus 2)   │
└─────┬────────┘ └─────┬────────┘
      │                │
      ▼                ▼
┌──────────────┐ ┌──────────────┐
│ Evidence A   │ │ Evidence B   │
└─────┬────────┘ └─────┬────────┘
      └──────┬─────────┘
             ▼
┌──────────────────────────┐
│ Evidence Synthesis       │
│  - Resolve conflicts     │
│  - Apply constraints     │
└─────┬────────────────────┘
      │
      ▼
┌──────────────────────────┐
│ Final Answer Generation  │
└──────────────────────────┘

A concrete multi-hop example: impact analysis

User question: “If we rename database columns, what downstream systems break?”

A good decomposition looks like:

  1. What systems consume DB columns? (ETL, CDC/Kafka, BI, exports)
  2. How does each system bind schema? (hard-coded SQL, dbt models, schema registry compat)
  3. Which binding modes break on rename? (fail loud vs fail silently)
  4. What rollout pattern makes it safe? (alias/dual-write, versioning, coordination)

The hop controller itself is just a bounded plan → retrieve → extract loop. In pseudo-code (planner, retrieve, and extract_facts stand in for your own components):

# pseudo-Python: the hop controller loop
state = {"enough_evidence": False}
hops, MAX_HOPS = 0, 4                     # cap hops (see guardrails below)

while not state["enough_evidence"] and hops < MAX_HOPS:
    next_question = planner(state)        # plan the next sub-question from the evidence so far
    results = retrieve(next_question)     # hybrid retrieval for that sub-question
    state.update(extract_facts(results))  # merge cited facts; the planner can set enough_evidence
    hops += 1

Production guardrails (hard-won lessons)

  • Cap hops (2–5) to prevent infinite exploration.
  • Cache intermediate evidence because users ask the same sub-questions repeatedly.
  • Force citations per hop; if a hop yields no evidence, fail fast.
  • Prefer determinism: use LLMs to plan the next hop, not to invent facts.

The big takeaway

RAG doesn’t fail because LLMs hallucinate. RAG fails because retrieval returned the wrong truth.

If your users mostly ask “What is…?”, baseline RAG is often enough. The moment they ask “What happens if…?” or “How do we safely…?”, treat retrieval as a multi-step system—with planning, filters, reranking, and guardrails.