
What Is RAG? Retrieval-Augmented Generation Explained

#rag #retrieval-augmented-generation #vector-database #embeddings #llm #knowledge

The Problem RAG Solves

LLMs have two fundamental limitations when it comes to knowledge:

  1. Knowledge cutoff — they only know what was in their training data, which has a cutoff date
  2. Hallucination — when uncertain about a fact, they generate plausible-sounding but potentially false answers

Ask GPT-4: “What are the specs of our new product launched last week?” — it can’t answer. Ask it “What’s the boiling point of element 119?” — it might confidently invent a number.

RAG (Retrieval-Augmented Generation) solves both problems by giving the LLM access to a relevant knowledge base at query time.

The Core Idea

Instead of relying purely on the LLM’s memorized knowledge, RAG works in two phases:

User question → [Retrieve relevant documents] → [Generate answer from documents]

The LLM’s job shifts from “recall a fact” to “answer this question given these documents.” This is far more reliable — and verifiable.

How RAG Works: Step by Step

Step 1: Ingestion (Offline, Run Once)

Your documents are processed and stored in a vector database:

Documents (PDFs, text, web pages)
        ↓
Text Splitter (chunk into 500-token pieces)
        ↓
Embedding Model (convert each chunk to a vector)
        ↓
Vector Database (store vectors + original text)

An embedding model converts text to a high-dimensional vector (e.g., 1536 numbers for OpenAI’s text-embedding-3-small) that captures semantic meaning. Similar sentences produce similar vectors.

Step 2: Retrieval (Online, Every Query)

User question
        ↓
Embed the question (same embedding model)
        ↓
Vector similarity search (find top-K closest chunks)
        ↓
Return relevant text chunks

Similarity is typically measured by cosine similarity — the angle between two vectors. Chunks with high similarity to the query are semantically related.
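As a quick numeric illustration (with made-up 3-dimensional vectors standing in for real embeddings, which have hundreds of dimensions), cosine similarity can be computed like this:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D "embeddings" for illustration only
query     = [0.9, 0.1, 0.0]
related   = [0.8, 0.2, 0.1]  # points in nearly the same direction
off_topic = [0.0, 0.1, 0.9]  # points in a very different direction

print(cosine_similarity(query, related))    # close to 1.0
print(cosine_similarity(query, off_topic))  # close to 0.0
```

A small angle between vectors means a score near 1.0; unrelated text lands near 0.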

Step 3: Generation

[System prompt]
[Retrieved chunks as context]
[User question]
        ↓
LLM generates an answer based ONLY on the provided context
        ↓
Response to user

Minimal Working Example

from openai import OpenAI
import numpy as np

client = OpenAI()

# ── Step 1: Ingestion ──────────────────────────────────────────────
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Products must be in original condition for a full refund.",
    "Digital products are non-refundable once downloaded.",
    "Shipping costs are non-refundable.",
    "To initiate a return, email [email protected] with your order number.",
]

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

# Embed all documents
doc_embeddings = [get_embedding(doc) for doc in documents]

# ── Step 2: Retrieval ──────────────────────────────────────────────
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, top_k: int = 3) -> list[str]:
    query_embedding = get_embedding(query)
    scores = [cosine_similarity(query_embedding, de) for de in doc_embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [documents[i] for i in top_indices]

# ── Step 3: Generation ─────────────────────────────────────────────
def rag_answer(question: str) -> str:
    context_chunks = retrieve(question)
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer questions based ONLY on the context provided. "
                    "If the context doesn't contain the answer, say so clearly."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

# Test
print(rag_answer("Can I return a downloaded ebook?"))
# → "No, digital products are non-refundable once downloaded."

print(rag_answer("How long do I have to return something?"))
# → "You can return items within 30 days of purchase."

In production, you’d use a proper vector database (Pinecone, Weaviate, Chroma) instead of in-memory lists.

The Vector Database Role

A vector database is purpose-built for similarity search at scale:

Database | Type                 | Best For
---------|----------------------|----------------------------
Pinecone | Managed cloud        | Production, scalability
Weaviate | Self-hosted / cloud  | Hybrid search
Chroma   | Local / embedded     | Development, small datasets
pgvector | PostgreSQL extension | Already using Postgres
FAISS    | In-memory library    | Research, prototyping

All work the same way conceptually: store vectors, return nearest neighbors.
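That shared concept can be sketched in a few lines of pure NumPy. The `MiniVectorStore` class below is a toy illustration, not any real database's API:

```python
import numpy as np

class MiniVectorStore:
    """Toy illustration of what a vector database does: store vectors
    alongside their text, and return the nearest neighbors to a query."""

    def __init__(self):
        self.vectors = []  # stored embedding vectors
        self.texts = []    # parallel list of original text chunks

    def add(self, vector, text):
        v = np.asarray(vector, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))  # normalize once at insert
        self.texts.append(text)

    def query(self, vector, top_k=3):
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        # With normalized vectors, the dot product equals cosine similarity
        scores = np.array([v @ q for v in self.vectors])
        top = np.argsort(scores)[::-1][:top_k]
        return [(self.texts[i], float(scores[i])) for i in top]

# Usage with toy 2-D vectors standing in for real embeddings
store = MiniVectorStore()
store.add([1.0, 0.0], "refund policy")
store.add([0.9, 0.1], "return window")
store.add([0.0, 1.0], "shipping rates")
print(store.query([1.0, 0.05], top_k=2))
```

Real vector databases add indexing structures (e.g., approximate nearest-neighbor indexes) so this search stays fast at millions of vectors, but the interface is the same.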

Chunking Strategy

How you split documents critically affects RAG quality:

Fixed-size chunks (simplest):

chunk_size = 500  # tokens
overlap = 50      # tokens overlap between chunks

Sentence-aware splitting (better): Split at sentence boundaries to avoid cutting mid-thought.

Semantic splitting (best quality, slowest): Use another LLM to split at semantic boundaries — each chunk covers one complete idea.

Rule of thumb:

  • Short chunks (256-512 tokens): better retrieval precision
  • Long chunks (1024+ tokens): more context for generation
  • 50-token overlap: prevents splitting key information
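A minimal fixed-size chunker with overlap might look like this (splitting on words as a rough stand-in for tokens; a real pipeline would count tokens with a tokenizer such as tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks of `chunk_size` words,
    with `overlap` words repeated between consecutive chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks

# Example: 12 words, chunk_size=5, overlap=2 -> chunks start at words 0, 3, 6, 9
doc = "one two three four five six seven eight nine ten eleven twelve"
for c in chunk_text(doc, chunk_size=5, overlap=2):
    print(c)
```

The overlap means the last few words of each chunk reappear at the start of the next one, so a fact that straddles a boundary is still retrievable from at least one chunk.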

When RAG Is (and Isn’t) the Answer

Use RAG when:

  • Your knowledge base updates frequently
  • Questions require specific factual accuracy (product specs, legal text)
  • You need to cite sources
  • Your knowledge doesn’t fit in a context window

Don’t use RAG when:

  • Your knowledge fits in a single context window (just stuff it in the prompt)
  • Questions require synthesizing all your data (aggregate queries)
  • Real-time data is needed (use function calling + API instead)
  • Pure reasoning without external knowledge is needed

RAG vs. Fine-tuning

A common point of confusion: should I use RAG or fine-tune?

                          | RAG                     | Fine-tuning
--------------------------|-------------------------|----------------------
Knowledge updates         | Easy — just re-ingest   | Hard — retrain
Source citations          | Built-in                | Not available
Cost at scale             | Per-query retrieval     | One-time training
Factual accuracy          | High (grounded in docs) | Can still hallucinate
Learns new formats/styles | No                      | Yes

Short answer: RAG for facts. Fine-tuning for style/behavior. Combine both for best results.

Frequently Asked Questions

How many chunks should I retrieve (top-K)?

Start with 5. Too few and you might miss the answer. Too many and you dilute the context with irrelevant text. Reranking (scoring retrieved chunks for relevance) helps you retrieve more and filter down.

What if my question spans multiple documents?

Use sub-question decomposition: break the question into sub-questions, retrieve for each, then synthesize. Both LlamaIndex (SubQuestionQueryEngine) and LangChain support this.

Can RAG work with images and tables?

Yes, with multimodal embeddings. For tables in PDFs, tools like LlamaParse extract structured markdown. For images, multimodal models can generate text descriptions for embedding.

How do I evaluate RAG quality?

Key metrics:

  • Retrieval recall: does the retrieved context contain the answer?
  • Answer faithfulness: is the answer grounded in the context?
  • Answer relevance: does the answer address the question?

Tools: RAGAS (automated RAG evaluation), LangSmith, Arize Phoenix.
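Retrieval recall in particular is easy to measure yourself, given a small labeled set of questions and the IDs of the documents that answer them. Everything below (`eval_set`, `fake_retrieve_ids`) is hypothetical scaffolding; you would plug in your real retriever:

```python
def retrieval_recall(eval_set, retrieve_ids, top_k=5):
    """Fraction of questions for which at least one relevant
    document appears in the top-K retrieved results."""
    hits = 0
    for question, relevant_ids in eval_set:
        retrieved = set(retrieve_ids(question, top_k))
        if retrieved & set(relevant_ids):
            hits += 1
    return hits / len(eval_set)

# Hypothetical labeled set: (question, IDs of documents containing the answer)
eval_set = [
    ("Can I return a downloaded ebook?", {"doc_digital"}),
    ("How long do I have to return something?", {"doc_30_days"}),
]

# Stand-in retriever for illustration — replace with your real retrieval function
def fake_retrieve_ids(question, top_k):
    return ["doc_digital", "doc_shipping"] if "ebook" in question else ["doc_shipping"]

print(retrieval_recall(eval_set, fake_retrieve_ids))  # 0.5 — one of two questions hit
```

If recall is low, no amount of prompt tuning will help: the answer never reaches the LLM, so fix chunking or embeddings first.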

My RAG system gives wrong answers — how do I debug?

  1. Check retrieval: print what was retrieved. If the right chunks aren’t retrieved, fix chunking/embedding.
  2. Check the prompt: is the LLM instructed to stay grounded in context?
  3. Check the model: smaller models are worse at synthesizing from context. Try GPT-4o.
  4. Check chunk quality: if chunks are too small, they lose context. Increase chunk size or overlap.
