Paper Overview
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D. (2020) Facebook AI Research (FAIR) Advances in Neural Information Processing Systems (NeurIPS 2020)
Why it matters: This paper coined the term “RAG” and defined the architecture that all modern RAG systems are built on. Before RAG, the only way to give LLMs access to external knowledge was fine-tuning or prompting with documents. RAG introduced a learnable retrieval mechanism trained end-to-end with the generator.
The Problem: LLMs Have Frozen Knowledge
The 2020 paper was written before LLMs became mainstream (GPT-3 was released the same year). The specific problem they addressed:
“Parametric” vs. “Non-parametric” memory
- Parametric: knowledge stored in model weights (frozen after training)
- Non-parametric: knowledge in an external store (can be updated)
LLMs are purely parametric. When you ask a model a factual question, it answers from whatever it memorized during training. This creates problems:
- The knowledge is stale — the model knows nothing after its training cutoff
- The knowledge is opaque — you can’t cite or verify what the model used
- The knowledge is rigid — updating requires expensive retraining
The solution: combine a parametric LM (for generation) with a non-parametric retrieval component (for knowledge lookup).
The RAG Architecture
Query
↓
Dense Retriever (DPR — Dense Passage Retrieval)
↓ retrieves top-k documents from Wikipedia
Generator (BART seq2seq model)
↓ generates answer conditioned on query + retrieved docs
Answer
Two sub-models are trained:
- Retriever (Dense Passage Retrieval, DPR): encodes queries and documents into dense vectors, retrieves top-k most relevant
- Generator (BART): generates the answer sequence, conditioned on the query and retrieved documents
RAG-Sequence vs. RAG-Token
The paper proposes two variants:
RAG-Sequence: Use the same set of retrieved documents for the entire output sequence
p(y|x) ≈ Σ_{z ∈ top-k} p_η(z|x) · p_θ(y|x, z)
where p_η(z|x) is the retriever's distribution over documents and p_θ(y|x, z) is the generator's probability of the full output sequence y. For each query x, retrieve the top-k documents z and marginalize over them to score a complete sequence.
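As a toy sanity check (the probabilities below are invented for illustration, not from the paper), the marginalization is a simple weighted sum:

```python
# Toy RAG-Sequence marginalization with made-up probabilities.
# Two retrieved documents z1, z2: the retriever assigns p(z|x),
# the generator assigns a sequence likelihood p(y|x, z) under each doc.

p_z_given_x = {"z1": 0.7, "z2": 0.3}   # retriever distribution over top-k docs
p_y_given_xz = {"z1": 0.9, "z2": 0.2}  # generator likelihood of y under each doc

# p(y|x) = sum_z p(z|x) * p(y|x, z)
p_y_given_x = sum(p_z_given_x[z] * p_y_given_xz[z] for z in p_z_given_x)
print(p_y_given_x)  # 0.7*0.9 + 0.3*0.2 ≈ 0.69
```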
RAG-Token: Marginalize over documents separately at each generation step
p(y|x) ≈ ∏_i Σ_{z ∈ top-k} p_η(z|x) · p_θ(y_i|x, z, y_{1:i-1})
Documents are still retrieved once per query (conditioned on x alone), but because the mixture is applied per token, each part of the output can draw on a different document. More flexible, but more expensive to compute.
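The per-token version can be sketched the same way (again with invented numbers): marginalize over documents at each step, then multiply the per-token marginals:

```python
# Toy RAG-Token marginalization: 2 documents, 2 output tokens.
p_z = [0.7, 0.3]    # retriever distribution over top-k docs (conditioned on x only)
p_tok = [           # p(y_i | x, z, y_<i) for each token i, under each doc z
    [0.9, 0.2],     # token 1 under z1, z2  -> marginal 0.69
    [0.5, 0.8],     # token 2 under z1, z2  -> marginal 0.59
]

p_y = 1.0
for probs in p_tok:
    # marginalize over documents at this generation step
    p_y *= sum(pz * pt for pz, pt in zip(p_z, probs))
print(round(p_y, 4))  # 0.69 * 0.59 = 0.4071
```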
In practice, RAG-Sequence is more commonly used in modern implementations.
The Dense Passage Retrieval (DPR) Component
DPR was itself a significant contribution, introduced in a companion paper (Karpukhin et al., 2020). Prior to DPR, most retrieval used sparse methods (BM25 — keyword matching). DPR uses two separate BERT encoders — one for queries, one for documents:
```python
# Conceptual DPR implementation (requires the transformers library
# and a model download on first use)
import torch
from transformers import (
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    DPRContextEncoder, DPRContextEncoderTokenizer,
)

question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def encode_query(question: str) -> torch.Tensor:
    inputs = question_tokenizer(question, return_tensors="pt")
    return question_encoder(**inputs).pooler_output  # shape: (1, 768)

def encode_document(document: str) -> torch.Tensor:
    inputs = ctx_tokenizer(document, return_tensors="pt")
    return ctx_encoder(**inputs).pooler_output  # shape: (1, 768)

# Similarity: dot product between query and document embeddings
def retrieve(query_vec: torch.Tensor, doc_vecs: torch.Tensor, top_k: int = 5):
    scores = (query_vec @ doc_vecs.T).squeeze(0)
    return scores.topk(top_k)
```
DPR’s key advantage: learns to retrieve what actually helps answer questions, not just lexically similar text.
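Stripped of the model, the retrieval step itself is just a dot-product ranking. A dependency-free sketch with made-up 3-d embeddings (stand-ins for real 768-d DPR vectors):

```python
# Toy dense retrieval: rank documents by dot product with the query vector.
# The 3-d vectors are invented stand-ins for real 768-d DPR embeddings.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [0.9, 0.1, 0.0]
docs = {
    "doc_paris":  [1.0, 0.0, 0.1],  # semantically close to the query
    "doc_berlin": [0.2, 0.9, 0.0],
    "doc_tokyo":  [0.1, 0.1, 1.0],
}

# Sort document IDs by descending similarity to the query
ranked = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
print(ranked)  # doc_paris scores highest (0.9), then doc_berlin (0.27), doc_tokyo (0.10)
```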
Key Results
The paper tested on Open-Domain Question Answering benchmarks:
| Benchmark | Previous SOTA | RAG |
|---|---|---|
| NaturalQuestions (Exact Match) | 41.5% | 44.5% (new SOTA) |
| WebQuestions (Exact Match) | 42.7% | 45.5% |
| TriviaQA (Exact Match) | 67.7% | 68.0% |
| MS-MARCO (BLEU-1) | — | 62.5% |
RAG matched or exceeded SOTA on most benchmarks while having the critical advantage of updatable knowledge — you can update the document store without retraining the model.
Additionally, RAG hallucinated less and produced more specific and factually accurate text than purely parametric generators.
What Changed Between 2020 and Today
The 2020 paper used BERT-based dense retrieval + BART generation. Modern RAG systems differ significantly:
| Aspect | Paper (2020) | Modern RAG (2024+) |
|---|---|---|
| Generator | BART (400M params) | GPT-4, Claude (orders of magnitude larger) |
| Retriever | DPR (BERT-based) | BGE, E5, Cohere Embed |
| Vector DB | FAISS (in-memory) | Pinecone, Weaviate, pgvector |
| Retrieval | Dense only | Hybrid (dense + BM25) |
| Reranking | None | Cross-encoder rerankers |
| Chunking | Fixed sentences | Semantic, hierarchical |
| Integration | End-to-end trained | Modular, no joint training |
The key architectural difference: modern RAG is modular. The retriever and generator are not jointly trained — you plug in any embedding model and any LLM. This is far more practical and achieves better results due to better base models.
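A minimal sketch of that modularity (the `embed` and `generate` functions here are trivial stand-ins for illustration — a real system would plug in an embedding model and an LLM API behind the same interfaces):

```python
# Modular RAG pipeline sketch: retriever and generator are swappable functions.

def embed(text: str) -> dict:
    # Stand-in embedder: bag-of-words counts. A real system would call
    # an embedding model (e.g., BGE or E5) here instead.
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def similarity(a: dict, b: dict) -> float:
    return sum(a[w] * b.get(w, 0) for w in a)

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(corpus, key=lambda d: similarity(q, embed(d)), reverse=True)[:top_k]

def generate(query: str, context: list[str]) -> str:
    # Stand-in generator: a real system would prompt an LLM with the
    # query plus the retrieved context.
    return f"Answer to {query!r} using {len(context)} retrieved passages."

corpus = [
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
    "Paris is the capital of France.",
]
context = retrieve("capital of France", corpus)
print(generate("capital of France", context))
```

Because the two stages only communicate through plain function calls, either one can be replaced without retraining anything — the opposite of the paper's jointly trained setup.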
The RAG vs. Fine-tuning Debate
The paper’s finding on this is nuanced:
“RAG outperforms parametric-only seq2seq models… and provides more specific, consistent, and factual generations than a state-of-the-art parametric-only counterpart.”
But the paper also shows that for tasks where the LLM already has strong parametric knowledge, fine-tuning can compete with RAG.
The practical guidance that emerged from this work (and subsequent research):
- RAG: for dynamic knowledge, external documents, citation-required tasks
- Fine-tuning: for style, format, domain-specific behavior
- Both: for the best results in production systems
Influence on Modern AI
The RAG architecture introduced in this paper is now ubiquitous:
- ChatGPT with web browsing — essentially RAG with Bing
- GitHub Copilot — RAG over your codebase
- Perplexity AI — RAG over the web, with citations
- Enterprise AI — RAG over company documents (the #1 LLM use case in 2024-2025)
- Every LlamaIndex / LangChain tutorial — RAG pipeline
The paper’s terminology (retriever, generator, RAG) became the standard vocabulary of the field.
Frequently Asked Questions
Is RAG still relevant when models have million-token context windows?
Largely yes. Even with 1M-token windows, RAG is more cost-effective than stuffing everything into context. At scale, retrieving 5 relevant chunks is far cheaper than processing thousands of pages per query. For narrow use cases (e.g., question answering over a single document), long context can replace retrieval.
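A rough back-of-the-envelope comparison makes the cost argument concrete (the price and token counts below are illustrative assumptions, not quoted rates):

```python
# Illustrative cost comparison: long-context stuffing vs. retrieval.
# All numbers are assumptions for the sake of the arithmetic.

price_per_1k_tokens = 0.01        # hypothetical input price, dollars per 1k tokens

full_corpus_tokens = 500_000      # stuffing an entire document collection into context
rag_chunk_tokens = 5 * 500        # RAG: 5 retrieved chunks of ~500 tokens each

cost_long_context = full_corpus_tokens / 1000 * price_per_1k_tokens
cost_rag = rag_chunk_tokens / 1000 * price_per_1k_tokens

print(f"long context: ${cost_long_context:.2f} per query")   # $5.00
print(f"RAG:          ${cost_rag:.4f} per query")            # $0.0250
print(f"ratio: {cost_long_context / cost_rag:.0f}x")         # 200x
```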
What is the difference between RAG and search?
Traditional search returns ranked results for humans to read. RAG retrieves documents and has an LLM synthesize an answer. RAG produces a direct answer with source grounding; search requires the user to find the answer themselves.
Can RAG work without training the retriever?
Yes — modern RAG typically uses pre-trained embedding models (no fine-tuning needed). The 2020 paper trained DPR end-to-end with the generator, but this is rarely done today. Off-the-shelf embedding models work well enough for most applications.
What is “advanced RAG” vs. basic RAG?
Basic RAG: embed → retrieve → generate. Advanced RAG adds: query expansion, reranking, hybrid search, iterative retrieval, and post-retrieval filtering. LlamaIndex has built-in support for many of these.
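One of those additions, hybrid search, is often implemented by merging the dense and BM25 rankings with reciprocal rank fusion (RRF). A minimal sketch (the document IDs and rankings are made up):

```python
# Reciprocal rank fusion: merge ranked lists into one hybrid ranking.
# score(d) = sum over lists of 1 / (k + rank_of_d_in_list), k = 60 by convention.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d3", "d1", "d2"]  # hypothetical dense-retrieval ranking
bm25_hits = ["d1", "d4", "d3"]   # hypothetical BM25 ranking

print(rrf([dense_hits, bm25_hits]))  # d1 wins: it ranks high in both lists
```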
Does the original Facebook code still work?
The Hugging Face Transformers library has an implementation (RagSequenceForGeneration) that preserves the original architecture. However, for production systems, you’d build a modern modular RAG pipeline using LlamaIndex or LangChain.
Next Steps
- What Is RAG? — Conceptual overview with working code
- LlamaIndex Advanced Retrieval Techniques — Production RAG with reranking and hybrid search
- Pinecone vs Weaviate — Choose a vector database for your RAG system