Paper Overview
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., Kiela, D. (2020) Facebook AI Research (FAIR) Advances in Neural Information Processing Systems (NeurIPS 2020)
Why it matters: This paper coined the term “RAG” and defined the architecture that all modern RAG systems are built on. Before RAG, the only way to give LLMs access to external knowledge was fine-tuning or prompting with documents. RAG introduced a learnable retrieval mechanism trained end-to-end with the generator.
The Problem: LLMs Have Frozen Knowledge
The 2020 paper was written before LLMs became mainstream (GPT-3 was released the same year). The specific problem they addressed:
“Parametric” vs. “Non-parametric” memory
- Parametric: knowledge stored in model weights (frozen after training)
- Non-parametric: knowledge in an external store (can be updated)
LLMs are purely parametric. When you ask a model a factual question, it answers from whatever it memorized during training. This creates problems:
- The knowledge is stale — the model knows nothing after its training cutoff
- The knowledge is opaque — you can’t cite or verify what the model used
- The knowledge is rigid — updating requires expensive retraining
The solution: combine a parametric LM (for generation) with a non-parametric retrieval component (for knowledge lookup).
The RAG Architecture
Query
↓
Dense Retriever (DPR — Dense Passage Retrieval)
↓ retrieves top-k documents from Wikipedia
Generator (BART seq2seq model)
↓ generates answer conditioned on query + retrieved docs
Answer
Two sub-models are trained:
- Retriever (Dense Passage Retrieval, DPR): encodes queries and documents into dense vectors, retrieves top-k most relevant
- Generator (BART): generates the answer sequence, conditioned on the query and retrieved documents
RAG-Sequence vs. RAG-Token
The paper proposes two variants:
RAG-Sequence: Use the same set of retrieved documents for the entire output sequence
p(y|x) ≈ Σ_{z ∈ top-k} p_η(z|x) · p_θ(y|x, z)
where p_η(z|x) is the retriever's distribution over documents and p_θ(y|x, z) is the generator's probability of the full output sequence y. For each query x, retrieve the top-k documents z and marginalize over them to score a complete sequence.
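As a toy sanity check (the probabilities below are invented for illustration, not from the paper), the marginalization is a simple weighted sum:

```python
# Toy RAG-Sequence marginalization with made-up probabilities.
# Two retrieved documents z1, z2: the retriever assigns p(z|x),
# the generator assigns a sequence likelihood p(y|x, z) under each doc.

p_z_given_x = {"z1": 0.7, "z2": 0.3}   # retriever distribution over top-k docs
p_y_given_xz = {"z1": 0.9, "z2": 0.2}  # generator likelihood of y under each doc

# p(y|x) = sum_z p(z|x) * p(y|x, z)
p_y_given_x = sum(p_z_given_x[z] * p_y_given_xz[z] for z in p_z_given_x)
print(p_y_given_x)  # 0.7*0.9 + 0.3*0.2 ≈ 0.69
```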
RAG-Token: Marginalize over documents separately at each generation step
p(y|x) ≈ ∏_i Σ_{z ∈ top-k} p_η(z|x) · p_θ(y_i|x, z, y_{1:i-1})
Documents are still retrieved once per query (conditioned on x alone), but because the mixture is applied per token, each part of the output can draw on a different document. More flexible, but more expensive to compute.
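The per-token version can be sketched the same way (again with invented numbers): marginalize over documents at each step, then multiply the per-token marginals:

```python
# Toy RAG-Token marginalization: 2 documents, 2 output tokens.
p_z = [0.7, 0.3]    # retriever distribution over top-k docs (conditioned on x only)
p_tok = [           # p(y_i | x, z, y_<i) for each token i, under each doc z
    [0.9, 0.2],     # token 1 under z1, z2  -> marginal 0.69
    [0.5, 0.8],     # token 2 under z1, z2  -> marginal 0.59
]

p_y = 1.0
for probs in p_tok:
    # marginalize over documents at this generation step
    p_y *= sum(pz * pt for pz, pt in zip(p_z, probs))
print(round(p_y, 4))  # 0.69 * 0.59 = 0.4071
```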
In practice, RAG-Sequence is more commonly used in modern implementations.
The Dense Passage Retrieval (DPR) Component
DPR was itself a significant contribution, introduced in a companion paper (Karpukhin et al., 2020). Prior to DPR, most retrieval used sparse methods (BM25 — keyword matching). DPR uses two separate BERT encoders — one for queries, one for documents:
```python
# Conceptual DPR implementation (requires the transformers library
# and a model download on first use)
import torch
from transformers import (
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    DPRContextEncoder, DPRContextEncoderTokenizer,
)

question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def encode_query(question: str) -> torch.Tensor:
    inputs = question_tokenizer(question, return_tensors="pt")
    return question_encoder(**inputs).pooler_output  # shape: (1, 768)

def encode_document(document: str) -> torch.Tensor:
    inputs = ctx_tokenizer(document, return_tensors="pt")
    return ctx_encoder(**inputs).pooler_output  # shape: (1, 768)

# Similarity: dot product between query and document embeddings
def retrieve(query_vec: torch.Tensor, doc_vecs: torch.Tensor, top_k: int = 5):
    scores = (query_vec @ doc_vecs.T).squeeze(0)
    return scores.topk(top_k)
```
DPR’s key advantage: learns to retrieve what actually helps answer questions, not just lexically similar text.
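Stripped of the model, the retrieval step itself is just a dot-product ranking. A dependency-free sketch with made-up 3-d embeddings (stand-ins for real 768-d DPR vectors):

```python
# Toy dense retrieval: rank documents by dot product with the query vector.
# The 3-d vectors are invented stand-ins for real 768-d DPR embeddings.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [0.9, 0.1, 0.0]
docs = {
    "doc_paris":  [1.0, 0.0, 0.1],  # semantically close to the query
    "doc_berlin": [0.2, 0.9, 0.0],
    "doc_tokyo":  [0.1, 0.1, 1.0],
}

# Sort document IDs by descending similarity to the query
ranked = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
print(ranked)  # doc_paris scores highest (0.9), then doc_berlin (0.27), doc_tokyo (0.10)
```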
Key Results
The paper tested on Open-Domain Question Answering benchmarks:
| Benchmark | Previous SOTA | RAG |
|---|---|---|
| NaturalQuestions (Exact Match) | 41.5% | 44.5% (new SOTA) |
| WebQuestions (Exact Match) | 42.7% | 45.5% |
| TriviaQA (Exact Match) | 67.7% | 68.0% |
| MS-MARCO (BLEU-1) | — | 62.5% |
RAG matched or exceeded SOTA on most benchmarks while having the critical advantage of updatable knowledge — you can update the document store without retraining the model.
Additionally, RAG hallucinated less and produced more specific and factually accurate text than purely parametric generators.
What Changed Between 2020 and Today
The 2020 paper used BERT-based dense retrieval + BART generation. Modern RAG systems differ significantly:
| Aspect | Paper (2020) | Modern RAG (2024+) |
|---|---|---|
| Generator | BART (400M params) | GPT-4, Claude (orders of magnitude larger) |
| Retriever | DPR (BERT-based) | BGE, E5, Cohere Embed |
| Vector DB | FAISS (in-memory) | Pinecone, Weaviate, pgvector |
| Retrieval | Dense only | Hybrid (dense + BM25) |
| Reranking | None | Cross-encoder rerankers |
| Chunking | Fixed sentences | Semantic, hierarchical |
| Integration | End-to-end trained | Modular, no joint training |
The key architectural difference: modern RAG is modular. The retriever and generator are not jointly trained — you plug in any embedding model and any LLM. This is far more practical and achieves better results due to better base models.
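A minimal sketch of that modularity (the `embed` and `generate` functions here are trivial stand-ins for illustration — a real system would plug in an embedding model and an LLM API behind the same interfaces):

```python
# Modular RAG pipeline sketch: retriever and generator are swappable functions.

def embed(text: str) -> dict:
    # Stand-in embedder: bag-of-words counts. A real system would call
    # an embedding model (e.g., BGE or E5) here instead.
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def similarity(a: dict, b: dict) -> float:
    return sum(a[w] * b.get(w, 0) for w in a)

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(corpus, key=lambda d: similarity(q, embed(d)), reverse=True)[:top_k]

def generate(query: str, context: list[str]) -> str:
    # Stand-in generator: a real system would prompt an LLM with the
    # query plus the retrieved context.
    return f"Answer to {query!r} using {len(context)} retrieved passages."

corpus = [
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
    "Paris is the capital of France.",
]
context = retrieve("capital of France", corpus)
print(generate("capital of France", context))
```

Because the two stages only communicate through plain function calls, either one can be replaced without retraining anything — the opposite of the paper's jointly trained setup.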
The RAG vs. Fine-tuning Debate
The paper’s finding on this is nuanced:
“RAG outperforms parametric-only seq2seq models… and provides more specific, consistent, and factual generations than a state-of-the-art parametric-only counterpart.”
But the paper also shows that for tasks where the LLM already has strong parametric knowledge, fine-tuning can compete with RAG.
The practical guidance that emerged from this work (and subsequent research):
- RAG: for dynamic knowledge, external documents, citation-required tasks
- Fine-tuning: for style, format, domain-specific behavior
- Both: for the best results in production systems
Influence on Modern AI
The RAG architecture introduced in this paper is now ubiquitous:
- ChatGPT with web browsing — essentially RAG with Bing
- GitHub Copilot — RAG over your codebase
- Perplexity AI — RAG over the web, with citations
- Enterprise AI — RAG over company documents (the #1 LLM use case in 2024-2025)
- Every LlamaIndex / LangChain tutorial — RAG pipeline
The paper’s terminology (retriever, generator, RAG) became the standard vocabulary of the field.
Frequently Asked Questions
Is RAG still relevant when models have million-token context windows?
Largely yes. Even with 1M-token windows, RAG is more cost-effective than stuffing everything into context. At scale, retrieving 5 relevant chunks is far cheaper than processing thousands of pages per query. For narrow use cases (e.g., question answering over a single document), long context can replace retrieval.
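A rough back-of-the-envelope comparison makes the cost argument concrete (the price and token counts below are illustrative assumptions, not quoted rates):

```python
# Illustrative cost comparison: long-context stuffing vs. retrieval.
# All numbers are assumptions for the sake of the arithmetic.

price_per_1k_tokens = 0.01        # hypothetical input price, dollars per 1k tokens

full_corpus_tokens = 500_000      # stuffing an entire document collection into context
rag_chunk_tokens = 5 * 500        # RAG: 5 retrieved chunks of ~500 tokens each

cost_long_context = full_corpus_tokens / 1000 * price_per_1k_tokens
cost_rag = rag_chunk_tokens / 1000 * price_per_1k_tokens

print(f"long context: ${cost_long_context:.2f} per query")   # $5.00
print(f"RAG:          ${cost_rag:.4f} per query")            # $0.0250
print(f"ratio: {cost_long_context / cost_rag:.0f}x")         # 200x
```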
What is the difference between RAG and search?
Traditional search returns ranked results for humans to read. RAG retrieves documents and has an LLM synthesize an answer. RAG produces a direct answer with source grounding; search requires the user to find the answer themselves.
Can RAG work without training the retriever?
Yes — modern RAG typically uses pre-trained embedding models (no fine-tuning needed). The 2020 paper trained DPR end-to-end with the generator, but this is rarely done today. Off-the-shelf embedding models work well enough for most applications.
What is “advanced RAG” vs. basic RAG?
Basic RAG: embed → retrieve → generate. Advanced RAG adds: query expansion, reranking, hybrid search, iterative retrieval, and post-retrieval filtering. LlamaIndex has built-in support for many of these.
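One of those additions, hybrid search, is often implemented by merging the dense and BM25 rankings with reciprocal rank fusion (RRF). A minimal sketch (the document IDs and rankings are made up):

```python
# Reciprocal rank fusion: merge ranked lists into one hybrid ranking.
# score(d) = sum over lists of 1 / (k + rank_of_d_in_list), k = 60 by convention.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d3", "d1", "d2"]  # hypothetical dense-retrieval ranking
bm25_hits = ["d1", "d4", "d3"]   # hypothetical BM25 ranking

print(rrf([dense_hits, bm25_hits]))  # d1 wins: it ranks high in both lists
```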
Does the original Facebook code still work?
The Hugging Face Transformers library has an implementation (RagSequenceForGeneration) that preserves the original architecture. However, for production systems, you’d build a modern modular RAG pipeline using LlamaIndex or LangChain.
Next Steps
- What Is RAG? — Conceptual overview with working code
- LlamaIndex Advanced Retrieval Techniques — Production RAG with reranking and hybrid search
- Pinecone vs Weaviate — Choose a vector database for your RAG system