Document Ingestion: The Foundation of RAG
A RAG pipeline is only as good as the documents feeding it. LlamaIndex excels at document ingestion — it ships with readers for 40+ file formats and a premium parser (LlamaParse) for complex PDFs. Ingesting your data correctly has a dramatic effect on retrieval quality.
This guide covers every major ingestion scenario, from local files to web scraping to cloud storage.
SimpleDirectoryReader: Load Any Directory
SimpleDirectoryReader is the workhorse for local file ingestion. It auto-detects file types:
from llama_index.core import SimpleDirectoryReader
# Load everything in a folder (auto-detects .txt, .pdf, .docx, .md, .csv, etc.)
documents = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(documents)} document(s)")
# Load specific file types only
pdf_docs = SimpleDirectoryReader(
    "data/",
    required_exts=[".pdf"],
    recursive=True,  # include subdirectories
).load_data()
# Load specific files
selected = SimpleDirectoryReader(
    input_files=["data/manual.pdf", "data/faq.md", "data/products.csv"]
).load_data()
Every document gets metadata automatically:
for doc in documents:
    print(doc.metadata)
    # {'file_path': 'data/report.pdf', 'file_name': 'report.pdf', 'file_type': 'application/pdf', ...}
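Before indexing, it's worth sanity-checking what you actually loaded. One quick way (plain Python over the metadata fields shown above; the `Doc` stand-in class below is just for illustration) is to count documents per file type:

```python
from collections import Counter

def summarize_by_type(documents):
    """Count loaded documents per file_type metadata value."""
    return Counter(doc.metadata.get("file_type", "unknown") for doc in documents)

# Stand-in objects for illustration; real LlamaIndex Documents expose .metadata the same way
class Doc:
    def __init__(self, metadata):
        self.metadata = metadata

docs = [
    Doc({"file_type": "application/pdf"}),
    Doc({"file_type": "text/markdown"}),
    Doc({"file_type": "application/pdf"}),
]
print(summarize_by_type(docs))  # Counter({'application/pdf': 2, 'text/markdown': 1})
```

A surprising count (e.g. zero PDFs when you expected dozens) usually means a wrong path or a missing `recursive=True`.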
Loading Web Pages
from llama_index.readers.web import SimpleWebPageReader
# Scrape and load web pages
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data([
    "https://docs.langchain.com/",
    "https://docs.crewai.com/",
])
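If you're curious what `html_to_text=True` is doing, the gist — sketched here with the standard library, not the reader's actual implementation — is stripping tags and keeping only visible text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Docs</h1><script>var x=1;</script><p>Hello</p></body></html>")
print(" ".join(extractor.parts))  # Docs Hello
```

Plain text embeds much better than raw HTML, which is why you almost always want this flag on.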
Loading from Databases
from llama_index.readers.database import DatabaseReader
reader = DatabaseReader(
    uri="postgresql://user:pass@localhost/mydb"
)
# Load from a SQL query
docs = reader.load_data(
    query="SELECT id, title, content, updated_at FROM articles WHERE active = true"
)
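Each returned row becomes one Document whose text is roughly the row's columns serialized as `key: value` pairs. This is a simplification of what DatabaseReader produces (exact formatting may differ), but it shows what ends up getting embedded:

```python
def row_to_text(columns, row):
    """Serialize one SQL row into a flat text string for embedding."""
    return ", ".join(f"{col}: {val}" for col, val in zip(columns, row))

columns = ["id", "title", "content"]
row = (42, "Onboarding guide", "Welcome to the team...")
print(row_to_text(columns, row))
# id: 42, title: Onboarding guide, content: Welcome to the team...
```

Including descriptive columns like `title` in your SELECT therefore directly improves retrieval, since they become part of the embedded text.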
LlamaParse: High-Quality PDF Parsing
Standard PDF readers lose tables, columns, and formatting. LlamaParse (from LlamaIndex) uses a specialized model to extract structure from complex PDFs:
pip install llama-parse
export LLAMA_CLOUD_API_KEY="llx-your-key" # free tier available
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
# Parse a complex PDF (financial report, technical manual, etc.)
parser = LlamaParse(
    result_type="markdown",  # markdown preserves structure
    # result_type="text",    # plain text, faster
    verbose=True,
)
documents = parser.load_data("data/annual_report.pdf")
# Build index as normal
index = VectorStoreIndex.from_documents(documents)
LlamaParse correctly handles:
- Multi-column layouts
- Tables (converted to markdown tables)
- Charts (described in text)
- Headers and footers
- Page numbers
Custom Metadata Enrichment
Add custom metadata to documents for better retrieval filtering:
from llama_index.core import SimpleDirectoryReader

def add_custom_metadata(documents: list) -> list:
    enriched = []
    for doc in documents:
        # Derive department from the filename
        filename = doc.metadata.get("file_name", "")
        if "finance" in filename:
            department = "finance"
        elif "hr" in filename:
            department = "hr"
        else:
            department = "general"
        # Attach custom metadata
        doc.metadata["department"] = department
        doc.metadata["indexed_at"] = "2026-04-08"
        enriched.append(doc)
    return enriched
documents = SimpleDirectoryReader("data/").load_data()
documents = add_custom_metadata(documents)
# Now you can filter by metadata at query time
index = VectorStoreIndex.from_documents(documents)
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

retriever = index.as_retriever(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="department", value="finance")]
    )
)
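Conceptually, a metadata filter just keeps the nodes whose metadata matches before similarity ranking happens. A plain-Python sketch of that behavior (not LlamaIndex internals):

```python
def apply_filters(nodes_metadata, filters):
    """Keep only items whose metadata satisfies every (key, value) filter."""
    return [
        meta for meta in nodes_metadata
        if all(meta.get(key) == value for key, value in filters.items())
    ]

nodes = [
    {"department": "finance", "file_name": "finance_q3.pdf"},
    {"department": "hr", "file_name": "hr_policy.pdf"},
]
print(apply_filters(nodes, {"department": "finance"}))
# [{'department': 'finance', 'file_name': 'finance_q3.pdf'}]
```

Because filtering narrows the candidate pool before ranking, good metadata is often a cheaper win than a better embedding model.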
Chunking Strategy: Choosing the Right Splitter
How you split documents is critical. Smaller chunks = more precise retrieval. Larger chunks = more context per answer.
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import (
    SentenceSplitter,            # default: split by sentence
    SemanticSplitterNodeParser,  # split by semantic meaning (best quality)
    MarkdownNodeParser,          # split by markdown headers
    CodeSplitter,                # split code by function/class
)
from llama_index.embeddings.openai import OpenAIEmbedding
# Option 1: Sentence splitter (fast, good default)
sentence_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
# Option 2: Semantic splitter (best quality, slower)
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)
# Option 3: Markdown splitter (for docs with headers)
md_splitter = MarkdownNodeParser()
# Apply when building index
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[sentence_splitter],
)
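To make the size/overlap tradeoff concrete, here's a minimal fixed-size splitter in plain Python. It's character-based for simplicity (SentenceSplitter counts tokens and respects sentence boundaries), but the overlap mechanics are the same:

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size chunks, repeating `overlap` chars between neighbors."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "a" * 100  # stand-in for a real document
chunks = split_with_overlap(text, chunk_size=40, overlap=10)
print([len(c) for c in chunks])  # [40, 40, 40, 10]
```

The overlap means a sentence falling on a chunk boundary appears whole in at least one chunk, at the cost of embedding some text twice.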
Incremental Updates: Add New Documents Without Re-Indexing
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
# First time: build and save
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist("./storage")
# Later: load existing index
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
# Add only new documents (no re-embedding of existing ones)
new_docs = SimpleDirectoryReader("data/new/").load_data()
for doc in new_docs:
    index.insert(doc)
# Save updated index
index.storage_context.persist("./storage")
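To feed only genuinely new files into `index.insert`, compare what's on disk against the paths you've already indexed. A sketch (it assumes you track indexed paths yourself, e.g. collected from document metadata; the temporary directory below is just for the demo):

```python
import tempfile
from pathlib import Path

def find_new_files(data_dir, indexed_paths):
    """Return files under data_dir that aren't in the already-indexed set."""
    on_disk = {str(p) for p in Path(data_dir).rglob("*") if p.is_file()}
    return sorted(on_disk - set(indexed_paths))

# Demo with a throwaway directory
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "old.txt").write_text("already indexed")
    (Path(d) / "new.txt").write_text("not yet indexed")
    new = find_new_files(d, indexed_paths=[str(Path(d) / "old.txt")])
    print([Path(p).name for p in new])  # ['new.txt']
```

Pass only the resulting paths to SimpleDirectoryReader's `input_files` so existing documents are never re-embedded.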
Document Transformations Pipeline
Chain multiple transformations before indexing:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor, QuestionsAnsweredExtractor
from llama_index.embeddings.openai import OpenAIEmbedding
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=50),
        TitleExtractor(),                         # extract a title for each chunk
        QuestionsAnsweredExtractor(questions=3),  # generate questions each chunk answers
        OpenAIEmbedding(model="text-embedding-3-small"),
    ]
)
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes)
QuestionsAnsweredExtractor is powerful — it generates the questions each chunk answers. This dramatically improves retrieval for question-answering use cases.
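The extractor stores its output in node metadata (under a key like `questions_this_excerpt_can_answer`), and metadata is included in the text that gets embedded, so a user's query can match the generated questions as well as the chunk body. A rough sketch of the effect (not LlamaIndex's exact serialization):

```python
def embedding_text(chunk_text, metadata):
    """Mimic how node metadata is prepended to chunk text before embedding."""
    meta = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{meta}\n\n{chunk_text}"

chunk = "Refunds are available within 30 days of purchase."
metadata = {
    "questions_this_excerpt_can_answer": "1. What is the refund window? 2. How do I request a refund?",
}
print(embedding_text(chunk, metadata))
```

A query like "what is the refund window" now matches the generated question almost verbatim, even though the chunk itself never uses the word "window".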
Frequently Asked Questions
How do I handle scanned PDFs (images of text)?
LlamaParse applies OCR automatically when it detects scanned (image-based) pages. For non-LlamaParse workflows, use pytesseract or unstructured:
pip install llama-index-readers-file "unstructured[pdf]"
from llama_index.readers.file import UnstructuredReader
reader = UnstructuredReader()
docs = reader.load_data(file="scanned.pdf")
What’s the best chunk size for different use cases?
- Q&A / chatbots: 256–512 tokens (precise retrieval)
- Summarization: 1024–2048 tokens (more context)
- Code documentation: split with CodeSplitter by function/class
- Legal/technical docs: 512–1024 tokens with overlap of ~100
Test different sizes on your data — there’s no universal answer.
How do I handle documents that update frequently?
Use a stable document ID scheme so you can delete the old version and insert the new one. Loading with filename_as_id=True derives each document's ID from its file path:
from llama_index.core import SimpleDirectoryReader
# Remove the stale version by its ref doc ID (the file path)
index.delete_ref_doc("data/manual.pdf", delete_from_docstore=True)
new_doc = SimpleDirectoryReader(
    input_files=["data/manual.pdf"], filename_as_id=True
).load_data()[0]
index.insert(new_doc)
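For frequently changing files, a content hash tells you whether a file actually changed before you pay for re-embedding. A sketch (the registry here is an assumed in-memory dict; in practice you'd persist it alongside the index, and the temporary file is just for the demo):

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path):
    """Stable fingerprint of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def needs_reindex(path, registry):
    """True if the file is new or changed since it was last indexed."""
    digest = file_digest(path)
    changed = registry.get(str(path)) != digest
    registry[str(path)] = digest
    return changed

# Demo with a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("version 1")
    path = f.name

registry = {}
print(needs_reindex(path, registry))  # True: never seen before
print(needs_reindex(path, registry))  # False: content unchanged
Path(path).write_text("version 2")
print(needs_reindex(path, registry))  # True: content changed
```

Only when `needs_reindex` returns True do you run the delete-and-insert cycle shown above.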
Can I load from S3 or Google Cloud Storage?
Yes:
pip install llama-index-readers-s3
from llama_index.readers.s3 import S3Reader
reader = S3Reader(
    bucket="my-bucket",
    prefix="docs/",
    aws_access_id="...",
    aws_access_secret="...",
)
docs = reader.load_data()
Next Steps
- Getting Started with LlamaIndex — Build your first RAG pipeline
- LlamaIndex Advanced Retrieval Techniques — Optimize retrieval quality after loading
- Pinecone vs Weaviate — Choose a production vector store for your index