
LlamaIndex Document Parsing: Load Any File into Your RAG Pipeline

#llamaindex #document-parsing #pdf #ingestion #simpledirectoryreader #llama-parse

Document Ingestion: The Foundation of RAG

A RAG pipeline is only as good as the documents feeding it. LlamaIndex excels at document ingestion — it ships with readers for 40+ file formats and a premium parser (LlamaParse) for complex PDFs. Getting your data in correctly dramatically affects retrieval quality.

This guide covers every major ingestion scenario, from local files to web scraping to cloud storage.

SimpleDirectoryReader: Load Any Directory

SimpleDirectoryReader is the workhorse for local file ingestion. It auto-detects file types:

from llama_index.core import SimpleDirectoryReader

# Load everything in a folder (auto-detects .txt, .pdf, .docx, .md, .csv, etc.)
documents = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(documents)} document(s)")

# Load specific file types only
pdf_docs = SimpleDirectoryReader(
    "data/",
    required_exts=[".pdf"],
    recursive=True,  # include subdirectories
).load_data()

# Load specific files
selected = SimpleDirectoryReader(
    input_files=["data/manual.pdf", "data/faq.md", "data/products.csv"]
).load_data()

Every document gets metadata automatically:

for doc in documents:
    print(doc.metadata)
    # {'file_path': 'data/report.pdf', 'file_name': 'report.pdf', 'file_type': 'application/pdf', ...}

Loading Web Pages

from llama_index.readers.web import SimpleWebPageReader

# Scrape and load web pages
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data([
    "https://docs.langchain.com/",
    "https://docs.crewai.com/",
])
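The html_to_text=True flag strips markup so only the visible text gets indexed. Conceptually it does something like this minimal stdlib sketch (a toy illustration, not the reader's actual implementation):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(html_to_text("<h1>Docs</h1><script>x=1</script><p>Hello world</p>"))
# -> Docs Hello world
```

Without this step, raw HTML tags and inline JavaScript would end up in your embeddings and pollute retrieval.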

Loading from Databases

from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(
    uri="postgresql://user:pass@localhost/mydb"
)

# Load from a SQL query
docs = reader.load_data(
    query="SELECT id, title, content, updated_at FROM articles WHERE active = true"
)
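Each row the query returns becomes one Document, with the selected columns flattened into the document text. A rough picture of that row-to-text step (the formatting here is illustrative, not the reader's exact output):

```python
def rows_to_texts(columns, rows):
    """Flatten each SQL row into one 'col: value' text blob, one per future Document."""
    return [
        ", ".join(f"{col}: {val}" for col, val in zip(columns, row))
        for row in rows
    ]

columns = ["id", "title", "content"]
rows = [(1, "Intro", "Welcome"), (2, "Setup", "Install steps")]

print(rows_to_texts(columns, rows)[0])
# -> id: 1, title: Intro, content: Welcome
```

Keep this in mind when writing the query: every column you SELECT ends up in the embedded text, so exclude IDs and timestamps you don't want matched against.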

LlamaParse: High-Quality PDF Parsing

Standard PDF readers lose tables, columns, and formatting. LlamaParse (from LlamaIndex) uses a specialized model to extract structure from complex PDFs:

pip install llama-parse
export LLAMA_CLOUD_API_KEY="llx-your-key"  # free tier available
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex

# Parse a complex PDF (financial report, technical manual, etc.)
parser = LlamaParse(
    result_type="markdown",   # markdown preserves structure
    # result_type="text",     # plain text, faster
    verbose=True,
)

documents = parser.load_data("data/annual_report.pdf")

# Build index as normal
index = VectorStoreIndex.from_documents(documents)

LlamaParse correctly handles:

  • Multi-column layouts
  • Tables (converted to markdown tables)
  • Charts (described in text)
  • Headers and footers
  • Page numbers
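Because LlamaParse can emit tables as markdown, downstream code can recover rows and columns cheaply. A small sketch of turning one well-formed markdown table back into records (a simple parser for illustration, assuming no escaped pipes):

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Convert a markdown table into a list of row dicts keyed by header."""
    lines = [ln.strip() for ln in md.strip().splitlines() if ln.strip()]
    split = lambda ln: [cell.strip() for cell in ln.strip("|").split("|")]
    header = split(lines[0])
    rows = [split(ln) for ln in lines[2:]]  # lines[1] is the |---|---| divider
    return [dict(zip(header, row)) for row in rows]

table = """
| Quarter | Revenue |
|---------|---------|
| Q1      | 1.2M    |
| Q2      | 1.5M    |
"""
print(parse_markdown_table(table)[1]["Revenue"])  # -> 1.5M
```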

Custom Metadata Enrichment

Add custom metadata to documents for better retrieval filtering:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

def add_custom_metadata(documents: list) -> list:
    enriched = []
    for doc in documents:
        # Add department based on filename
        filename = doc.metadata.get("file_name", "")
        if "finance" in filename:
            department = "finance"
        elif "hr" in filename:
            department = "hr"
        else:
            department = "general"

        # Add metadata
        doc.metadata["department"] = department
        doc.metadata["indexed_at"] = "2026-04-08"
        enriched.append(doc)
    return enriched

documents = SimpleDirectoryReader("data/").load_data()
documents = add_custom_metadata(documents)

# Now you can filter by metadata at query time
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(
    filters=MetadataFilters(filters=[MetadataFilter(key="department", value="finance")])
)

Chunking Strategy: Choosing the Right Splitter

How you split documents is critical. Smaller chunks = more precise retrieval. Larger chunks = more context per answer.

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import (
    SentenceSplitter,         # default: split by sentence
    SemanticSplitterNodeParser,  # split by semantic meaning (best quality)
    MarkdownNodeParser,       # split by markdown headers
    CodeSplitter,             # split code by function/class
)
from llama_index.embeddings.openai import OpenAIEmbedding

# Option 1: Sentence splitter (fast, good default)
sentence_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

# Option 2: Semantic splitter (best quality, slower)
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)

# Option 3: Markdown splitter (for docs with headers)
md_splitter = MarkdownNodeParser()

# Apply when building index
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[sentence_splitter],
)
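To make chunk_size and chunk_overlap concrete, here is what a fixed-size splitter does under the hood (a toy character-based version; SentenceSplitter additionally respects sentence boundaries and counts tokens, not characters):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size chunking: each chunk repeats the tail of the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 10  # 100 characters
chunks = chunk_text(text, chunk_size=40, chunk_overlap=10)

print(len(chunks))                          # 4 chunks
print(chunks[0][-10:] == chunks[1][:10])    # True: overlap preserves context across the cut
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from both sides.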

Incremental Updates: Add New Documents Without Re-Indexing

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex, load_index_from_storage

# First time: build and save
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist("./storage")

# Later: load existing index
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Add only new documents (no re-embedding of existing ones)
new_docs = SimpleDirectoryReader("data/new/").load_data()
for doc in new_docs:
    index.insert(doc)

# Save updated index
index.storage_context.persist("./storage")
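In practice you also need to decide which files count as "new". One approach (plain stdlib bookkeeping, not a LlamaIndex API) is to compare modification times against the timestamp of the last indexing run:

```python
import os

def files_newer_than(directory: str, last_indexed: float) -> list[str]:
    """Return paths of files modified after the last indexing timestamp."""
    new = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_indexed:
                new.append(path)
    return sorted(new)
```

Feed the returned paths to SimpleDirectoryReader(input_files=...) and insert() only those, then record the new timestamp alongside the persisted index.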

Document Transformations Pipeline

Chain multiple transformations before indexing:

from llama_index.core import VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor, QuestionsAnsweredExtractor
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=50),
        TitleExtractor(),           # extract title for each chunk
        QuestionsAnsweredExtractor(questions=3),  # generate Q&A pairs
        OpenAIEmbedding(model="text-embedding-3-small"),
    ]
)

nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes)

QuestionsAnsweredExtractor is powerful — it generates the questions each chunk answers. This dramatically improves retrieval for question-answering use cases.
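The intuition: a user's query often shares far more vocabulary with a generated question than with the chunk itself. A toy word-overlap score makes this visible (real retrieval compares embeddings, not words, but the effect is the same):

```python
def overlap(query: str, text: str) -> float:
    """Jaccard word overlap: a toy stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)

chunk = "Refunds are processed within 5 business days of receipt."
questions = "How long does a refund take? When will I get my money back?"

query = "how long does a refund take"
print(overlap(query, chunk) < overlap(query, chunk + " " + questions))  # True
```

The chunk alone barely matches the query's wording; with the generated questions attached, the match is much stronger.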

Frequently Asked Questions

How do I handle scanned PDFs (images of text)?

LlamaParse handles OCR automatically — it recognizes scanned documents and applies OCR. For non-LlamaParse workflows, use pytesseract or unstructured:

pip install unstructured[pdf]
from llama_index.readers.file import UnstructuredReader
reader = UnstructuredReader()
docs = reader.load_data(file="scanned.pdf")

What’s the best chunk size for different use cases?

  • Q&A / chatbots: 256–512 tokens (precise retrieval)
  • Summarization: 1024–2048 tokens (more context)
  • Code documentation: Use CodeSplitter by function
  • Legal/technical docs: 512–1024 tokens with overlap 100

Test different sizes on your data — there’s no universal answer.
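Since chunk_size is measured in tokens, a quick sanity check is estimating how many chunks a document will produce at a given setting (using the rough ~4 characters per token rule for English; an approximation, not a real tokenizer):

```python
import math

def approx_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def approx_chunk_count(n_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """How many chunks a document of n_tokens will roughly produce."""
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((n_tokens - chunk_size) / step)

# A ~20,000-token report at Q&A-style settings:
print(approx_chunk_count(20_000, chunk_size=512, chunk_overlap=50))  # -> 44
```

Chunk count drives both embedding cost and index size, so it's worth estimating before you commit to a setting.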

How do I handle documents that update frequently?

Use a document ID scheme that lets you delete and re-insert:

from llama_index.core import SimpleDirectoryReader

# Delete the old version by its doc_id, then re-insert the updated file
index.delete_ref_doc("data/manual.pdf", delete_from_docstore=True)

new_doc = SimpleDirectoryReader(input_files=["data/manual.pdf"]).load_data()[0]
new_doc.id_ = "data/manual.pdf"  # file path as the stable ID for the next update
index.insert(new_doc)
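To avoid re-embedding documents whose content hasn't actually changed, a content hash works as a cheap change detector (plain stdlib bookkeeping, not a LlamaIndex feature):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

seen: dict[str, str] = {}  # doc_id -> hash at last indexing

def needs_reindex(doc_id: str, text: str) -> bool:
    """True if this doc is new or its content changed since last indexing."""
    h = content_hash(text)
    if seen.get(doc_id) == h:
        return False
    seen[doc_id] = h
    return True

print(needs_reindex("data/manual.pdf", "v1 text"))  # True  (first sighting)
print(needs_reindex("data/manual.pdf", "v1 text"))  # False (unchanged)
print(needs_reindex("data/manual.pdf", "v2 text"))  # True  (content changed)
```

Run this check before delete_ref_doc/insert so unchanged files skip the embedding step entirely.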

Can I load from S3 or Google Cloud Storage?

Yes:

pip install llama-index-readers-s3
from llama_index.readers.s3 import S3Reader
reader = S3Reader(bucket="my-bucket", prefix="docs/", aws_access_id="...", aws_access_secret="...")
docs = reader.load_data()
