What Is LlamaIndex?
LlamaIndex is an open-source Python framework designed to connect large language models (LLMs) to your own data. While a bare LLM only knows what it was trained on, LlamaIndex lets you ingest your own data — text files, PDFs, websites, databases — and query it in natural language.
The core use case is Retrieval-Augmented Generation (RAG): instead of fine-tuning a model, you store your documents in a vector index and retrieve only the relevant chunks at query time. This keeps answers grounded in your actual data, reduces hallucinations, and works with any LLM provider.
LlamaIndex is the go-to choice when you need to build production RAG pipelines or AI agents that reason over large document collections.
Installing LlamaIndex
You need Python 3.9 or newer and an OpenAI API key. (The examples below use OpenAI, but any supported LLM provider works.)
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
Set your API key as an environment variable:
export OPENAI_API_KEY="sk-your-key-here"
On Windows:
set OPENAI_API_KEY=sk-your-key-here
Core Concepts
Before writing code, understand the four building blocks:
1. Documents — Raw text loaded from files, URLs, or databases. LlamaIndex ships with dozens of readers (PDF, Word, Notion, Slack, etc.).
2. Nodes — Chunks of a Document after parsing. The chunking strategy directly affects retrieval quality.
3. Index — A data structure (usually a vector store) that makes documents searchable. VectorStoreIndex embeds every node and stores vectors in memory by default.
4. Query Engine — Accepts a natural-language question, retrieves the top-k relevant nodes, and synthesizes an answer using your LLM.
Configuring the LLM and Embeddings
LlamaIndex uses a global Settings object to configure the default LLM and embedding model:
import os
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"
# Set the language model
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
# Set the embedding model
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Setting these globally means every index and query engine you create will use them automatically — no need to pass them everywhere.
Loading Documents
Create a directory called data/ and drop any .txt or .pdf files in it. Then load them with SimpleDirectoryReader:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
print(f"Loaded {len(documents)} document(s)")
SimpleDirectoryReader automatically handles .txt, .pdf, .docx, .md, and many other formats. Each file becomes one or more Document objects.
Building the Index
Pass the documents to VectorStoreIndex. This embeds every chunk and stores the vectors in memory:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
Behind the scenes, LlamaIndex:
- Splits each document into nodes (default: 1024-token chunks with 200-token overlap)
- Calls the embedding model for each node
- Stores (embedding, text, metadata) triples in a vector store
For small datasets this happens in seconds. The index is held in memory by default.
Querying the Index
Create a query engine and ask it a question:
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics covered?")
print(response)
That’s it — a working RAG pipeline in under 20 lines of Python.
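The response object also carries the retrieved chunks, which is handy for showing citations or debugging retrieval. A continuation sketch of the query above (attribute names as exposed by llama-index-core; `file_name` metadata is added by SimpleDirectoryReader):

```python
response = query_engine.query("What are the main topics covered?")

# Each source node carries the chunk text, its similarity score, and metadata
for node in response.source_nodes:
    print(f"score={node.score:.3f}  file={node.metadata.get('file_name')}")
    print(node.text[:100], "...")
```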
Complete Working Example
Here is the full script from scratch:
import os
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# 1. Configure models
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# 2. Load documents
documents = SimpleDirectoryReader("data").load_data()
print(f"Loaded {len(documents)} document(s)")
# 3. Build index
index = VectorStoreIndex.from_documents(documents)
# 4. Query
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the key points in these documents.")
print(response)
Save the script as rag_pipeline.py, then run it:
python rag_pipeline.py
Persisting the Index
Rebuilding the index on every run is slow and wastes API credits. Persist it to disk and reload:
from llama_index.core import StorageContext, load_index_from_storage
# Save
index.storage_context.persist(persist_dir="./storage")
# Later: reload without re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
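A common pattern is to rebuild only when no saved index exists. A sketch, assuming the data/ directory and the model configuration from earlier:

```python
import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if os.path.exists(PERSIST_DIR):
    # Reload from disk: no embedding calls, no API cost
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    # First run: embed everything, then save for next time
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
```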
Adjusting Retrieval Quality
Two parameters control retrieval quality:
- similarity_top_k — How many chunks to retrieve per query. Default is 2. Increase to 5–10 for more comprehensive answers.
- chunk_size — Node size in tokens. Smaller chunks (256–512) improve precision; larger chunks (1024–2048) preserve more surrounding context.
from llama_index.core.node_parser import SentenceSplitter
# Use 512-token chunks with 50-token overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
# Retrieve top 5 chunks per query
query_engine = index.as_query_engine(similarity_top_k=5)
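Alternatively, chunking can be set once via the global Settings object, so every index built afterwards picks it up without passing a splitter each time (a config sketch):

```python
from llama_index.core import Settings

# Global defaults applied to any index built after this point
Settings.chunk_size = 512
Settings.chunk_overlap = 50
```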
Frequently Asked Questions
Is LlamaIndex the same as LangChain?
They overlap but have different strengths. LlamaIndex is purpose-built for RAG and data indexing — it has more data connectors, indexing strategies, and retrieval primitives. LangChain is a broader agent-orchestration framework. Many developers use both: LlamaIndex for retrieval, LangChain for agent orchestration. See LlamaIndex vs LangChain for RAG for a detailed comparison.
Does LlamaIndex only work with OpenAI?
No. LlamaIndex supports Anthropic Claude, Cohere, Mistral, Ollama (local models), HuggingFace, and dozens more. Swap OpenAI() for Anthropic() from llama-index-llms-anthropic and the rest of the pipeline stays the same.
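For example, pointing the same pipeline at a local model served by Ollama (assumes pip install llama-index-llms-ollama and a running Ollama daemon with the model pulled):

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# No API key required; inference runs on the local Ollama server
Settings.llm = Ollama(model="llama3", request_timeout=120.0)
```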
How do I use a persistent vector database like Pinecone?
Replace the default in-memory store with the Pinecone integration (pip install llama-index-vector-stores-pinecone). The vector store is wired in through a StorageContext:
from pinecone import Pinecone
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.pinecone import PineconeVectorStore
pc = Pinecone(api_key="your-key")
pinecone_index = pc.Index("my-index")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
What is the difference between as_query_engine() and as_chat_engine()?
as_query_engine() treats each call independently — no memory of previous questions. as_chat_engine() maintains conversation history and is better for chatbot use cases:
chat_engine = index.as_chat_engine(chat_mode="context")
response = chat_engine.chat("What does document 2 say about pricing?")
follow_up = chat_engine.chat("Can you elaborate on that?") # remembers context
How much does it cost to index 100 PDF pages?
With text-embedding-3-small ($0.02 per million tokens), 100 pages (~50,000 tokens) costs roughly $0.001. Query costs depend on answer length; gpt-4o-mini at ~$0.15/million input tokens is very affordable for RAG workloads.
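The arithmetic behind that estimate, as a sketch you can adapt for your own workloads (prices as quoted above; verify current pricing before relying on it):

```python
# Back-of-envelope embedding cost for text-embedding-3-small
price_per_million_tokens = 0.02  # USD, per the figure quoted above
tokens = 100 * 500               # ~100 pages at roughly 500 tokens/page

cost = tokens / 1_000_000 * price_per_million_tokens
print(f"${cost:.4f}")  # → $0.0010
```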
Next Steps
Now that you have a working RAG pipeline, explore these natural next steps:
- LangChain Agents and Tools — Add tool use and agent reasoning on top of retrieval
- Introduction to LangChain — Understand how LangChain and LlamaIndex complement each other
- LlamaIndex vs LangChain for RAG — Compare the two frameworks side-by-side