If you’ve been exploring Retrieval-Augmented Generation (RAG), you’ve likely hit a wall: how do you store and search knowledge in a way that an LLM can actually use? This guide, “Introduction to Vector Databases: Storing and Retrieving Data for RAG,” answers that question from the ground up, walking you through the concepts, the tooling, and a complete working implementation you can run today. By the end, you’ll have a local vector store loaded with documents and a retrieval pipeline ready to plug into a RAG workflow.
What Is a Vector Database and Why Does RAG Need One?
A vector database is a specialized data store designed to hold and search embedding vectors — high-dimensional numerical representations of text, images, or other data. Unlike a traditional SQL database that matches rows by exact values, a vector database finds records by semantic similarity.
Here’s why that matters for RAG. When a user asks “What are the side effects of ibuprofen?”, a keyword search might miss documents that say “adverse reactions to NSAIDs.” An embedding-based search converts both the query and the documents into vectors in the same semantic space, then retrieves the documents whose vectors are closest to the query vector — regardless of exact wording.
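The core operation is called approximate nearest neighbor (ANN) search. Instead of comparing the query against every stored vector, an ANN index (Chroma uses HNSW) examines only a small set of promising candidates, trading a little recall for retrieval that stays fast even across millions of documents.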
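To make “closest vectors” concrete, here is a minimal sketch of exact nearest-neighbor search by cosine similarity. The three-dimensional vectors and their labels are made up for illustration (real embeddings have hundreds of dimensions), and this brute-force scan is the computation an ANN index approximates, not how any particular database implements search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings": the first two point in similar directions.
toy_vectors = {
    "adverse reactions to NSAIDs": [0.9, 0.1, 0.0],
    "ibuprofen dosage guide": [0.7, 0.3, 0.1],
    "stock market report": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.0]  # stands in for the embedded user question

results = top_k(query_vector, toy_vectors, k=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

The two medical entries rank first even though neither label shares a keyword with the other; the unrelated entry scores near zero and is cut off by k.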
Popular vector databases you’ll encounter:
| Database | Best for | Hosted option |
|---|---|---|
| Chroma | Local dev, quick prototypes | Yes (Chroma Cloud) |
| Pinecone | Production, managed | Yes |
| Qdrant | Self-hosted production | Yes |
| Weaviate | Multi-modal, hybrid search | Yes |
| pgvector | PostgreSQL-native | Yes |
For this tutorial we’ll use Chroma — it runs in-process with no Docker required, making it perfect for learning the fundamentals before you move to a production store.
How the RAG Pipeline Fits Together
Before writing any code, it’s worth visualizing the full flow from raw documents to a grounded LLM answer.
flowchart TD
A[Raw Documents] --> B[Text Chunker]
B --> C[Embedding Model]
C --> D[(Vector Database)]
E[User Query] --> F[Embedding Model]
F --> G[ANN Search]
D --> G
G --> H[Top-K Chunks]
H --> I[LLM with Context]
I --> J[Grounded Answer]
The pipeline has two phases:
- Indexing (offline): chunk documents → embed → store in vector DB
- Retrieval (online): embed the query → ANN search → inject retrieved chunks into the LLM prompt
This separation is what makes RAG scalable. You index once and query many times. For a deeper look at what this architecture enables, see LangChain Agents and Tools: Build Agents That Take Action, which shows how retrieval tools integrate with agent loops.
Environment Setup
Install the required packages. We’ll use Chroma for the vector store, sentence-transformers for local embeddings, openai for the final LLM call, and python-dotenv to load the API key from a .env file.
pip install chromadb sentence-transformers openai python-dotenv
Create a .env file for your OpenAI key:
OPENAI_API_KEY=sk-...
Confirm the install works:
import chromadb
from sentence_transformers import SentenceTransformer
client = chromadb.Client()
model = SentenceTransformer("all-MiniLM-L6-v2")
test_vec = model.encode("hello world")
print(f"Embedding dimension: {len(test_vec)}") # → 384
Core Concepts: Chunks, Embeddings, and Collections
Chunking
LLMs have token limits, and embedding models work best on short passages. Chunking splits long documents into overlapping segments so that relevant context does not fall through the cracks at a chunk boundary.
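def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by character count."""
    # Guard against a zero or negative step, which would loop forever.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Step forward, re-reading the last `overlap` characters.
        start += chunk_size - overlap
    return chunks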
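The overlap parameter repeats the last 50 characters of each chunk at the start of the next, so text near a boundary appears in both chunks. Because the split is by character count, only a sentence no longer than the overlap is guaranteed to appear whole in some chunk; longer sentences can still be cut.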
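A tiny worked example makes the sliding window visible. The sketch below repeats the function above with deliberately small, illustrative parameters (chunk_size=4, overlap=1) so the shared characters are easy to spot.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by character count."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Each chunk begins with the last character of the previous one. Note the final chunk, 'j': it contains only text already covered by the chunk before it. In a real pipeline you may want to drop or merge such leftover fragments.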
Embeddings
An embedding model maps text to a fixed-length float vector. Semantically similar texts land near each other in this high-dimensional space. We’re using all-MiniLM-L6-v2 — a 384-dimension model that runs fast on CPU and performs well for English retrieval tasks.
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sat on the mat.",
    "A feline rested on a rug.",
    "The stock market fell sharply.",
]
vecs = embedder.encode(sentences)
print(vecs.shape) # (3, 384)
Vectors 0 and 1 will be much closer together than either is to vector 2, because the first two are semantically similar.
Collections
In Chroma, a collection is the equivalent of a table: it holds documents, their embeddings, and optional metadata. A common pattern is one collection per knowledge domain.
Building the Full RAG Pipeline
We’ll build a self-contained script that:
- Loads sample documents
- Chunks and embeds them
- Stores them in Chroma
- Retrieves relevant chunks for a query
- Calls the OpenAI API with context
# rag_pipeline.py
import os
from dotenv import load_dotenv
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
from openai import OpenAI
load_dotenv()
# ── Configuration ─────────────────────────────────────────────────────────────
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "knowledge_base"
TOP_K = 3
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50
# Sample documents — replace with file I/O for real use
DOCUMENTS = [
    {
        "id": "doc1",
        "text": (
            "LangChain is a framework for building applications powered by large language models. "
            "It provides abstractions for chains, agents, memory, and tool use. "
            "LangChain supports many LLM providers including OpenAI, Anthropic, and Mistral."
        ),
        "source": "langchain_overview.txt",
    },
    {
        "id": "doc2",
        "text": (
            "Retrieval-Augmented Generation (RAG) is a technique that grounds LLM responses "
            "in external documents. Instead of relying solely on parametric knowledge baked into "
            "the model weights, RAG fetches relevant context at inference time. This reduces "
            "hallucinations and keeps answers up to date without retraining."
        ),
        "source": "rag_overview.txt",
    },
    {
        "id": "doc3",
        "text": (
            "Vector databases store high-dimensional embeddings and support approximate nearest "
            "neighbor (ANN) search. Common choices include Chroma, Pinecone, Qdrant, and pgvector. "
            "Chroma is popular for local development because it requires no external service."
        ),
        "source": "vector_db_overview.txt",
    },
    {
        "id": "doc4",
        "text": (
            "Prompt engineering is the practice of crafting inputs to LLMs to elicit desired "
            "outputs. Techniques include few-shot examples, chain-of-thought reasoning, and "
            "structured output formatting. System prompts establish persona and constraints."
        ),
        "source": "prompt_engineering.txt",
    },
]
# ── Helpers ───────────────────────────────────────────────────────────────────
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks
# ── Step 1: Initialize clients ────────────────────────────────────────────────
embedder = SentenceTransformer(EMBED_MODEL)
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))
collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# ── Step 2: Index documents ───────────────────────────────────────────────────
def index_documents(docs: list[dict]) -> None:
    all_ids, all_texts, all_embeddings, all_metadata = [], [], [], []
    for doc in docs:
        chunks = chunk_text(doc["text"], CHUNK_SIZE, CHUNK_OVERLAP)
        for i, chunk in enumerate(chunks):
            chunk_id = f"{doc['id']}_chunk{i}"
            embedding = embedder.encode(chunk).tolist()
            all_ids.append(chunk_id)
            all_texts.append(chunk)
            all_embeddings.append(embedding)
            all_metadata.append({"source": doc["source"], "doc_id": doc["id"]})
    collection.add(
        ids=all_ids,
        documents=all_texts,
        embeddings=all_embeddings,
        metadatas=all_metadata,
    )
    print(f"Indexed {len(all_ids)} chunks from {len(docs)} documents.")
# ── Step 3: Retrieve relevant chunks ─────────────────────────────────────────
def retrieve(query: str, top_k: int = TOP_K) -> list[dict]:
    query_embedding = embedder.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        retrieved.append({"text": doc, "source": meta["source"], "distance": dist})
    return retrieved
# ── Step 4: Generate a grounded answer ───────────────────────────────────────
def answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    system_prompt = (
        "You are a helpful assistant. Answer the user's question using only "
        "the provided context. If the context doesn't contain the answer, "
        "say 'I don't have enough information.'"
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}"
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
# ── Main ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    index_documents(DOCUMENTS)
    queries = [
        "What is RAG and why does it reduce hallucinations?",
        "Which vector databases work well for local development?",
        "How does LangChain relate to LLMs?",
    ]
    for q in queries:
        print(f"\nQ: {q}")
        print(f"A: {answer(q)}")
Run it:
python rag_pipeline.py
Expected output (truncated):
Indexed 4 chunks from 4 documents.
Q: What is RAG and why does it reduce hallucinations?
A: RAG, or Retrieval-Augmented Generation, is a technique that grounds LLM responses in external documents...
Q: Which vector databases work well for local development?
A: Chroma is popular for local development because it requires no external service...
Production Patterns to Know Before You Scale
Once you move beyond a local prototype, a few patterns become essential:
Persistent storage: chromadb.Client() keeps everything in memory, so the index is lost when the process exits. Switch to disk persistence:
client = chromadb.PersistentClient(path="./chroma_data")
Batch embedding: Encode all chunks in one call instead of one call per chunk; encode() accepts a list of texts and processes it in batches of batch_size:
embeddings = embedder.encode(all_texts, batch_size=64, show_progress_bar=True)
Metadata filtering: Chroma supports where clauses that restrict a query to records with matching metadata, which is useful when you have multi-tenant data:
results = collection.query(
    query_embeddings=[query_vec],
    n_results=5,
    where={"source": "langchain_overview.txt"},
)
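Conceptually, a metadata filter narrows the candidate set, and the similarity ranking then runs on what is left. The sketch below illustrates that idea in plain Python. The records, the tenant field, and the filtered_query helper are all hypothetical, and this is not a description of Chroma's internals.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance (smaller means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_query(records: list[dict], query_vec: list[float],
                   where: dict, n_results: int = 5) -> list[dict]:
    """Keep records whose metadata matches every key in `where`, then rank by distance."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: squared_l2(r["embedding"], query_vec))
    return ranked[:n_results]


# Hypothetical two-dimensional embeddings with a tenant label per record.
records = [
    {"id": "a1", "embedding": [0.1, 0.9], "metadata": {"tenant": "acme"}},
    {"id": "a2", "embedding": [0.8, 0.2], "metadata": {"tenant": "acme"}},
    {"id": "b1", "embedding": [0.1, 0.8], "metadata": {"tenant": "globex"}},
]

hits = filtered_query(records, query_vec=[0.1, 0.85], where={"tenant": "acme"}, n_results=2)
hit_ids = [r["id"] for r in hits]
print(hit_ids)
```

Record b1 is roughly as close to the query as a1, but it is never considered because it belongs to a different tenant. That is the property that matters for multi-tenant data: the filter decides what is eligible, and similarity only orders the eligible records.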
Hybrid search: For higher precision, combine dense (vector) retrieval with sparse (BM25) retrieval, then re-rank the merged candidates with a cross-encoder. Frameworks like LangChain provide ensemble retrievers that wire this up for you.
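One common way to merge a dense ranking with a sparse one is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not something this pipeline prescribes: the two hit lists are hypothetical, k=60 is a commonly used default, and the cross-encoder re-ranking step is left out.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["doc2", "doc3", "doc1"]   # hypothetical vector-search order
sparse_hits = ["doc3", "doc4", "doc2"]  # hypothetical BM25 order

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc3', 'doc2', 'doc4', 'doc1']
```

RRF works on ranks only, so it sidesteps the problem that vector distances and BM25 scores live on different scales. Documents that appear in both lists rise to the top.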
Chunking strategy: For structured documents (Markdown, HTML), prefer structure-aware chunking that splits on headings and paragraphs rather than fixed character counts. The quality of your chunks is often the single biggest lever on retrieval quality.
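As a rough illustration of splitting on structure instead of character counts, the sketch below cuts a Markdown string at its headings. The split_markdown_by_heading helper and the sample text are hypothetical, and the approach is simplified: it would, for example, mistake a # comment inside a fenced code block for a heading.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split a Markdown string into one chunk per heading section."""
    sections = []
    current = {"heading": "", "lines": []}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # A new heading closes the section collected so far.
            sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)

    chunks = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        if section["heading"] or text:  # skip an empty preamble
            chunks.append({"heading": section["heading"], "text": text})
    return chunks


sample = "# Setup\nInstall the package.\n\n## Usage\nCall the API.\nCheck the result.\n"
sections = split_markdown_by_heading(sample)
for s in sections:
    print(repr(s["heading"]), "->", repr(s["text"]))
```

Keeping the heading alongside each chunk is a deliberate choice: it can be stored as metadata or prepended to the text before embedding, so the chunk carries its own context.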
When you’re ready to layer agent reasoning on top of your retrieval pipeline, OpenClaw Multi-Agent System: Run a Team of Specialized AIs shows how dedicated retriever agents can hand off context to other agents in a multi-step workflow.
Frequently Asked Questions
What’s the difference between a vector database and a traditional database?
A traditional database matches records by exact or pattern-based criteria (SQL WHERE clauses, LIKE patterns, full-text indexes). A vector database indexes floating-point vectors and retrieves the records whose vectors are geometrically closest to a query vector, enabling semantic search where meaning matters more than exact wording. They’re complementary: many production RAG systems use both, filtering on metadata in SQL first, then running ANN search on the reduced set.
How do I choose between Chroma, Pinecone, and Qdrant?
Use Chroma when you want zero infrastructure overhead during development. Use Pinecone when you need a fully managed, horizontally scalable cloud service and don’t want to operate servers. Use Qdrant when you want self-hosted production control, rich filtering, and an active open-source community. For a small project or a learning environment, start with Chroma and migrate when you have real scale requirements.
How many chunks should I retrieve (top-K)?
Start with top_k=3 to 5. Too few and you miss relevant context; too many and you flood the LLM’s context window with noise, increasing cost and reducing answer quality. If your LLM supports a large context window (128K+), you can afford higher K — but always measure retrieval precision before assuming more is better.
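Measuring the effect of K can be as simple as computing precision and recall over a handful of hand-labeled queries. The sketch below assumes you already have a ranked list of retrieved chunk IDs and a set of IDs judged relevant; both are hypothetical here.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant_ids) / len(top)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


# Hypothetical ranking for one query, plus hand-labeled relevant chunks.
retrieved = ["doc2_chunk0", "doc3_chunk0", "doc1_chunk0", "doc4_chunk0"]
relevant = {"doc2_chunk0", "doc1_chunk0"}

for k in (1, 2, 3, 4):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}")
```

In this toy ranking, recall reaches 1.0 at k=3, and raising K to 4 only lowers precision. That is the pattern to look for on your own data: once recall plateaus, extra chunks are mostly noise.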
Does the choice of embedding model matter a lot?
Yes. The embedding model determines how well semantically similar texts cluster together. all-MiniLM-L6-v2 is a solid lightweight choice for English. For higher accuracy, consider text-embedding-3-large (OpenAI) or bge-large-en (BAAI, English only); for multilingual content, use a multilingual model such as bge-m3 (BAAI). Always benchmark on a sample of your real queries before committing to a model in production.
Can I update documents in the vector store without re-indexing everything?
Yes. Vector databases support upsert operations — you can add new chunks, delete chunks by ID, or update embeddings for changed documents independently. Chroma exposes collection.upsert() and collection.delete() for this. In practice, maintain a mapping between your source document IDs and their chunk IDs so you can efficiently replace only what changed.
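One way to keep that mapping is sketched below. It assumes chunk IDs derived from a content hash (a different scheme from the positional doc1_chunk0 IDs used earlier), so an unchanged chunk keeps its ID across versions. The helper names and the in-memory doc_to_chunk_ids dictionary are hypothetical, and the sketch only computes what to change; it does not call Chroma.

```python
import hashlib


def chunk_ids_for(doc_id: str, chunks: list[str]) -> dict[str, str]:
    """Map content-derived chunk IDs to chunk text for one document."""
    ids = {}
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:12]
        ids[f"{doc_id}_{digest}"] = chunk
    return ids


def plan_sync(old_ids: set[str], new_chunks: dict[str, str]) -> tuple[dict[str, str], set[str]]:
    """Return (chunks to upsert, chunk IDs to delete) for a changed document."""
    to_upsert = {cid: text for cid, text in new_chunks.items() if cid not in old_ids}
    to_delete = old_ids - set(new_chunks)
    return to_upsert, to_delete


doc_to_chunk_ids: dict[str, set[str]] = {}  # would be persisted alongside the index

# First version of the document.
v1 = chunk_ids_for("doc1", ["Intro paragraph.", "Old pricing details."])
doc_to_chunk_ids["doc1"] = set(v1)

# The document changes: only the second chunk is different.
v2 = chunk_ids_for("doc1", ["Intro paragraph.", "New pricing details."])
to_upsert, to_delete = plan_sync(doc_to_chunk_ids["doc1"], v2)
doc_to_chunk_ids["doc1"] = set(v2)

print("upsert:", list(to_upsert.values()))
print("delete:", len(to_delete), "stale chunk(s)")
```

With a plan like this, only the changed chunk needs to be re-embedded and passed to collection.upsert(), and the stale IDs go to collection.delete(). The unchanged intro chunk is never touched.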