Advanced RAG Techniques: Beyond Simple Vector Search

If you’ve been building AI applications for a while, you’ve likely encountered the limitations of naive vector similarity search. Moving beyond it is not just a performance upgrade; it’s a fundamental shift in how you architect retrieval systems for production. Simple cosine similarity against a flat vector index works in demos, but it breaks down on real corpora with overlapping topics, ambiguous queries, and latency constraints that matter to actual users.

This tutorial walks through the most impactful advanced RAG patterns: hybrid search, HyDE (Hypothetical Document Embeddings), cross-encoder re-ranking, and contextual compression, with Python code you can adapt to your own stack. Parent-document retrieval, a form of multi-vector retrieval, is covered briefly in the FAQ.

Why Simple Vector Search Breaks in Production

Before diving into solutions, you need to understand what actually fails. A naive RAG pipeline embeds your query, does a k-nearest-neighbor lookup in a vector store, and feeds the top-k chunks to an LLM. Three problems kill this approach at scale:

Semantic drift: Embeddings compress semantics into fixed-length vectors. Rare technical terms, proper nouns, and version numbers are often poorly represented. A query for “LangChain 0.2 breaking changes” may retrieve documents about LangChain in general, not the version-specific content you need.

Chunk granularity mismatch: If your chunks are too large, a single chunk contains multiple concepts and retrieves with low precision. Too small, and individual chunks lack context to be useful. There’s no single chunk size that works for all query types.

Re-ranking absence: Embedding similarity is a proxy for relevance, not relevance itself. The 8th-most-similar document is often more useful than the 2nd-most-similar for a given query, especially when the query is short or ambiguous.

Here’s the architecture we’re building toward:

flowchart TD
    Q[User Query] --> HQ[HyDE: Generate Hypothetical Doc]
    Q --> EQ[Embed Original Query]
    HQ --> EHQ[Embed Hypothetical Doc]
    EQ --> HS[Hybrid Search\nVector + BM25]
    EHQ --> HS
    HS --> RC[Raw Candidates\ntop-50]
    RC --> RR[Cross-Encoder Re-ranker]
    RR --> CC[Contextual Compression]
    CC --> LLM[LLM Answer Generation]
    LLM --> A[Final Answer]

Setup: Install Dependencies

All examples use LangChain, but the concepts translate directly to LlamaIndex, Haystack, or a custom stack.

pip install langchain langchain-openai langchain-community \
    rank-bm25 sentence-transformers chromadb \
    openai tiktoken python-dotenv

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
EMBEDDING_MODEL = "text-embedding-3-small"
RERANKER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
LLM_MODEL = "gpt-4o-mini"
TOP_K_RETRIEVE = 50   # retrieve more, re-rank to fewer
TOP_K_FINAL = 5       # feed this many to LLM

Hybrid Search: Combining Dense and Sparse Retrieval

Hybrid search fuses embedding-based (dense) retrieval with keyword-based (sparse) retrieval — typically BM25. Dense retrieval captures semantic meaning; sparse retrieval excels at exact term matching. Together they cover each other’s blind spots.

The fusion method that consistently outperforms simple score averaging is Reciprocal Rank Fusion (RRF). Each document receives a score of 1 / (k + rank) from each retriever, and these scores are summed. The constant k (usually 60) prevents very high-ranked documents from dominating.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter

def build_hybrid_retriever(documents: list, k: int = 50):
    """
    Build a hybrid retriever combining dense (Chroma) and sparse (BM25) search
    with Reciprocal Rank Fusion.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,      # measured in characters (default length function), not tokens
        chunk_overlap=64,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(chunks, embeddings)
    dense_retriever = vectorstore.as_retriever(search_kwargs={"k": k})

    sparse_retriever = BM25Retriever.from_documents(chunks)
    sparse_retriever.k = k

    # EnsembleRetriever applies RRF internally
    hybrid_retriever = EnsembleRetriever(
        retrievers=[dense_retriever, sparse_retriever],
        weights=[0.6, 0.4]   # tune based on your corpus
    )
    return hybrid_retriever, chunks

The weights parameter here doesn’t average raw similarity scores. It scales each retriever’s RRF contribution, so a document’s fused score is the sum of weight / (c + rank) over the retrievers that returned it. For technical documentation with precise terminology, push the sparse weight toward 0.5. For conversational or philosophical content, lean dense.
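
To make the fusion step concrete, here is a minimal, framework-free sketch of weighted Reciprocal Rank Fusion. It assumes each retriever returns a ranked list of document IDs (best first) and that a retriever's weight scales its 1 / (c + rank) contribution; the function name weighted_rrf and the sample IDs are illustrative, not part of any library.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    rankings: one ranked list (best first) per retriever.
    weights:  one float per retriever.
    c:        smoothing constant (the k in 1 / (k + rank)).
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)

    # Highest fused score first; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=scores.get, reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]    # hypothetical dense results
sparse = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results
fused = weighted_rrf([dense, sparse], weights=[0.6, 0.4])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers rank highly (doc_a, doc_c) rise to the top, while a document seen by only one retriever keeps a smaller score. Because only ranks are used, the dense and sparse scores never need to be on the same scale.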

HyDE: Hypothetical Document Embeddings

HyDE inverts the retrieval problem. Instead of embedding the query and hoping it’s close to the relevant document embeddings, you ask the LLM to generate a hypothetical document that would answer the query — then embed that. The hypothesis lives in “answer space,” which tends to be much closer to real answer documents than a short question does.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

HYDE_PROMPT = ChatPromptTemplate.from_template("""
Write a concise technical paragraph (3-5 sentences) that would directly answer this question.
Write as if you are a technical documentation author. Do not hedge or say "it depends."

Question: {question}

Technical Answer Paragraph:
""")

def hyde_retriever(question: str, retriever, llm_model: str = "gpt-4o-mini"):
    """
    Generate a hypothetical document, embed it, and use it for retrieval.
    Falls back to original query if HyDE generation fails.
    """
    llm = ChatOpenAI(model=llm_model, temperature=0.0)
    hyde_chain = HYDE_PROMPT | llm | StrOutputParser()

    try:
        hypothetical_doc = hyde_chain.invoke({"question": question})
    except Exception:
        hypothetical_doc = question  # graceful fallback

    # Retrieve using hypothetical document
    docs_from_hyde = retriever.invoke(hypothetical_doc)
    # Also retrieve using original query for diversity
    docs_from_query = retriever.invoke(question)

    # Deduplicate by page_content hash
    seen = set()
    combined = []
    for doc in docs_from_hyde + docs_from_query:
        key = hash(doc.page_content)
        if key not in seen:
            seen.add(key)
            combined.append(doc)

    return combined

HyDE works best for factual, technical queries. It adds one LLM call per query, so if latency is critical, A/B test it per query category and enable it only where it measurably improves retrieval.

Cross-Encoder Re-Ranking

Re-ranking is often the highest-ROI improvement you can make to an existing RAG pipeline. A cross-encoder takes the full (query, document) pair as a single input, unlike bi-encoders that embed query and document separately. This lets it model the interaction between query and document tokens directly instead of comparing two independently computed vectors.

The tradeoff: a cross-encoder needs one forward pass per (query, document) pair at query time and cannot precompute document vectors, so scoring an entire corpus costs O(n) model inferences per query. In practice you use it as a second pass over the top-50 first-stage candidates.

from sentence_transformers import CrossEncoder
from langchain_core.documents import Document

class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        documents: list[Document],
        top_k: int = 5
    ) -> list[Document]:
        if not documents:
            return []

        pairs = [(query, doc.page_content) for doc in documents]
        scores = self.model.predict(pairs)

        scored_docs = sorted(
            zip(scores, documents),
            key=lambda x: x[0],
            reverse=True
        )

        # Attach score to metadata for debugging/logging
        reranked = []
        for score, doc in scored_docs[:top_k]:
            doc.metadata["rerank_score"] = float(score)
            reranked.append(doc)

        return reranked

For higher accuracy at the cost of latency, use cross-encoder/ms-marco-electra-base. For multilingual corpora, cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 covers 13 languages.

Contextual Compression: Cut the Noise Before the LLM Sees It

Even after re-ranking, a retrieved chunk often contains irrelevant sentences. Contextual compression extracts only the portions of each document that are relevant to the query. This reduces LLM token consumption and improves answer quality by eliminating distracting context.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

def build_compression_retriever(base_retriever, llm_model: str = "gpt-4o-mini"):
    """
    Wrap a retriever with LLM-based contextual compression.
    The compressor extracts relevant spans from each retrieved document.
    """
    llm = ChatOpenAI(model=llm_model, temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)

    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )
    return compression_retriever

Contextual compression adds LLM calls proportional to the number of retrieved documents (top_k calls per query). Use it selectively — apply it only when answer quality matters more than latency, or cache compressed results per (query_hash, doc_hash) pair.
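
The caching idea can be sketched as a small in-memory memoizer keyed by (query_hash, doc_hash). This is an illustration under assumptions: CompressionCache and toy_compress are hypothetical names, and the compressor is modeled as any function that maps (query, doc_text) to compressed text, standing in for an LLM-backed extractor.

```python
import hashlib
from typing import Callable, Dict, Tuple


def _digest(text: str) -> str:
    # Stable content hash; unlike built-in hash(), it is identical across processes.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class CompressionCache:
    """Memoize compressed text per (query_hash, doc_hash) pair."""

    def __init__(self, compress_fn: Callable[[str, str], str]):
        # compress_fn is an assumption: (query, doc_text) -> compressed text
        self._compress_fn = compress_fn
        self._store: Dict[Tuple[str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def compress(self, query: str, doc_text: str) -> str:
        key = (_digest(query), _digest(doc_text))
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._compress_fn(query, doc_text)
        self._store[key] = result
        return result


def toy_compress(query: str, doc_text: str) -> str:
    # Stand-in for an LLM call: keep sentences that share a word with the query.
    terms = set(query.lower().split())
    kept = [s for s in doc_text.split(". ") if terms & set(s.lower().split())]
    return ". ".join(kept)


cache = CompressionCache(toy_compress)
doc = "Rate limits reset every minute. The logo is blue"
first = cache.compress("rate limits", doc)
second = cache.compress("rate limits", doc)  # served from the cache
print(first, cache.hits, cache.misses)
```

An exact-match cache like this only helps when the same query text recurs against the same chunk. For a long-running service, the dictionary would likely be replaced by a bounded or persistent store.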

Putting It All Together: The Production Pipeline

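Here’s the complete pipeline combining hybrid search, HyDE, and re-ranking. Contextual compression is left out of this class because of its per-document LLM cost; to add it, wrap a retriever with build_compression_retriever from the previous section.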

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

ANSWER_PROMPT = ChatPromptTemplate.from_template("""
You are a precise technical assistant. Answer the question using ONLY the provided context.
If the context doesn't contain enough information, say so explicitly.

Context:
{context}

Question: {question}

Answer:
""")

def format_docs(docs: list) -> str:
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )

class AdvancedRAGPipeline:
    def __init__(self, documents: list):
        self.hybrid_retriever, self.chunks = build_hybrid_retriever(
            documents, k=50
        )
        self.reranker = CrossEncoderReranker()
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def run(self, question: str, use_hyde: bool = True) -> dict:
        # Step 1: Retrieve with optional HyDE
        if use_hyde:
            candidates = hyde_retriever(question, self.hybrid_retriever)
        else:
            candidates = self.hybrid_retriever.invoke(question)

        # Step 2: Re-rank
        reranked = self.reranker.rerank(question, candidates, top_k=5)

        # Step 3: Generate answer
        context = format_docs(reranked)
        answer_chain = ANSWER_PROMPT | self.llm | StrOutputParser()
        answer = answer_chain.invoke({
            "context": context,
            "question": question
        })

        return {
            "answer": answer,
            "sources": [doc.metadata.get("source") for doc in reranked],
            "rerank_scores": [doc.metadata.get("rerank_score") for doc in reranked],
            "num_candidates": len(candidates)
        }

# Usage
if __name__ == "__main__":
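    from langchain_community.document_loaders import DirectoryLoader, TextLoader

    # TextLoader avoids DirectoryLoader's default dependency on the `unstructured` package
    loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)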
    documents = loader.load()

    pipeline = AdvancedRAGPipeline(documents)
    result = pipeline.run("What are the rate limits for the OpenAI embeddings API?")

    print(result["answer"])
    print(f"\nSources: {result['sources']}")
    print(f"Re-rank scores: {result['rerank_scores']}")

For a no-code equivalent of a basic RAG pipeline, see Build a RAG Pipeline in n8n with a Vector Database — it’s useful for prototyping before committing to a Python implementation.

For structured extraction from retrieved documents (useful when your RAG answers need to populate a schema), check out LangChain Structured Output: Extract Data with Pydantic.

Production Checklist

Before shipping an advanced RAG pipeline:

Evaluate retrieval separately from generation. Track recall@k (did the right doc appear in top-k?) and MRR (mean reciprocal rank). A bad retriever cannot be saved by a good LLM.
Cache embeddings aggressively. Re-embedding the same corpus on every restart is wasteful. Use a persistent vector store (Chroma with persistence, Pinecone, Qdrant) and embed only new or changed documents.
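Monitor re-rank score distribution. The ms-marco cross-encoders typically return raw, unbounded logits rather than probabilities, so check the score range you actually get and apply a sigmoid before comparing against probability-style cutoffs. If your top reranked document consistently scores below 0.3 on that normalized scale, your retrieval candidates are poor: tune chunk size or adjust the hybrid weights.
Set a relevance threshold. If the highest re-rank score is below a configurable threshold (e.g., 0.1 on the same normalized scale), return “I don’t have enough information” rather than hallucinating. Never force an answer from irrelevant context.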
Log query–answer pairs. Human spot-checking of 20–30 random pairs per week catches systematic failures before users notice them.
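
The relevance-threshold item can be sketched as a small gate in front of answer generation. The sketch assumes the re-ranker hands back (score, document) pairs and that scores may be raw logits that need a sigmoid before thresholding; gate_by_relevance and its parameters are hypothetical names, and the 0.1 default simply mirrors the example value in the checklist.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable for large negative or positive logits.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_by_relevance(scored_docs, threshold=0.1, scores_are_logits=True):
    """Return documents (best first) only if the top score clears the threshold.

    scored_docs: list of (score, document) pairs from a re-ranker.
    An empty result means the caller should decline to answer.
    """
    if not scored_docs:
        return []
    normalized = [
        (sigmoid(score) if scores_are_logits else score, doc)
        for score, doc in scored_docs
    ]
    best = max(score for score, _ in normalized)
    if best < threshold:
        return []
    ranked = sorted(normalized, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]


scored = [(-4.3, "chunk about logos"), (8.6, "chunk about rate limits")]
print(gate_by_relevance(scored))                        # both chunks, best first
print(gate_by_relevance([(-4.3, "chunk about logos")])) # [] -> decline to answer
```

When the gate returns an empty list, the pipeline can skip the LLM call and return the fallback message directly. The threshold itself is best calibrated against logged queries rather than taken from this example.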

Frequently Asked Questions

When should I use HyDE vs. standard query embedding?

Use HyDE when your queries are short questions (under 20 words) and your documents are longer prose. HyDE helps most when the vocabulary gap between questions and answers is large — for example, a user asks “how do I prevent rate limits?” but the documentation says “implement exponential backoff with jitter.” Avoid HyDE for queries that already contain precise technical terms (model names, API method signatures), where the original query embedding already lands near the right documents.
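
As a rough illustration of this rule of thumb, the routing decision can be written as a cheap heuristic that runs before any LLM call. Everything here is an assumption for demonstration: should_use_hyde is a hypothetical helper, and the regular expression is only one possible way to spot "precise technical terms" such as method calls, snake_case identifiers, version numbers, or inline code.

```python
import re

# Hypothetical patterns for "precise technical terms"; tune them for your corpus.
_PRECISE_TERM = re.compile(
    r"""
      \b\w+\.\w+\(     # method call such as client.embed(
    | \b\w+_\w+\b      # snake_case identifier
    | \b\d+\.\d+\b     # version number such as 0.2
    | `[^`]+`          # inline code span
    """,
    re.VERBOSE,
)


def should_use_hyde(query: str, max_words: int = 20) -> bool:
    """Return True for short natural-language questions without precise terms."""
    if len(query.split()) >= max_words:
        return False  # long queries already carry enough signal
    if _PRECISE_TERM.search(query):
        return False  # exact terms embed well and suit BM25
    return True


print(should_use_hyde("how do I prevent rate limits?"))    # True
print(should_use_hyde("LangChain 0.2 breaking changes"))   # False
```

The result could feed the use_hyde flag of the pipeline's run method. A heuristic like this is only a starting point; comparing retrieval metrics with and without HyDE per query category gives a more reliable answer.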

What chunk size should I use for advanced RAG?

There is no universal answer, but a practical heuristic: use smaller chunks (256–512 tokens) for retrieval, and larger “parent chunks” (1024–2048 tokens) for the LLM context. This is called parent-document retrieval — you embed small chunks for precision, but when a small chunk is selected, you return its parent chunk to give the LLM enough context. LangChain’s ParentDocumentRetriever implements this pattern directly.
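
The parent-document idea can be shown without any framework. In this sketch, word-overlap scoring stands in for embedding similarity and chunk sizes are counted in words rather than tokens; all function names are illustrative and this is not how ParentDocumentRetriever is implemented internally.

```python
from typing import Dict, List, Tuple


def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_child_index(parents: Dict[str, str], child_size: int) -> List[Tuple[str, str]]:
    """Return (parent_id, child_text) pairs so each child remembers its parent."""
    return [
        (parent_id, child)
        for parent_id, text in parents.items()
        for child in split_words(text, child_size)
    ]


def overlap_score(query: str, text: str) -> int:
    # Toy relevance score; a real system would compare embeddings here.
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, child_index, top_children=2):
    """Rank small chunks, then return their de-duplicated parent chunks."""
    ranked = sorted(child_index, key=lambda pc: overlap_score(query, pc[1]), reverse=True)
    seen, results = set(), []
    for parent_id, _child in ranked[:top_children]:
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


parents = {
    "p1": "Exponential backoff with jitter spreads retries over time. "
          "It prevents clients from retrying in lockstep.",
    "p2": "The dashboard shows usage charts. Colors can be customized in settings.",
}
child_index = build_child_index(parents, child_size=6)
hits = retrieve_parents("backoff jitter retries", parents, child_index)
print(hits)  # the full p1 text, returned once
```

The small chunk "Exponential backoff with jitter spreads retries" matches the query precisely, but the LLM receives the whole parent passage, including the sentence that explains why jitter matters.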

How much does re-ranking slow down my pipeline?

The ms-marco-MiniLM-L-6-v2 cross-encoder running on CPU adds roughly 20–80ms for re-ranking 50 short candidates, and a GPU (even a T4) cuts that substantially. Treat these as rough figures: latency grows with chunk length and depends on batch size and hardware, so benchmark on your own setup. The latency cost is almost always worth the quality gain. If you’re latency-constrained, reduce your initial retrieval pool (top-30 instead of top-50) rather than skipping re-ranking.
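
Since latency depends heavily on hardware and chunk length, a short timing harness is a more reliable guide than any quoted figure. The helper below is a generic sketch (measure_latency_ms is a hypothetical name) that times any zero-argument callable; the workload shown is a stand-in, and in practice you would pass something like a lambda wrapping your re-ranker call.

```python
import statistics
import time


def measure_latency_ms(fn, runs=20, warmup=3):
    """Return (median_ms, p95_ms) for repeated calls to fn()."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before timing

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return statistics.median(samples), samples[p95_index]


# Stand-in workload; replace with e.g. lambda: reranker.rerank(query, candidates)
median_ms, p95_ms = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"median={median_ms:.2f}ms p95={p95_ms:.2f}ms")
```

Running it once with 50 candidates and once with 30 shows directly how much a smaller retrieval pool saves on your own hardware.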

Can I use advanced RAG techniques without LangChain?

Yes. The concepts are framework-agnostic. CrossEncoder from sentence-transformers works standalone. BM25 via rank-bm25 requires no orchestration framework. The main LangChain dependency in these examples is EnsembleRetriever for hybrid search — you can replace it with manual RRF logic in about 15 lines of Python.

How do I evaluate whether advanced RAG is actually better than simple RAG?

Use RAGAS (Retrieval Augmented Generation Assessment) to measure four metrics: answer faithfulness (does the answer match the context?), answer relevancy (does the answer address the question?), context precision (are retrieved docs relevant?), and context recall (were relevant docs retrieved?). Run both pipelines against a 50–100 question golden dataset and compare. Advanced RAG should improve context precision and recall significantly; answer quality gains follow from those.
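
Alongside RAGAS, the retrieval-only metrics mentioned in the production checklist (recall@k and MRR) can be computed with a few lines of plain Python. This sketch assumes a golden dataset that maps each question to the IDs of its relevant documents and a run that maps each question to the ranked IDs a pipeline returned; the data and function names are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(run, golden, k=5):
    """Average recall@k and MRR over all questions in the golden set."""
    n = len(golden)
    recall = sum(recall_at_k(run[q], rel, k) for q, rel in golden.items()) / n
    mrr = sum(reciprocal_rank(run[q], rel) for q, rel in golden.items()) / n
    return {"recall@k": recall, "mrr": mrr}


# Hypothetical golden set and pipeline outputs
golden = {"q1": ["d1"], "q2": ["d4", "d5"]}
simple = {"q1": ["d2", "d3", "d1"], "q2": ["d9", "d4", "d8"]}
advanced = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d9"]}

simple_metrics = evaluate(simple, golden, k=2)
advanced_metrics = evaluate(advanced, golden, k=2)
print(simple_metrics)
print(advanced_metrics)
```

On this toy data the simple run scores recall@2 of 0.25 and the advanced run 1.0. With a real 50–100 question golden set, the same comparison shows whether the added stages improve retrieval before any generation metric is considered.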