If you’ve been searching for a practical guide on Advanced Claw Code: Building a RAG Agent with a Vector Database, you’ve landed in the right place. This tutorial goes beyond basic prompt-response patterns to show you how to wire a Retrieval-Augmented Generation (RAG) pipeline into Claw Code — giving your agent long-term, searchable knowledge from your own documents. By the end, you’ll have a fully working system that indexes a document corpus into ChromaDB, retrieves semantically relevant chunks at query time, and feeds that context into Claw Code’s underlying Claude model.
Architecture Overview
Before writing a single line of code, it helps to see how all the pieces connect. A RAG agent has three distinct phases: ingestion (chunking and embedding documents), retrieval (finding relevant chunks for a given query), and generation (passing retrieved context to the LLM).
flowchart TD
A[Raw Documents] --> B[Chunker]
B --> C[Embedding Model]
C --> D[(ChromaDB Vector Store)]
E[User Query] --> F[Query Embedder]
F --> G[Similarity Search]
D --> G
G --> H[Top-K Chunks]
H --> I[Prompt Builder]
E --> I
I --> J[claw CLI]
J --> K[Claude API]
K --> L[Final Answer]
Claw Code acts as the generation layer — it receives a fully constructed prompt containing the retrieved context and returns a grounded answer. The rest of the pipeline is Python orchestration.
Prerequisites and Environment Setup
You’ll need the following in place before starting:
- Rust toolchain installed via rustup
- Claw Code binary built from source
- Python 3.10+ for the orchestration scripts
- An Anthropic API key
Build the Claw Code binary first:
git clone https://github.com/ultraworkers/claw-code
cd claw-code/rust
cargo build --workspace
Verify the binary is healthy:
./target/debug/claw doctor
Next, set up the Python environment:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install chromadb sentence-transformers pypdf tiktoken
Export your API key:
export ANTHROPIC_API_KEY="sk-ant-..."
Finally, create the project layout:
rag-agent/
├── claw-code/ ← cloned repo, binary at rust/target/debug/claw
├── ingest.py ← document indexing script
├── agent.py ← RAG query + claw CLI integration
├── docs/ ← your source documents (.txt, .pdf)
└── chroma_store/ ← persisted vector DB (auto-created)
Ingesting Documents into ChromaDB
Document ingestion converts raw text into vector embeddings stored in a persistent collection. We’ll use sentence-transformers for local, zero-cost embeddings — no external API call required at index time.
# ingest.py
import os
import glob
import chromadb
from chromadb.utils import embedding_functions

DOCS_DIR = "./docs"
CHROMA_PATH = "./chroma_store"
COLLECTION_NAME = "rag_corpus"
CHUNK_SIZE = 500  # characters
CHUNK_OVERLAP = 50

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks

def load_documents(directory: str) -> list[dict]:
    """Load .txt files from a directory."""
    docs = []
    for path in glob.glob(os.path.join(directory, "**/*.txt"), recursive=True):
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        docs.append({"path": path, "content": content})
    return docs

def ingest():
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    # Local embedding model — runs entirely on CPU
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    collection = client.get_or_create_collection(
        name=COLLECTION_NAME,
        embedding_function=ef,
        metadata={"hnsw:space": "cosine"},
    )
    docs = load_documents(DOCS_DIR)
    if not docs:
        print(f"No .txt files found in {DOCS_DIR}")
        return
    all_chunks, all_ids, all_meta = [], [], []
    for doc in docs:
        chunks = chunk_text(doc["content"])
        for i, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            all_ids.append(f"{doc['path']}::chunk{i}")
            all_meta.append({"source": doc["path"], "chunk_index": i})
    collection.upsert(
        documents=all_chunks,
        ids=all_ids,
        metadatas=all_meta,
    )
    print(f"Indexed {len(all_chunks)} chunks from {len(docs)} document(s).")

if __name__ == "__main__":
    ingest()
Place some .txt files in ./docs/ and run:
python ingest.py
The cosine similarity metric (hnsw:space set to cosine) is recommended for sentence embeddings: it compares embedding direction rather than magnitude, so chunks of different lengths are scored on meaning alone, giving more consistent retrieval quality.
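To see concretely what cosine buys you, here is a dependency-free illustration: scaling a chunk's embedding changes its dot product against a query vector, but leaves its cosine similarity untouched, which is exactly the length-insensitivity you want for retrieval.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    # Dividing by both magnitudes means only the vectors' direction matters.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 2.0, 3.0]
chunk = [2.0, 4.0, 6.0]            # same direction as the query
longer = [x * 10 for x in chunk]   # same direction, 10x the magnitude

print(dot(query, chunk), dot(query, longer))      # 28.0 vs 280.0 — dot grows with magnitude
print(cosine(query, chunk), cosine(query, longer))  # both ≈ 1.0 — cosine does not
```

The same property holds for real sentence-transformer vectors; this toy example just makes it visible without loading a model.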
Building the RAG Query and Claw Integration
The retrieval layer queries ChromaDB, assembles a system prompt containing the top-K chunks as grounding context, then shells out to the claw binary via subprocess. This keeps all the agentic reasoning inside Claw Code while your Python layer handles knowledge retrieval.
# agent.py
import subprocess
import sys
import chromadb
from chromadb.utils import embedding_functions

CHROMA_PATH = "./chroma_store"
COLLECTION_NAME = "rag_corpus"
CLAW_BINARY = "./claw-code/rust/target/debug/claw"
TOP_K = 5
MAX_CONTEXT_CHARS = 4000

def retrieve(query: str, top_k: int = TOP_K) -> list[str]:
    """Return the top-K most relevant document chunks for a query."""
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    collection = client.get_collection(name=COLLECTION_NAME, embedding_function=ef)
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    return results["documents"][0]  # list of chunk strings

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Construct a grounded prompt from retrieved context."""
    context_block = "\n\n---\n\n".join(chunks)
    # Truncate to avoid exceeding the context window
    if len(context_block) > MAX_CONTEXT_CHARS:
        context_block = context_block[:MAX_CONTEXT_CHARS] + "\n...[truncated]"
    return (
        "You are a precise technical assistant. "
        "Answer ONLY using the provided context. "
        "If the answer is not in the context, say so clearly.\n\n"
        f"## Retrieved Context\n\n{context_block}\n\n"
        f"## Question\n\n{query}"
    )

def ask_claw(prompt: str) -> str:
    """Send the prompt to the claw CLI and return its output."""
    result = subprocess.run(
        [CLAW_BINARY, "prompt", prompt],
        capture_output=True,
        text=True,
        timeout=120,
    )
    if result.returncode != 0:
        raise RuntimeError(f"claw exited with code {result.returncode}\n{result.stderr}")
    return result.stdout.strip()

def rag_query(query: str) -> str:
    chunks = retrieve(query)
    prompt = build_rag_prompt(query, chunks)
    return ask_claw(prompt)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python agent.py \"your question here\"")
        sys.exit(1)
    answer = rag_query(" ".join(sys.argv[1:]))
    print(answer)
Run a query:
python agent.py "What are the main configuration options for the pipeline?"
The subprocess.run call passes the assembled RAG prompt directly to the claw prompt subcommand. No custom plugin or API wrapper is needed — the Claw binary handles all communication with Claude.
Production Patterns
A working prototype is one thing; a production RAG agent needs to handle edge cases, latency, and observability. Apply these patterns before deploying.
Re-ranking Retrieved Chunks
Re-ranking scores retrieved chunks a second time using a cross-encoder model, which is slower but more accurate than bi-encoder similarity alone. Add this step between retrieval and prompt construction:
from sentence_transformers import CrossEncoder

_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = _reranker.predict(pairs)
    # Sort by score only, so tied scores never fall through to comparing chunk text
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
Update rag_query to call rerank(query, chunks) before passing to build_rag_prompt.
Structured Metadata Filtering
ChromaDB supports where-clause filtering so you can scope retrieval to a specific document subset. Note that metadata filters are operator-based ($eq, $ne, $in, and so on), not substring searches — $contains is only valid in where_document, which matches against chunk text. The cleanest way to scope by folder is to store a folder-level field such as domain in each chunk's metadata at ingest time, then filter on it exactly:
results = collection.query(
    query_texts=[query],
    n_results=top_k,
    where={"domain": "architecture"},  # exact match on a metadata field set at ingest
    include=["documents"],
)
This is essential when your corpus contains documents from multiple domains and you want to route queries to the right subset. For similar patterns in tool-using agents, see Advanced AutoGen: Empowering Agents with Custom Tools and Functions.
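To make that kind of routing possible, each chunk needs a coarse domain label in its metadata. A minimal sketch, assuming the domain_for helper name and the convention that the first folder under docs/ names the domain (neither is part of the original ingest.py):

```python
import os

DOCS_DIR = "./docs"

def domain_for(path: str) -> str:
    """Derive a coarse domain label from a file's first folder under DOCS_DIR.

    e.g. ./docs/architecture/overview.txt -> "architecture"
         ./docs/notes.txt                 -> "general" (no subfolder)
    """
    rel = os.path.relpath(path, DOCS_DIR)
    parts = rel.split(os.sep)
    return parts[0] if len(parts) > 1 else "general"
```

In ingest(), extend the metadata line to all_meta.append({"source": doc["path"], "chunk_index": i, "domain": domain_for(doc["path"])}) and re-run the indexing so the field is available for where filters.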
Logging and Tracing
Add a thin logging wrapper around ask_claw to capture inputs, outputs, latency, and chunk sources for debugging:
import json
import time
from pathlib import Path
LOG_FILE = Path("./rag_trace.jsonl")
def logged_rag_query(query: str) -> str:
    t0 = time.monotonic()
    chunks = retrieve(query)
    reranked = rerank(query, chunks)
    prompt = build_rag_prompt(query, reranked)
    answer = ask_claw(prompt)
    elapsed = round(time.monotonic() - t0, 3)
    # Context manager ensures the log file handle is flushed and closed
    with LOG_FILE.open("a") as f:
        f.write(json.dumps({
            "query": query,
            "chunks_retrieved": len(chunks),
            "chunks_after_rerank": len(reranked),
            "answer_preview": answer[:200],
            "latency_s": elapsed,
        }) + "\n")
    return answer
For data-aware agent frameworks that take a similar approach, LlamaIndex Agents: Build Tool-Using Agents Over Your Data is a strong reference for comparison.
Handling the 120-Second Claw Timeout
For large corpora that produce long prompts, the subprocess.run timeout of 120 seconds may need tuning. Set it explicitly and catch subprocess.TimeoutExpired:
try:
    result = subprocess.run(
        [CLAW_BINARY, "prompt", prompt],
        capture_output=True,
        text=True,
        timeout=300,  # 5 minutes for complex synthesis tasks
    )
except subprocess.TimeoutExpired:
    return "[ERROR] claw timed out — try a shorter or more specific query."
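Timeouts also combine well with retries, since a transient API failure often succeeds on the next attempt. A minimal sketch of a generic wrapper (run_with_retry and its attempt/backoff numbers are illustrative, not part of agent.py):

```python
import subprocess
import time

def run_with_retry(cmd: list[str], attempts: int = 3, timeout: int = 300,
                   backoff_s: float = 2.0) -> str:
    """Run a CLI command, retrying on timeout or non-zero exit with linear backoff."""
    last_err: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            if result.returncode == 0:
                return result.stdout.strip()
            last_err = RuntimeError(f"exit {result.returncode}: {result.stderr.strip()}")
        except subprocess.TimeoutExpired as exc:
            last_err = exc
        if attempt < attempts:
            time.sleep(backoff_s * attempt)  # 2s, then 4s between attempts
    raise RuntimeError(f"command failed after {attempts} attempts") from last_err
```

Swapping the body of ask_claw for answer = run_with_retry([CLAW_BINARY, "prompt", prompt]) gives you both the longer timeout and the retry behavior in one place.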
Frequently Asked Questions
Why use ChromaDB instead of Pinecone or Weaviate?
ChromaDB runs entirely locally without any cloud account or API key, which makes it ideal for development and small-to-medium corpora (under ~1 million documents). For production deployments serving high query volumes or requiring multi-tenant isolation, a managed vector service like Pinecone offers better scalability. The code in this tutorial can be swapped to Pinecone by replacing the chromadb.PersistentClient calls with the Pinecone Python client — the retrieval interface is nearly identical.
How do I update the index when source documents change?
Use ChromaDB’s upsert method with the same IDs as the original chunks. Since IDs are constructed as {path}::chunk{i}, re-running ingest.py after editing a file will overwrite only the chunks belonging to that file. For deleted files, call collection.delete(where={"source": old_path}) before re-ingesting.
Can I run the Claw binary inside a Docker container?
Yes — the Claw Code repository ships a Containerfile for exactly this use case. Build the image with docker build -f Containerfile -t claw-code ., then mount your chroma_store/ and docs/ directories as volumes. Update CLAW_BINARY in agent.py to the path inside the container or pass it as an environment variable.
What chunk size should I use?
500–800 characters (roughly 125–200 tokens at the common four-characters-per-token estimate) works well for most technical documentation. Smaller chunks improve retrieval precision but may lose necessary surrounding context. Larger chunks preserve context but dilute the embedding signal, hurting similarity scores. If you’re indexing code files, chunk by function boundary rather than by character count: split on def or class delimiters for better semantic granularity.
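As a starting point for boundary-based chunking, here is a deliberately simple sketch (chunk_python_source is a name chosen for illustration). It splits only on column-0 def and class lines, so nested functions stay inside their parent's chunk; decorators written above a definition would land in the preceding chunk, a limitation a real implementation should handle:

```python
import re

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into chunks at top-level def/class boundaries.

    Lines before the first definition (imports, module docstring) form the
    first chunk. Only definitions starting at column 0 open a new chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in source.splitlines(keepends=True):
        # re.match anchors at the start of the line, so indented defs don't split
        if re.match(r"(def|class)\s", line) and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Each resulting chunk can then go through the same upsert path as the text chunks, with the function or class name stored in metadata if you want to filter on it later.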
Is there a way to stream the Claw output instead of waiting for the full response?
The claw prompt subcommand does not currently expose a streaming flag in its CLI interface. For streaming output, you would need to interact with the Anthropic API directly in Python and bypass the Claw binary for the generation step — while still using Claw for other agentic workflows (tool use, file editing, etc.). Watch the ultraworkers/claw-code repository for streaming support in future releases.