The two modules that separate OpenJarvis from a basic LLM wrapper are the Engine and Learning layers. Most tutorials stop at “run Ollama, send a prompt, print the result.” This article does not. Here you will configure production-grade inference backends, understand the memory architecture that makes agents genuinely stateful, build a RAG knowledge base from real documents, and benchmark the supported engines against one another so you can make a data-driven choice for your use case.
Before proceeding, make sure you have completed the installation walkthrough in How to Install OpenJarvis. This article assumes a working config.toml and a running inference backend. If you are unfamiliar with the five-module architecture, revisit What Is OpenJarvis? first — understanding how Engine and Learning interact with the Intelligence and Agent modules is essential context for everything below.
The Engine Module
The Engine module is OpenJarvis’s hardware abstraction layer. Its job is deceptively simple: accept a prompt from the Intelligence module, route it to the right inference backend, and return a completion. In practice, this abstraction is what makes OpenJarvis genuinely portable — you can migrate from a laptop running Ollama to a multi-GPU inference cluster running vLLM by changing two lines in a TOML file, with zero changes to agent logic, memory configuration, or tool definitions.
At the implementation level, every backend is registered as an EngineAdapter conforming to a common protocol:
from typing import Iterator, Protocol

class EngineAdapter(Protocol):
    def complete(self, prompt: str, params: InferenceParams) -> Completion: ...
    def stream(self, prompt: str, params: InferenceParams) -> Iterator[str]: ...
    def health(self) -> HealthStatus: ...
    def model_info(self) -> ModelInfo: ...
The four-method contract is intentionally minimal. OpenJarvis does not require backends to support batching, structured generation, or multi-modal inputs — those are opt-in capabilities that specific adapters advertise through ModelInfo.capabilities. The Intelligence module queries these capabilities at initialization time and enables features accordingly. If your backend cannot stream, Intelligence falls back to blocking completions. If it cannot produce logprobs, token budget estimation uses heuristics instead.
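The capability negotiation described above can be sketched in a few lines. This is a minimal illustration only: the ModelInfo fields and capability flag names ("stream", "logprobs", "grammar") are assumptions for the sketch, not the actual OpenJarvis source.

```python
from dataclasses import dataclass, field

@dataclass
class ModelInfo:
    name: str
    # hypothetical capability flags; real field names may differ
    capabilities: set = field(default_factory=set)

def select_features(info: ModelInfo) -> dict:
    """Enable a feature only when the adapter advertises support,
    otherwise fall back to the safer default described above."""
    return {
        "streaming": "stream" in info.capabilities,            # else blocking complete()
        "token_budget": "logprobs" in info.capabilities,       # else heuristic estimation
        "structured_output": "grammar" in info.capabilities,   # else free-form text
    }

# An adapter that can stream but produces no logprobs or grammar support
features = select_features(ModelInfo("mistral:7b", {"stream"}))
print(features)
```

The point of the minimal contract is exactly this: Intelligence adapts to whatever the adapter advertises, so a new backend only has to implement four methods to participate.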
The five supported backends map onto different points on the latency-throughput-complexity curve:
| Backend | Type | Primary Strength | Typical Deployment |
|---|---|---|---|
| Ollama | Local REST server | Zero-config developer experience | Laptops, dev machines |
| vLLM | High-throughput local server | Continuous batching, PagedAttention | Multi-GPU servers, team deployments |
| SGLang | Structured generation runtime | Grammar-constrained outputs, radix cache | Tool-calling agents, JSON-heavy workflows |
| llama.cpp | Embedded inference library | Minimal dependencies, CPU/edge capable | Air-gapped systems, edge hardware |
| Cloud API | OpenAI-compatible remote API | Frontier model access, no hardware needed | Fallback, frontier model tasks |
Engine selection has a direct and measurable impact on every latency-sensitive interaction. The sections below cover each backend’s configuration in the depth required to optimize it for real workloads.
Configuring Ollama
Ollama remains the recommended starting point because it handles the hardest parts of local model serving automatically: model weight downloading, quantization selection, GPU memory management, and a clean HTTP API. For a single-developer workflow, it is rarely the bottleneck.
Basic Configuration
The minimal working Ollama configuration:
[engine]
default = "ollama"
[engine.ollama]
host = "http://localhost:11434"
timeout_seconds = 120
keep_alive = "5m" # how long Ollama keeps the model loaded after last request
The keep_alive parameter is easy to overlook but has a significant impact on perceived latency. When set to "5m", Ollama keeps model weights in VRAM for five minutes after the last request, so follow-up queries do not pay the model-loading penalty (typically 3–8 seconds for a 7B model). For interactive chat workloads, set this to "30m" or "1h". For batch processing where you want VRAM freed between jobs, set it to "0".
GPU Allocation and Model Selection
Ollama automatically detects and uses available GPUs, but you can control VRAM allocation explicitly. First, identify your hardware:
ollama ps # show loaded models and their VRAM consumption
nvidia-smi # NVIDIA GPU memory overview (if applicable)
The relationship between model size, quantization level, and required VRAM follows a predictable formula. For a transformer model with P billion parameters at quantization level Q bits per weight:
VRAM (GB) ≈ (P × Q) / 8 + overhead (1–2 GB)
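As a sanity check, the rule of thumb translates directly into code. The overhead figure here is a rough assumption, not a measured constant:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """VRAM (GB) ~= (P * Q) / 8 + overhead, per the formula above."""
    return (params_billion * bits_per_weight) / 8 + overhead_gb

# 7B at Q4: 7 * 4 / 8 = 3.5 GB of weights, plus ~1.5 GB overhead
print(estimate_vram_gb(7, 4))   # -> 5.0
# 34B at Q4: 17 GB of weights; fits a 24 GB RTX 4090 with room to spare
print(estimate_vram_gb(34, 4))  # -> 18.5
```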
Practical targets for common hardware:
# RTX 3060 (12 GB VRAM) — good for 7B at Q8 or 13B at Q4
[intelligence]
default_model = "mistral:7b-instruct-q8_0"
# RTX 4090 (24 GB VRAM) — can run 13B at Q8 or 34B at Q4
[intelligence]
default_model = "codellama:34b-instruct-q4_K_M"
# Apple M2 Pro (16 GB unified memory) — 13B at Q4 is comfortable
[intelligence]
default_model = "llama2:13b-chat-q4_K_M"
# CPU only (32 GB RAM) — stay in the 7–8B range at Q4 for reasonable speed
[intelligence]
default_model = "qwen3:8b-q4_K_M"
Performance Tuning
For maximum Ollama throughput on a single-user workload, the following configuration options in config.toml are worth tuning:
[engine]
default = "ollama"
[engine.ollama]
host = "http://localhost:11434"
timeout_seconds = 180
keep_alive = "30m"
num_ctx = 8192 # context window size; higher uses more VRAM
num_predict = 512 # max tokens to generate per request
num_thread = 8 # CPU threads for prompt preprocessing
temperature = 0.7
[engine.ollama.gpu]
layers = -1 # -1 = offload all layers to GPU (recommended)
f16_kv = true # use float16 for key/value cache (saves VRAM)
The num_ctx setting deserves special attention. It controls how many tokens the model can consider in its context window at once. For conversational use, 4096 is sufficient. For document analysis or long code files, push it to 8192 or 16384 — but note that VRAM consumption scales linearly with context length for most architectures. A 7B model at num_ctx = 16384 uses roughly double the KV-cache VRAM of the same model at 8192.
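The linear scaling claim is easy to verify with a back-of-the-envelope estimate. The architecture numbers below are for Mistral-7B (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an f16 cache, matching the f16_kv setting above; treat the result as illustrative rather than exact:

```python
def kv_cache_gib(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache size: a K and a V tensor per layer, per context position."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
    return n_ctx * per_token_bytes / 2**30

print(kv_cache_gib(8192))   # -> 1.0
print(kv_cache_gib(16384))  # -> 2.0  (doubling num_ctx doubles the cache)
```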
Configuring vLLM and SGLang
When your workload outgrows Ollama — typically when you need concurrent users, high-volume batch processing, or maximum tokens-per-second throughput — vLLM and SGLang are the right tools. Both require more setup than Ollama but deliver meaningfully better performance at scale.
vLLM
vLLM implements PagedAttention, a memory management technique that treats the KV cache like virtual memory pages rather than a fixed contiguous buffer. The result is significantly higher throughput for batched requests and better VRAM utilization under concurrent load.
Install vLLM in the same virtual environment as OpenJarvis:
pip install vllm
Start the vLLM server (this is the equivalent of ollama serve):
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 1 # set to 2+ for multi-GPU
The --gpu-memory-utilization 0.90 flag tells vLLM to use 90% of available VRAM for the KV cache pool. The remaining 10% is headroom for activations and other runtime overhead. On a tight 12 GB card, you may need to drop this to 0.85 to avoid OOM errors.
Configure OpenJarvis to use the vLLM server:
[engine]
default = "vllm"
[engine.vllm]
host = "http://localhost:8000"
timeout_seconds = 300
model = "mistralai/Mistral-7B-Instruct-v0.2"
max_tokens = 1024
temperature = 0.7
[engine.vllm.sampling]
top_p = 0.95
repetition_penalty = 1.1
For multi-GPU deployments, update the --tensor-parallel-size flag to match the number of GPUs and do not change the OpenJarvis config — the parallelism is handled transparently by the vLLM server:
# Two A100 80GB GPUs — run a 70B model comfortably
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.92 \
--max-model-len 16384
SGLang
SGLang’s core innovation is its radix attention cache: it reuses KV cache computations across requests that share a common prefix (such as a long system prompt repeated across all agent calls). For OpenJarvis workloads where the system prompt, tool definitions, and injected memory context are shared across many queries, this cache reuse can reduce first-token latency by 40–60% compared to Ollama or standard vLLM.
Install SGLang:
pip install "sglang[all]"
Start the SGLang runtime:
python -m sglang.launch_server \
--model-path mistralai/Mistral-7B-Instruct-v0.2 \
--host 0.0.0.0 \
--port 30000 \
--mem-fraction-static 0.88 \
--max-prefill-tokens 16384
Configure OpenJarvis for SGLang:
[engine]
default = "sglang"
[engine.sglang]
host = "http://localhost:30000"
timeout_seconds = 240
model = "mistralai/Mistral-7B-Instruct-v0.2"
max_tokens = 1024
temperature = 0.7
[engine.sglang.options]
enable_torch_compile = true # JIT compilation for ~15% throughput gain
chunked_prefill = true # better memory efficiency for long prompts
SGLang is particularly valuable when the Learning module injects large memory contexts into every prompt. A 2,000-token memory context prepended to every agent call is exactly the kind of shared prefix that SGLang’s radix cache handles well. In practice, after the first few requests warm the cache, subsequent queries with the same memory prefix respond 30–60% faster than they would under Ollama.
Using llama.cpp Directly
llama.cpp occupies a unique position in the OpenJarvis engine lineup: it is the only option that requires neither a running server process nor a GPU. It operates as a Python-called library — inference happens in-process via C++ bindings, with GGUF model files loaded directly from disk.
This makes llama.cpp the engine of choice for:
- Air-gapped deployments with no network access to a local inference server
- Raspberry Pi, Jetson Nano, or similar embedded boards with limited RAM and no NVIDIA GPU
- Docker containers where you want a single process with no sidecar services
- Developer environments where you want instant startup without pre-running a server
Installation and Model Preparation
pip install llama-cpp-python
For GPU acceleration via CUDA (significantly faster than pure CPU):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# note: older llama-cpp-python releases used -DLLAMA_CUBLAS=on instead
Download a GGUF-format model file. The recommended source is Hugging Face, specifically the bartowski or TheBloke namespaces which maintain extensive quantized GGUF collections:
# Using huggingface-cli (pip install huggingface-hub)
huggingface-cli download \
bartowski/Mistral-7B-Instruct-v0.2-GGUF \
Mistral-7B-Instruct-v0.2-Q4_K_M.gguf \
--local-dir ~/.openjarvis/models/
Configuration
[engine]
default = "llamacpp"
[engine.llamacpp]
model_path = "~/.openjarvis/models/Mistral-7B-Instruct-v0.2-Q4_K_M.gguf"
n_ctx = 4096 # context window in tokens
n_threads = 6 # CPU threads for inference
n_gpu_layers = 0 # set to -1 to offload all layers to GPU
temperature = 0.7
max_tokens = 512
[engine.llamacpp.performance]
use_mlock = true # lock model weights in RAM (prevents swapping)
use_mmap = true # memory-map model file (faster cold start)
numa = false # enable only on multi-socket NUMA systems
The n_gpu_layers parameter is the key knob for hybrid CPU/GPU operation. Setting it to -1 offloads all transformer layers to the GPU, maximizing speed on CUDA hardware. Setting it to a positive integer (e.g., 20) offloads that many layers and runs the rest on CPU — useful when your model is too large to fit entirely in VRAM but you still want partial GPU acceleration.
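When picking a positive n_gpu_layers value, a rough estimate of how many layers fit is usually enough. This is a hedged sketch that assumes the quantized weight size is divided evenly across layers, which ignores the embedding and output matrices:

```python
def layers_that_fit(vram_gb: float, model_weights_gb: float,
                    n_layers: int, reserve_gb: float = 2.0) -> int:
    """Approximate how many transformer layers fit in VRAM,
    reserving headroom for the KV cache and activations."""
    per_layer_gb = model_weights_gb / n_layers
    budget = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget / per_layer_gb))

# A 13B model at Q4 (~7.4 GB of weights, 40 layers) on an 8 GB card:
# offload roughly 32 of 40 layers and run the rest on CPU
print(layers_that_fit(8.0, 7.4, 40))   # -> 32
print(layers_that_fit(24.0, 7.4, 40))  # -> 40 (everything fits; use -1)
```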
On CPU-only hardware, the primary levers for performance are n_threads (set to your physical core count, not logical/hyperthreaded count) and quantization level (Q4_K_M strikes the best balance of quality and speed for most use cases).
The Learning Module
The Learning module is what makes OpenJarvis agents genuinely useful over time rather than stateless question-answering machines. It implements persistent memory through a combination of vector similarity search and structured storage, giving agents the ability to remember facts, recall past interactions, and build up domain knowledge incrementally.
Memory Architecture Overview
OpenJarvis Learning uses a three-tier memory architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Working Memory │
│ In-process token buffer — current conversation context only │
│ Lives in RAM, lost on process exit │
└──────────────────────────────┬──────────────────────────────────┘
│ flush on session end
▼
┌─────────────────────────────────────────────────────────────────┐
│ Episodic Memory │
│ Recent sessions, interaction summaries, user preferences │
│ Stored as embeddings in SQLite with vector extension │
│ Retained for configurable window (default: 90 days) │
└──────────────────────────────┬──────────────────────────────────┘
│ distillation (nightly by default)
▼
┌─────────────────────────────────────────────────────────────────┐
│ Semantic Memory │
│ Ingested documents, distilled facts, long-term knowledge │
│ Stored as embeddings in SQLite or external vector DB │
│ Retained indefinitely (manual pruning required) │
└─────────────────────────────────────────────────────────────────┘
At query time, the Intelligence module retrieves from both Episodic and Semantic memory. It runs two similarity searches in parallel: one against the episodic store (looking for relevant past interactions) and one against the semantic store (looking for relevant ingested knowledge). The top-k results from each are merged, ranked by a weighted combination of semantic similarity and recency, and injected into the prompt as a memory context block.
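The merge-and-rank step can be sketched as a weighted score. A minimal illustration follows; the actual weights and recency curve OpenJarvis uses are not specified here, so the 0.7/0.3 split and 30-day half-life are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    content: str
    similarity: float   # cosine similarity in [0, 1]
    age_days: float     # days since the memory was written

def rank(hits: list, top_k: int = 8, sim_w: float = 0.7,
         rec_w: float = 0.3, half_life_days: float = 30.0) -> list:
    """Merge episodic + semantic hits, scoring each by a weighted
    combination of similarity and exponentially decaying recency."""
    def score(h: Hit) -> float:
        recency = 0.5 ** (h.age_days / half_life_days)
        return sim_w * h.similarity + rec_w * recency
    return sorted(hits, key=score, reverse=True)[:top_k]

hits = [Hit("old but on-topic", 0.9, 120), Hit("fresh and close", 0.8, 1)]
print([h.content for h in rank(hits)])
```

Under these weights a slightly less similar but much fresher memory can outrank an older, closer match, which is exactly the behavior the recency term exists to produce.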
Storage Configuration
The default SQLite backend works without any additional installation and is suitable for single-user workloads with up to a few hundred thousand stored memory chunks:
[storage]
backend = "sqlite"
path = "~/.openjarvis/memory.db"
embedding_dim = 768 # must match your embedding model's output dimension
top_k = 8 # number of memories to retrieve per query
similarity = "cosine" # cosine | dot_product | euclidean
[storage.episodic]
retention_days = 90 # older episodes are compressed and distilled
max_entries = 10000 # cap before oldest entries are evicted
[storage.semantic]
max_entries = 100000 # larger; semantic knowledge grows slowly
For teams or larger knowledge bases, OpenJarvis supports external vector databases. Switch to Pinecone, Qdrant, or Chroma by changing the backend field:
[storage]
backend = "qdrant"
[storage.qdrant]
host = "http://localhost:6333"
collection = "openjarvis_memory"
embedding_dim = 768
api_key = "${QDRANT_API_KEY}" # optional; required for Qdrant Cloud
Disclosure: This article contains affiliate links. We may earn a commission at no extra cost to you.
If you are choosing between vector database providers, Pinecone offers a managed cloud option with a generous free tier that is particularly convenient if you want to host your OpenJarvis knowledge base without managing infrastructure. Qdrant is the recommended self-hosted option.
The Embedding Model
Learning quality depends as much on the embedding model as on the underlying LLM. OpenJarvis uses a separate, lightweight embedding model to convert text into vectors — this is distinct from the model used for generation. The default is all-MiniLM-L6-v2 (384 dimensions, fast), but for better retrieval quality, switch to a larger model:
[learning]
embedding_model = "BAAI/bge-base-en-v1.5" # 768-dim, strong English retrieval
embedding_device = "cpu" # cpu | cuda | mps
embedding_batch_size = 32 # how many chunks to embed in parallel
[learning.chunking]
strategy = "semantic" # semantic | fixed | sentence
chunk_size = 512 # tokens per chunk (semantic mode uses this as target)
chunk_overlap = 64 # overlap between adjacent chunks
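Fixed-size chunking with overlap, the simplest of the three strategies, works as sketched below. For illustration the sketch operates on an arbitrary token list rather than real tokenizer output:

```python
def chunk(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Split a token list into windows of `size`, each starting
    `size - overlap` tokens after the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk(list(range(1200)))
print(len(chunks), [len(c) for c in chunks])  # -> 3 [512, 512, 304]
# each adjacent pair shares exactly `overlap` tokens of context
assert chunks[0][-64:] == chunks[1][:64]
```

The overlap exists so that a fact straddling a chunk boundary still appears intact in at least one chunk; the semantic strategy targets the same size but prefers to break at topical boundaries.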
The embedding_dim in [storage] must match the output dimension of embedding_model. If you change the embedding model on an existing database, you must re-embed all stored content — OpenJarvis provides a CLI command for this:
jarvis memory reindex --embedding-model BAAI/bge-base-en-v1.5
This process is slow (expect minutes to hours depending on database size) but is only required once when changing embedding models.
Building a RAG Knowledge Base
The Learning module’s document ingestion pipeline turns your own files — PDFs, Markdown, source code, web pages — into searchable knowledge that the agent draws on automatically. This is the OpenJarvis equivalent of the RAG (Retrieval-Augmented Generation) patterns covered in depth in Getting Started with LlamaIndex, but implemented as a persistent, session-spanning memory store rather than a per-query retrieval pipeline.
Ingesting Documents via CLI
Single file ingestion:
jarvis memory ingest --file /path/to/research-paper.pdf --tag arxiv --tag llm-research
Bulk ingestion from a directory:
jarvis memory ingest \
--dir /path/to/documents/ \
--recursive \
--glob "*.{md,pdf,txt,py}" \
--tag project-docs \
--chunk-size 512 \
--chunk-overlap 64
Check ingestion progress and current database state:
jarvis memory stats
# Episodic entries: 1,247
# Semantic entries: 8,934
# Total chunks: 10,181
# Database size: 48.3 MB
# Embedding model: BAAI/bge-base-en-v1.5 (768-dim)
Ingesting Documents via Python SDK
For programmatic ingestion — useful in automated pipelines that monitor a directory and ingest new files as they arrive:
from openjarvis import Jarvis
from openjarvis.core.config import JarvisConfig
from openjarvis.learning import DocumentIngestor, IngestConfig
from pathlib import Path
config = JarvisConfig.from_toml("~/.openjarvis/config.toml")
ingest_cfg = IngestConfig(
chunk_size=512,
chunk_overlap=64,
chunking_strategy="semantic",
embedding_model="BAAI/bge-base-en-v1.5",
tags=["project-docs", "v2-release"],
)
with Jarvis(config=config) as j:
ingestor = DocumentIngestor(jarvis=j, config=ingest_cfg)
# Ingest a single PDF
result = ingestor.ingest_file(Path("architecture-overview.pdf"))
print(f"Ingested {result.chunks_created} chunks from {result.source}")
# Ingest an entire directory
results = ingestor.ingest_directory(
Path("/home/user/project-notes/"),
glob_pattern="**/*.md",
recursive=True,
)
total_chunks = sum(r.chunks_created for r in results)
print(f"Ingested {len(results)} files, {total_chunks} total chunks")
Querying the Knowledge Base Directly
Beyond automatic memory injection, you can query the RAG store directly — useful for building search tools, validation scripts, or debugging retrieval quality:
from openjarvis import Jarvis
from openjarvis.core.config import JarvisConfig
from openjarvis.learning import MemoryRetriever, RetrievalQuery
config = JarvisConfig.from_toml("~/.openjarvis/config.toml")
with Jarvis(config=config) as j:
retriever = MemoryRetriever(jarvis=j)
# Retrieve top-5 chunks most similar to the query
query = RetrievalQuery(
text="How does the system handle authentication?",
top_k=5,
stores=["semantic"], # semantic | episodic | both
min_score=0.65, # filter out low-confidence matches
tags=["project-docs"], # optional tag filter
)
results = retriever.retrieve(query)
for hit in results:
print(f"Score: {hit.score:.3f} | Source: {hit.metadata['source']}")
print(f"Content: {hit.content[:300]}\n")
Configuring Auto-Injection Behavior
Control how retrieved memories are injected into agent prompts:
[learning]
auto_inject = true # automatically prepend memories to every prompt
inject_episodic = true # include recent conversation summaries
inject_semantic = true # include ingested document knowledge
max_inject_tokens = 1500 # cap on memory context to avoid crowding the prompt
inject_top_k = 6 # max memories to inject
inject_min_score = 0.60 # minimum similarity threshold for injection
memory_format = "summarized" # summarized | verbatim | structured
[learning.distillation]
enabled = true # nightly compression of old episodic memories
schedule = "02:00" # run at 2 AM local time
target_compression = 0.3 # compress to 30% of original size
The max_inject_tokens cap is critical. If the Learning module injects too many memories, the combined system prompt + memories + current query may exceed the model’s context window, causing truncation or errors. A safe rule of thumb: keep injected memory under 25% of your configured num_ctx.
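The 25% rule of thumb is easy to enforce with a pre-flight check. The sketch below uses a crude 4-characters-per-token estimate, since exact counts depend on the tokenizer in use:

```python
def within_memory_budget(memory_text: str, num_ctx: int,
                         max_fraction: float = 0.25,
                         chars_per_token: float = 4.0) -> bool:
    """Rough pre-flight check: keep injected memory under
    max_fraction of the configured context window."""
    est_tokens = len(memory_text) / chars_per_token
    return est_tokens <= max_fraction * num_ctx

# A 1,500-token memory block (~6,000 chars) is fine at num_ctx = 8192
print(within_memory_budget("x" * 6000, 8192))  # -> True
# ...but blows the 25% budget at num_ctx = 4096
print(within_memory_budget("x" * 6000, 4096))  # -> False
```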
Engine Benchmarking
The right engine depends on your hardware, latency requirements, and workload characteristics. The benchmarks below were produced by running 100 identical requests through each engine configuration using a standardized 7B model (Mistral-7B-Instruct-v0.2 in Q4_K_M quantization for local backends), measuring first-token latency and sustained throughput, on a machine with an NVIDIA RTX 4080 (16 GB VRAM), AMD Ryzen 9 7950X (16 cores), and 64 GB DDR5 RAM.
| Engine | Type | First Token Latency (p50) | First Token Latency (p99) | Throughput (tok/s) | Best For |
|---|---|---|---|---|---|
| Ollama | Local server | 280 ms | 620 ms | 58 tok/s | Developer workflows, interactive chat |
| vLLM | Local server | 190 ms | 380 ms | 127 tok/s | High-volume batch, multi-user concurrent |
| SGLang | Local server | 110 ms* | 240 ms* | 104 tok/s | Repeated prefix workloads, structured output |
| llama.cpp (GPU) | In-process | 320 ms | 710 ms | 44 tok/s | Single-process deployments, no server overhead |
| llama.cpp (CPU) | In-process | 2,800 ms | 5,200 ms | 9 tok/s | Air-gapped edge hardware, CPU-only environments |
| Cloud API (GPT-4o-mini) | Remote API | 850 ms | 2,100 ms | ~60 tok/s** | Frontier model quality, no local hardware |
* SGLang’s latency numbers reflect cache-warm requests. First-request latency (cold cache) is comparable to Ollama at ~280 ms. After 5–10 requests with the same system prompt prefix, the radix cache is populated and latency drops to the figures shown.
** Cloud API throughput is network-limited and subject to rate throttling. Actual throughput varies significantly based on provider load and tier limits.
Key takeaways from this benchmark data:
- vLLM doubles Ollama’s throughput for sustained workloads. If you process more than a few hundred agent calls per hour, the setup complexity of vLLM pays off quickly.
- SGLang’s advantage is workload-specific. For OpenJarvis’s memory-injection pattern (same 1,500-token memory prefix on every call), SGLang’s p50 latency outperforms every other local option after cache warm-up.
- llama.cpp GPU mode is close to Ollama in latency but lacks Ollama’s convenient model management and server-mode features. Its real advantage is the single-process deployment model, not raw performance.
- Cloud API latency is dominated by network round-trips. On the same machine with a 10ms local network to the vLLM server, local inference is 4–8x faster on first-token latency than GPT-4o-mini via the OpenAI API.
For the Learning module specifically, the combination of SGLang (for shared-prefix cache reuse) with a 768-dimension embedding model (for retrieval quality) delivers the best overall agent experience on hardware with 16+ GB VRAM. For CPU-only or low-VRAM environments, Ollama with a Q4_K_M quantized model and a lightweight embedding model (384-dim MiniLM) is the practical choice.
Frequently Asked Questions
Can I switch engines mid-session without restarting?
Yes, but with caveats. OpenJarvis supports runtime engine switching through both the CLI and Python SDK:
jarvis ask --engine vllm "Generate a detailed analysis of this code."
with Jarvis(config=config) as j:
# Switch engine for a specific call
response = j.ask(
"Summarize this 50-page PDF in detail.",
engine_override="vllm", # uses vLLM for this call only
)
The switch takes effect immediately for the next request. Working Memory (the current conversation buffer) is preserved across the switch because it lives in the Intelligence module, not the Engine. However, if the two engines are running different model sizes (e.g., Ollama with a 7B model and vLLM with a 34B model), the resulting response style may change noticeably. The Learning module’s episodic and semantic stores are completely unaffected — they are storage-layer components with no dependency on the active engine.
One important constraint: if you switch from a model with a 4096-token context window to one with 8192 tokens, OpenJarvis does not automatically re-inject additional memory context for the current session. You need to start a new session (or call j.reset_context()) to get the full benefit of the larger context window.
How do I use multiple GPUs with OpenJarvis?
Multiple GPU support is handled at the inference backend level, not within OpenJarvis itself. The recommended approach is:
For vLLM: Start the server with --tensor-parallel-size N where N is the number of GPUs. vLLM handles weight sharding across GPUs using tensor parallelism automatically:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 1 \
--gpu-memory-utilization 0.92
For Ollama: Ollama 0.3+ supports multi-GPU automatically on NVIDIA hardware. Set the CUDA_VISIBLE_DEVICES environment variable to control which GPUs Ollama uses:
CUDA_VISIBLE_DEVICES=0,1 ollama serve
For llama.cpp: The library can split a model’s layers across multiple GPUs via its tensor_split setting, used together with n_gpu_layers. This requires direct use of the llama-cpp-python API rather than the config.toml interface.
From OpenJarvis’s perspective, all of this is transparent — you configure the host URL and the framework routes requests to whatever the server exposes. OpenJarvis has no awareness of how many physical GPUs sit behind that endpoint.
What happens when the Learning module hits storage limits?
OpenJarvis handles storage pressure through a combination of automatic eviction and distillation, controlled by the [storage.episodic] and [storage.semantic] configuration blocks.
When episodic storage exceeds max_entries:
- The distillation process runs immediately (rather than waiting for its next scheduled run)
- The oldest 20% of episodic entries are compressed into summary entries (short descriptions of what happened in those sessions)
- The original entries are deleted, freeing space
- Summaries are retained in the semantic store as long-term episodic knowledge
When semantic storage hits max_entries:
- OpenJarvis logs a warning and rejects new ingestion attempts
- The agent continues working with existing knowledge — no data is lost or evicted automatically
- You must manually prune entries: run jarvis memory prune --older-than 180d --store semantic, or re-run ingestion with the --replace flag to overwrite existing chunks from the same source
The distinction is intentional: episodic memory (interaction history) is naturally evictable and compressible. Semantic memory (deliberately ingested knowledge) is treated as precious and never auto-deleted. If you are approaching semantic storage limits, the right response is to upgrade to an external vector database (Qdrant, Pinecone) with no practical size cap, rather than letting OpenJarvis auto-prune knowledge you intentionally added.
Monitor storage health proactively:
jarvis memory stats --verbose
# Will show per-store sizes, oldest/newest entry timestamps,
# estimated time until limits are reached at current ingestion rate,
# and compression ratio of distilled entries
Can I fine-tune models and use them with OpenJarvis?
Yes — OpenJarvis has no restrictions on model origin, as long as the fine-tuned model is available through one of the supported backends. The workflow looks like this:
- Fine-tune your model using a framework like Axolotl, LLaMA-Factory, or Hugging Face TRL
- Export to GGUF (for llama.cpp) or keep as a Hugging Face-format checkpoint (for vLLM/SGLang)
- Register with your backend:
# For Ollama — create a Modelfile pointing to your fine-tuned GGUF
cat > Modelfile <<'EOF'
FROM /path/to/your-finetuned-model-Q4_K_M.gguf
SYSTEM "You are a specialized coding assistant trained on internal APIs."
PARAMETER temperature 0.5
PARAMETER num_ctx 8192
EOF
ollama create mycompany-coder -f Modelfile
ollama run mycompany-coder "test"
# Reference the custom model in config.toml
[intelligence]
default_model = "mycompany-coder"
For vLLM with a Hugging Face checkpoint:
python -m vllm.entrypoints.openai.api_server \
--model /path/to/your-finetuned-checkpoint \
--tokenizer /path/to/your-finetuned-checkpoint \
--served-model-name mycompany-coder
The Learning module’s RAG pipeline works identically with fine-tuned models — embeddings for memory retrieval are generated by the separate embedding model, not the generative model, so fine-tuning the generative model has no impact on retrieval quality.
Next Steps
With Engine and Learning configured, you have access to the full depth of what OpenJarvis can do. The natural path forward from here depends on your use case:
For deeper memory architecture understanding: The way OpenJarvis implements persistent agent memory is closely related to Letta’s (formerly MemGPT) memory system. Reading the Letta Memory Architecture deep dive gives you a rigorous theoretical foundation for the episodic/semantic distinction and how different stateful agent frameworks approach the same problem differently.
For RAG pipeline fundamentals: OpenJarvis’s Learning module is a specialized application of the retrieval-augmented generation pattern. If you want to understand the underlying mechanics — chunking strategies, embedding models, retrieval algorithms, reranking — the Getting Started with LlamaIndex series covers these concepts in depth with framework-agnostic examples you can apply back to OpenJarvis.
For production deployment: Once you move beyond development, switch your storage backend from SQLite to a dedicated vector database (Qdrant for self-hosted, Pinecone for managed), replace Ollama with vLLM for throughput, and set up the [telemetry] module to log latency and memory hit rates to a monitoring system. These three changes are the difference between a developer’s local tool and a reliable production service.
For automated content generation: The Python SDK’s programmatic control over agents, combined with Learning’s document ingestion pipeline, makes OpenJarvis a natural fit for automated research and content workflows. Pair it with a task scheduler and a structured output schema to build pipelines that ingest new documents, synthesize knowledge, and generate structured outputs — entirely on your own hardware.