
Letta Deployment and Production: Hosting Persistent Agents at Scale

#letta #deployment #production #docker #server #scaling #memgpt

Letta’s Deployment Model

Letta runs as a server — a persistent process that hosts all your agents, their memories, and tools. Unlike stateless function calls, Letta agents live on the server between requests.

This means deployment is different from deploying a Python script: you’re running a long-lived service that needs uptime, storage, and proper configuration.

Running Letta Server

Local Development

# Install
pip install letta

# Start server (default: localhost:8283)
letta server

# Check it's running
curl http://localhost:8283/v1/health
# → {"status": "ok"}

Configuration File

Letta reads from ~/.letta/config:

letta configure
# Interactive setup: sets LLM provider, embedding model, storage backend

Or set via environment variables:

export OPENAI_API_KEY="sk-..."
export LETTA_PG_URI="postgresql://user:pass@localhost/letta"  # optional: use PostgreSQL
letta server

Server Options

# --host 0.0.0.0 listens on all interfaces; --debug enables verbose logging
letta server \
  --host 0.0.0.0 \
  --port 8283 \
  --debug

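Outside Docker, a process supervisor keeps the server running across crashes and reboots. A minimal systemd unit sketch (the paths, the letta service user, and the /etc/letta/env file are assumptions for illustration):

```ini
[Unit]
Description=Letta agent server
After=network.target

[Service]
User=letta
# Keep API keys out of the unit file; load them from an env file instead
EnvironmentFile=/etc/letta/env
ExecStart=/usr/local/bin/letta server --host 0.0.0.0 --port 8283
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now letta, then verify with the health endpoint as above.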
Docker Deployment

Single-Container Setup

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN pip install letta

# Copy configuration (or use env vars)
ENV OPENAI_API_KEY=""
ENV LETTA_SERVER_PASS=""  # server authentication token

EXPOSE 8283

CMD ["letta", "server", "--host", "0.0.0.0", "--port", "8283"]

Build and run the image:

docker build -t letta-server .
docker run -d \
  -p 8283:8283 \
  -e OPENAI_API_KEY="sk-..." \
  -e LETTA_SERVER_PASS="your-token" \
  -v letta-data:/root/.letta \
  --name letta \
  letta-server

Docker Compose with PostgreSQL

For production, use PostgreSQL instead of SQLite:

# docker-compose.yml
version: "3.8"

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: letta
      POSTGRES_USER: letta
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pg-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U letta"]
      interval: 10s
      retries: 5

  letta:
    # For production, build the Dockerfile above into an image instead of
    # reinstalling letta on every container start
    image: python:3.11-slim
    command: sh -c "pip install letta && letta server --host 0.0.0.0 --port 8283"
    ports:
      - "8283:8283"
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      LETTA_SERVER_PASS: ${LETTA_SERVER_PASS}
      LETTA_PG_URI: postgresql://letta:${POSTGRES_PASSWORD}@postgres/letta
    depends_on:
      postgres:
        condition: service_healthy
    restart: always

volumes:
  pg-data:

Create a .env file next to the compose file:

# .env
OPENAI_API_KEY=sk-...
POSTGRES_PASSWORD=secure-password-here
LETTA_SERVER_PASS=api-access-token-here
Then start the stack:

docker compose up -d
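The postgres service above has a healthcheck, but the letta service does not. One can be added using a probe that works inside python:3.11-slim (which ships without curl); this is a sketch to merge into the letta service definition:

```yaml
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8283/v1/health')"]
      interval: 30s
      retries: 3
```

With this in place, orchestrators and `docker compose ps` can distinguish a container that started from a server that is actually answering requests.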

Integrating with Web Applications

REST API

Letta exposes a full REST API. Authenticate with the server password:

import httpx

base_url = "http://localhost:8283"
headers = {"Authorization": "Bearer your-server-pass"}

# Create an agent via REST
response = httpx.post(
    f"{base_url}/v1/agents/",
    headers=headers,
    json={
        "name": "my_agent",
        "system": "You are a helpful assistant.",
        "llm_config": {
            "model": "gpt-4o-mini",
            "model_endpoint_type": "openai",
            "model_endpoint": "https://api.openai.com/v1",
            "context_window": 128000,
        },
        "embedding_config": {
            "embedding_model": "text-embedding-3-small",
            "embedding_endpoint_type": "openai",
            "embedding_endpoint": "https://api.openai.com/v1",
            "embedding_dim": 1536,
        },
        "memory": {
            "memory": {
                "human": {"label": "human", "value": "", "limit": 2000},
                "persona": {"label": "persona", "value": "I am a helpful assistant.", "limit": 2000},
            }
        },
    },
)
agent_id = response.json()["id"]

# Send a message
msg_response = httpx.post(
    f"{base_url}/v1/agents/{agent_id}/messages",
    headers=headers,
    json={"messages": [{"role": "user", "text": "Hello, remember my name is Alex."}]},
)
print(msg_response.json()["messages"][-1]["text"])

FastAPI Integration

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from letta import create_client

app = FastAPI()
client = create_client(base_url="http://localhost:8283", token="your-server-pass")

# Cache: user_id → agent_id mapping
user_agents: dict[str, str] = {}

class ChatRequest(BaseModel):
    user_id: str
    message: str

class ChatResponse(BaseModel):
    response: str
    agent_id: str

def get_or_create_agent(user_id: str) -> str:
    """Get existing agent for user, or create a new one."""
    if user_id in user_agents:
        return user_agents[user_id]

    # Check if agent exists in Letta
    agents = client.list_agents()
    for agent in agents:
        if agent.name == f"user_{user_id}":
            user_agents[user_id] = agent.id
            return agent.id

    # Create new agent
    from letta.schemas.memory import ChatMemory
    from letta.schemas.llm_config import LLMConfig
    from letta.schemas.embedding_config import EmbeddingConfig

    agent = client.create_agent(
        name=f"user_{user_id}",
        system="You are a helpful personal assistant. Remember user preferences and context.",
        memory=ChatMemory(human="", persona="I am a persistent personal assistant."),
        llm_config=LLMConfig(
            model="gpt-4o-mini",
            model_endpoint_type="openai",
            model_endpoint="https://api.openai.com/v1",
            context_window=128000,
        ),
        embedding_config=EmbeddingConfig(
            embedding_model="text-embedding-3-small",
            embedding_endpoint_type="openai",
            embedding_endpoint="https://api.openai.com/v1",
            embedding_dim=1536,
        ),
    )
    user_agents[user_id] = agent.id
    return agent.id

@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    # Plain def (not async): the Letta client call blocks, so let FastAPI
    # run this handler in its threadpool instead of blocking the event loop
    agent_id = get_or_create_agent(request.user_id)

    response = client.send_message(
        agent_id=agent_id,
        message=request.message,
        role="user",
    )

    text = next(
        (m.text for m in reversed(response.messages) if hasattr(m, "text") and m.text),
        "No response"
    )

    return ChatResponse(response=text, agent_id=agent_id)

@app.get("/agents/{user_id}/memory")
def get_memory(user_id: str):
    agent_id = get_or_create_agent(user_id)
    memory = client.get_core_memory(agent_id)
    return {
        "human": memory.get_block("human").value,
        "persona": memory.get_block("persona").value,
    }

Scaling Considerations

Multiple Workers (Read-Heavy)

For read-heavy workloads (many users reading agent state), run multiple Letta server instances behind a load balancer, all pointing to the same PostgreSQL database:

upstream letta {
    server letta-1:8283;
    server letta-2:8283;
    server letta-3:8283;
}

server {
    listen 80;
    location / { proxy_pass http://letta; }
}

Important: Agent memory writes are serialized per-agent in PostgreSQL. Avoid concurrent writes to the same agent from multiple servers.
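One way to honor that constraint is to pin each user (and therefore each per-user agent) to a single backend using nginx's consistent hashing, keyed on a header the application already sends. A sketch, assuming clients include an X-User-Id header:

```nginx
upstream letta {
    # Route all requests for the same user to the same backend,
    # so writes to a given agent never race across servers
    hash $http_x_user_id consistent;
    server letta-1:8283;
    server letta-2:8283;
    server letta-3:8283;
}
```

The consistent flag keeps most users pinned to the same backend even when servers are added or removed.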

Per-User Agent Management

Don’t create a new agent per request. Create one agent per user and reuse it:

# Good: agent persists, accumulates memory
agent_id = get_or_create_agent(user_id)
client.send_message(agent_id=agent_id, message=msg, role="user")

# Bad: creates new agent every time, no memory accumulation
agent = client.create_agent(...)
client.send_message(agent_id=agent.id, message=msg, role="user")

Memory Maintenance

For long-lived production agents, periodically consolidate old archival memories into core memory so that key facts stay in context and archival search results stay relevant:

# Search and summarize old memories
old_memories = client.get_archival_memory(
    agent_id=agent_id,
    query="",  # all memories
    limit=100,
)

# Summarize old memories into the agent's human block
if len(old_memories) > 50:
    client.send_message(
        agent_id=agent_id,
        message="Please summarize the key facts about me stored in your archival memory and update your core memory.",
        role="system",
    )

Monitoring and Observability

# List all agents and their stats
agents = client.list_agents()
for agent in agents:
    print(f"{agent.name}: {agent.id}")

# Check recent message volume (proxy for activity)
messages = client.get_messages(agent_id=agent_id, limit=100)
print(f"Agent has {len(messages)} recent messages")

# Health check endpoint
import httpx
health = httpx.get("http://localhost:8283/v1/health")
print(health.json())  # {"status": "ok"}

Frequently Asked Questions

Can I run Letta without OpenAI?

Yes. Configure Letta to use any OpenAI-compatible API endpoint:

letta configure
# Select: openai_chat_completions (compatible)
# Set endpoint: http://localhost:11434/v1  (Ollama example)
# Set model: llama3.2

Local models work, but they need substantial RAM, and weaker models follow Letta's memory-editing function calls less reliably, which degrades memory management.

How do I back up agent memories?

Back up the PostgreSQL database (or, for the default SQLite backend, the file at ~/.letta/sqlite.db). All agent state, memories, and tools are stored there. For PostgreSQL:

pg_dump -U letta letta > backup.sql

# Restore into an empty database
psql -U letta -d letta < backup.sql

What’s the maximum number of agents I can run?

There is no hard limit; memory and compute are the practical constraints. An agent's context window is loaded only when it receives a message, so inactive agents cost only storage. A 4 GB server can typically handle hundreds of concurrently active agents and thousands of total agents.

Can I use Letta with Claude instead of GPT-4o?

Yes:

llm_config = LLMConfig(
    model="claude-sonnet-4-6",
    model_endpoint_type="anthropic",
    model_endpoint="https://api.anthropic.com",
    context_window=200000,
)

Set ANTHROPIC_API_KEY in your environment.

How do I handle agent versioning?

Letta doesn’t have built-in versioning. Best practice: use agent name conventions (e.g., user_{id}_v2) and migrate by creating a new agent and copying key memories via archival_memory_insert.
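The naming convention can be enforced with a small helper. This is an illustrative sketch; next_agent_version is a hypothetical function, not part of Letta's API:

```python
import re

def next_agent_version(names: list[str], user_id: str) -> str:
    """Return the next versioned agent name for a user.

    Follows the user_{id}_v{N} convention from above; a bare
    user_{id} with no suffix counts as version 1.
    """
    base = f"user_{user_id}"
    pattern = re.compile(rf"^{re.escape(base)}(?:_v(\d+))?$")
    versions = [int(m.group(1) or 1) for n in names if (m := pattern.match(n))]
    if not versions:
        return base  # no agent for this user yet; start unversioned
    return f"{base}_v{max(versions) + 1}"

# Example against a list of existing agent names
existing = ["user_42", "user_42_v2", "user_7"]
print(next_agent_version(existing, "42"))  # → user_42_v3
```

After creating the new agent under the returned name, copy over the memories worth keeping via archival inserts, then retire the old agent.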
