If you’ve ever built a chatbot that forgets everything after each response, you’ve already felt the gap this guide addresses. Memory is what separates a stateless text-completion tool from an agent that reasons, learns, and acts with context. This guide walks through the architecture, the implementation patterns, and the production trade-offs you’ll face as you build agents that actually remember.
Why Memory Transforms Stateless LLMs into Useful Agents
Every LLM call is, by default, stateless — the model has no awareness of previous interactions unless you explicitly pass that context in the prompt. For simple Q&A, that’s fine. For agents that manage multi-step tasks, work across sessions, or learn user preferences, statefulness is non-negotiable.
Think of the difference between a calculator and a financial advisor. The calculator has no memory: ask it the same question twice and it gives the same answer. The advisor remembers your portfolio, your risk tolerance, last quarter’s decisions — and uses all of that to give relevant advice today.
AI agent memory systems solve this by storing, retrieving, and injecting relevant context at the right time. The two primary tiers are:
- Short-term memory (in-context memory): information available within the current conversation or task window
- Long-term memory (external memory): information persisted across sessions, retrieved via search or lookup
Getting these two tiers working together is the core skill this guide teaches.
Understanding the Memory Architecture
Before writing code, let’s map the full flow. When an agent receives a user message, it must decide what context to pull from which memory tier, combine it with the current input, and generate a grounded response.
flowchart TD
U([User Message]) --> A[Agent Orchestrator]
A --> STM[Short-Term Memory\nRolling Buffer]
A --> LTM[Long-Term Memory\nVector Store]
STM --> CTX[Context Builder]
LTM --> CTX
CTX --> LLM[LLM Call]
LLM --> R([Response])
R --> STM
R --> WH{Write to\nLong-Term?}
WH -->|Yes| LTM
WH -->|No| END([Done])
The orchestrator is the key piece: it queries both memory tiers, assembles a context window, and decides what to persist after each turn. You’ll implement this pattern step by step below.
Setting Up Your Memory Stack
You need Python 3.10+, an OpenAI API key, and a local vector database. This guide uses ChromaDB for long-term storage because it runs entirely in-process — no Docker, no extra services.
pip install openai chromadb tiktoken
Set your API key:
export OPENAI_API_KEY="sk-..."
Create the project structure:
mkdir agent_memory && cd agent_memory
touch memory.py agent.py main.py
Implementing Short-Term Memory
Short-term memory is the rolling window of recent messages passed directly in the prompt. Its job is to give the LLM the immediate conversational context it needs. The constraint is the context window limit — you can’t pass unlimited history, so you need a strategy.
The simplest strategy is a fixed-length rolling buffer that keeps the last N exchanges. A more sophisticated version uses token counting to keep as much history as fits without exceeding the model’s limit.
# memory.py
from collections import deque

import tiktoken


class ShortTermMemory:
    """Rolling buffer of recent conversation turns with token budget enforcement."""

    def __init__(self, max_tokens: int = 2000, model: str = "gpt-4o-mini"):
        self.turns: deque[dict] = deque()
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)

    def _count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})
        self._evict_if_needed()

    def _evict_if_needed(self) -> None:
        # Drop the oldest turns until the buffer fits the token budget
        while self._total_tokens() > self.max_tokens and self.turns:
            self.turns.popleft()

    def _total_tokens(self) -> int:
        return sum(self._count_tokens(t["content"]) for t in self.turns)

    def get_messages(self) -> list[dict]:
        return list(self.turns)

    def clear(self) -> None:
        self.turns.clear()
Key design choices here:
- deque gives O(1) pops from the front, which is the eviction path
- Token counting uses the same tokenizer as the model, so the budget closely tracks what the API counts (the small per-message formatting overhead is not included)
- get_messages() returns the list of role/content dicts that the OpenAI messages API expects
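For comparison, the simpler fixed-length strategy mentioned earlier can be sketched in a few lines. This is a minimal illustration, not part of the article's memory.py: the class name FixedWindowMemory and the max_exchanges parameter are hypothetical, and it counts messages rather than tokens, so a few very long messages could still overflow the context window.

```python
from collections import deque


class FixedWindowMemory:
    """Keeps only the last `max_exchanges` user/assistant exchanges."""

    def __init__(self, max_exchanges: int = 5):
        # One exchange = one user message + one assistant message,
        # so the deque holds at most 2 * max_exchanges entries.
        self.turns: deque[dict] = deque(maxlen=2 * max_exchanges)

    def add(self, role: str, content: str) -> None:
        # When the deque is full, appending silently drops the oldest entry.
        self.turns.append({"role": role, "content": content})

    def get_messages(self) -> list[dict]:
        return list(self.turns)


buffer = FixedWindowMemory(max_exchanges=2)
for i in range(3):
    buffer.add("user", f"question {i}")
    buffer.add("assistant", f"answer {i}")

# The first exchange (question 0 / answer 0) has been evicted.
print([m["content"] for m in buffer.get_messages()])
```

The trade-off is predictability versus utilization: a fixed window is trivial to reason about, while the token-budgeted buffer keeps as much history as the budget allows.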
Implementing Long-Term Memory with Vector Storage
Long-term memory stores facts, preferences, and summaries that should survive across sessions. Keyword search is unreliable here because a later question rarely repeats the wording of the stored fact, so retrieval works on semantic similarity and the store is backed by a vector database.
The workflow is:
- At the end of a session (or after important turns), embed the content and store it
- At the start of each new turn, query the vector store with the current user message
- Inject the top-K results as additional context before the LLM call
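# memory.py (continued)
import hashlib

import chromadb
from openai import OpenAI


class LongTermMemory:
    """Semantic vector store using ChromaDB for persistent cross-session memory."""

    def __init__(self, collection_name: str = "agent_memory", persist_dir: str = "./chroma_store"):
        # PersistentClient writes to disk so memories survive restarts;
        # chromadb.Client() keeps everything in memory only.
        # The default persist_dir is an illustrative choice.
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            # "hnsw:space" selects the distance metric used by the index
            metadata={"hnsw:space": "cosine"}
        )
        self.openai = OpenAI()

    def _embed(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            input=text,
            model="text-embedding-3-small"
        )
        return response.data[0].embedding

    def store(self, text: str, metadata: dict | None = None) -> None:
        embedding = self._embed(text)
        # Stable content hash: the built-in hash() is salted per process,
        # so it would produce different IDs on every run.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        doc_id = f"mem_{digest}"
        self.collection.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[text],
            # Some ChromaDB versions reject empty metadata dicts, so pass None instead
            metadatas=[metadata] if metadata else None
        )

    def retrieve(self, query: str, n_results: int = 3) -> list[str]:
        embedding = self._embed(query)
        results = self.collection.query(
            query_embeddings=[embedding],
            n_results=n_results
        )
        return results["documents"][0] if results["documents"] else []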
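The upsert call is intentional: if the same text is stored twice under the same ID, the second write overwrites the first instead of creating a duplicate. This only catches exact repeats, and only when the ID is derived deterministically from the content. Python’s built-in hash() is salted per process for strings, so IDs built from it do not match across runs; a stable digest such as hashlib.sha256 avoids that. Near-duplicate facts worded differently still need a separate dedup pass.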
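One way to make content-based IDs a little more forgiving is to normalize the text before hashing. The sketch below is an assumption-laden illustration: memory_id is a hypothetical helper (not a ChromaDB function), and the normalization rules (lowercasing, collapsing whitespace) are just one reasonable choice.

```python
import hashlib
import re


def memory_id(text: str) -> str:
    """Derive a stable document ID from normalized memory text."""
    # Lowercase and collapse whitespace so trivial variants share one ID.
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"mem_{digest[:16]}"


a = memory_id("Alex prefers Python over JavaScript.")
b = memory_id("  alex prefers   Python over JavaScript. ")
c = memory_id("Alex prefers Rust over Go.")
print(a == b, a == c)  # True False
```

Because the digest depends only on the text, the same fact maps to the same ID in every process, which is what lets upsert act as a dedup step across sessions.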
For a deeper look at how vector retrieval works inside agent frameworks, see LlamaIndex vs LangChain for RAG: Which Framework to Choose? — both frameworks offer higher-level memory abstractions built on exactly this pattern.
Building the Agent Orchestrator
Now wire both memory tiers into a working agent loop:
# agent.py
from openai import OpenAI

from memory import ShortTermMemory, LongTermMemory

SYSTEM_PROMPT = """You are a helpful assistant with access to both recent conversation
history and relevant long-term memories. Use this context to give consistent,
personalized responses. If long-term memories are provided, treat them as established
facts about the user or prior sessions."""


class MemoryAgent:
    def __init__(self):
        self.llm = OpenAI()
        self.short_term = ShortTermMemory(max_tokens=2000)
        self.long_term = LongTermMemory()

    def _build_context_prompt(self, memories: list[str]) -> str:
        if not memories:
            return ""
        joined = "\n".join(f"- {m}" for m in memories)
        return f"\n\n[Relevant memories from past sessions]\n{joined}\n"

    def chat(self, user_message: str) -> str:
        # 1. Retrieve relevant long-term memories
        memories = self.long_term.retrieve(user_message, n_results=3)
        memory_context = self._build_context_prompt(memories)

        # 2. Build system prompt with injected memory context
        system = SYSTEM_PROMPT + memory_context

        # 3. Assemble messages: system + short-term history + new user turn
        messages = [{"role": "system", "content": system}]
        messages += self.short_term.get_messages()
        messages.append({"role": "user", "content": user_message})

        # 4. Call the LLM
        response = self.llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        assistant_reply = response.choices[0].message.content

        # 5. Update short-term memory
        self.short_term.add("user", user_message)
        self.short_term.add("assistant", assistant_reply)
        return assistant_reply

    def remember(self, fact: str, metadata: dict | None = None) -> None:
        """Explicitly persist a fact to long-term memory."""
        self.long_term.store(fact, metadata)

    def end_session_summary(self, summary: str) -> None:
        """Store a summary of the session in long-term memory and clear short-term."""
        self.long_term.store(summary, {"type": "session_summary"})
        self.short_term.clear()
And the entry point to test it:
# main.py
from agent import MemoryAgent

agent = MemoryAgent()

# Seed some long-term memories (simulating a returning user)
agent.remember("User's name is Alex and they prefer Python over JavaScript.")
agent.remember("Alex is building a multi-agent pipeline for document summarization.")
agent.remember("Alex encountered ChromaDB permission issues on Windows in March 2026.")

print("Agent ready. Type 'quit' to exit.\n")

asked: list[str] = []  # user messages from this session, used for the summary

while True:
    user_input = input("You: ").strip()
    if not user_input:
        continue  # ignore empty input
    if user_input.lower() in ("quit", "exit"):
        # Save a session summary before exiting (only if something was asked)
        if asked:
            agent.end_session_summary(
                "Session ended. Alex asked about: " + "; ".join(asked)
            )
            print("Session summary saved.")
        print("Goodbye!")
        break
    asked.append(user_input)
    response = agent.chat(user_input)
    print(f"Agent: {response}\n")
Run it:
python main.py
Try asking: “What are you helping me build?” — the agent should correctly surface the document summarization project from long-term memory without you mentioning it.
Production Patterns and Pitfalls
Moving this pattern to production surfaces several issues worth planning for:
Memory decay and relevance scoring. Not all memories age equally. Add a timestamp field to every long-term memory and weight retrieval scores against recency. A user preference from two years ago may be stale.
Memory write strategy. Storing every turn is expensive and noisy. Use one of three strategies:
- Explicit writes: only store when the agent identifies a declarative fact (“I prefer X”, “I work at Y”)
- Summarization writes: at session end, summarize with an LLM call and store the summary
- Importance scoring: use a secondary LLM call to score each turn; only persist above a threshold
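The explicit-write strategy can be approximated with a cheap pattern check before reaching for an LLM. The sketch below is deliberately naive: should_persist and the pattern list are hypothetical, cover only a handful of English phrasings, and would miss many real declarative statements.

```python
import re

# Illustrative patterns only; a real system would need a broader set
# or an LLM-based classifier.
FACT_PATTERNS = [
    re.compile(r"\bmy name is\b", re.IGNORECASE),
    re.compile(r"\bi (?:prefer|work at|work for|live in|use)\b", re.IGNORECASE),
    re.compile(r"\bi(?: am|'m) (?:building|working on)\b", re.IGNORECASE),
]


def should_persist(user_message: str) -> bool:
    """Return True if the message looks like a declarative fact about the user."""
    return any(p.search(user_message) for p in FACT_PATTERNS)


turns = [
    "Hi there!",
    "I prefer Python over JavaScript.",
    "Can you explain decorators?",
    "I'm building a document summarization pipeline.",
]
to_store = [t for t in turns if should_persist(t)]
print(to_store)
```

A filter like this could sit in front of agent.remember() so that greetings and questions never reach the vector store; the importance-scoring strategy would replace the regex check with a scored LLM call.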
Namespace isolation. In multi-user systems, prefix every collection or document ID with a user ID. Never let two users’ memories bleed into each other’s context.
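One way to derive a per-user collection name is to hash the user ID, which sidesteps any restrictions the vector store places on name length or characters. The helper below is a hypothetical sketch; collection_name_for and the "mem" prefix are assumptions, not ChromaDB APIs.

```python
import hashlib


def collection_name_for(user_id: str, prefix: str = "mem") -> str:
    """Map an arbitrary user ID to a short collection name of letters, digits, and a dash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:32]
    return f"{prefix}-{digest}"


alice = collection_name_for("alice@example.com")
bob = collection_name_for("bob@example.com")
print(alice != bob, len(alice))  # True 36
```

The result can be passed straight to LongTermMemory(collection_name=...), so each user's agent instance only ever opens its own collection.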
Token budget accounting. The total injected tokens (short-term buffer + retrieved memories + system prompt) must fit within the model’s context limit. Keep a running token counter before each LLM call and trim retrieved memories first if you’re over budget.
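The "trim retrieved memories first" rule can be sketched as a small budgeting function. Everything here is illustrative: fit_memories, reply_reserve, and the whitespace-based rough_token_count are assumptions made to keep the example self-contained; in practice the counter would be a real tokenizer such as the tiktoken encoder used in ShortTermMemory.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    # Stand-in counter (whitespace-separated words); swap in a real tokenizer.
    return len(text.split())


def fit_memories(
    memories: list[str],
    fixed_tokens: int,
    context_limit: int,
    reply_reserve: int = 500,
    count: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Keep the highest-ranked memories that fit in the remaining token budget."""
    # fixed_tokens = system prompt + short-term buffer + new user message
    budget = context_limit - reply_reserve - fixed_tokens
    kept: list[str] = []
    for memory in memories:  # assumed ordered most relevant first
        cost = count(memory)
        if cost > budget:
            break
        kept.append(memory)
        budget -= cost
    return kept


memories = ["one two three", "four five", "six seven eight nine"]
# Budget is 100 - 5 - 90 = 5 tokens, so only the first two memories fit.
print(fit_memories(memories, fixed_tokens=90, context_limit=100, reply_reserve=5))
```

Reserving space for the reply matters because the context limit covers both the prompt and the generated output.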
If you’re building a multi-agent system where memory must be shared across agents, the coordination layer matters as much as the storage layer. See What Is Paperclip? The AI Agent Team Orchestration Platform for a production-grade approach to shared state across agent teams.
For a low-code implementation of the same memory architecture, Build an AI Chatbot with Memory in n8n shows how to wire up the same short-term and long-term tiers using a visual workflow builder.
Frequently Asked Questions
What is the difference between short-term and long-term memory in AI agents?
Short-term memory is the conversation history passed directly inside the prompt. It is in-context: the oldest turns drop out once the token budget fills, and everything is gone when the session ends. Long-term memory is stored externally (typically in a vector database or key-value store) and persists across sessions. It is retrieved semantically at query time and injected into the prompt as needed. The two tiers play different roles: short-term memory gives conversational coherence; long-term memory gives cross-session continuity.
How do I decide what to store in long-term memory?
Store information that would be useful to recall in a future session that the agent couldn’t infer from a fresh conversation. This typically includes: user preferences, stated goals, prior decisions, established facts about the domain, and session summaries. Avoid storing low-value turns like greetings or simple confirmations — they inflate retrieval noise without improving response quality.
What happens when the vector store returns irrelevant results?
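Semantic search is probabilistic and can return false positives. Guard against this with a minimum similarity threshold: discard any retrieved document whose cosine similarity to the query is below roughly 0.75. ChromaDB returns distances rather than similarities (lower = more similar); with the cosine metric, distance is 1 minus similarity, so a 0.75 similarity floor corresponds to a distance ceiling of about 0.25. Treat 0.75 as a starting point and calibrate it against your specific embedding model and domain vocabulary.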
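A threshold filter of this kind could be as small as the sketch below. It assumes the collection uses cosine distance and that documents and distances are parallel lists, as in the first entries of results["documents"] and results["distances"] from a ChromaDB query; filter_by_similarity itself is a hypothetical helper.

```python
def filter_by_similarity(
    documents: list[str],
    distances: list[float],
    min_similarity: float = 0.75,
) -> list[str]:
    """Drop results whose cosine similarity falls below `min_similarity`."""
    # Assumes cosine distance, where distance = 1 - similarity.
    max_distance = 1.0 - min_similarity
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


docs = ["Alex prefers Python", "Alex likes hiking", "Weather was rainy"]
dists = [0.12, 0.25, 0.61]
print(filter_by_similarity(docs, dists))
```

Returning an empty list when nothing clears the bar is the desired behavior: the agent then answers from short-term context alone instead of being steered by an unrelated memory.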
Can I use this memory pattern with frameworks like LangChain or LlamaIndex?
Yes. LangChain provides ConversationBufferWindowMemory (short-term) and VectorStoreRetrieverMemory (long-term) as drop-in components. LlamaIndex has a ChatMemoryBuffer and VectorMemory that implement the same tiers. The underlying pattern described in this guide is identical — the frameworks just abstract the token counting, embedding, and retrieval calls. Building it from scratch first (as done here) gives you the mental model to debug and extend those abstractions confidently.
How do I handle memory for multi-user production deployments?
Namespace strictly by user ID. In ChromaDB, the cleanest option is one collection per user, so that no query can reach another user’s data. If a shared collection is unavoidable, store the user ID in each document’s metadata, prefix document IDs with it, and apply a metadata filter on every query. For short-term memory, instantiate a separate ShortTermMemory object per session and store it server-side keyed to a session token; do not rely on client-side state. Apply the same isolation rules to any in-memory caches that wrap your vector store client.
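The server-side, token-keyed storage described above can be sketched as a small registry. SessionStore is a hypothetical class written for illustration; it keeps state in a plain dict, so it only suits a single-process server, and a real deployment would likely add expiry and an external store.

```python
import secrets
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class SessionStore(Generic[T]):
    """Server-side registry mapping opaque session tokens to per-session state."""

    def __init__(self, factory: Callable[[], T]):
        self._factory = factory
        self._sessions: dict[str, T] = {}

    def create(self) -> str:
        # Unguessable token handed to the client; the state itself stays on the server.
        token = secrets.token_urlsafe(32)
        self._sessions[token] = self._factory()
        return token

    def get(self, token: str) -> T:
        # Raises KeyError for unknown tokens rather than creating state implicitly.
        return self._sessions[token]

    def end(self, token: str) -> None:
        self._sessions.pop(token, None)


# `list` stands in for the ShortTermMemory class to keep the sketch self-contained.
store: SessionStore[list] = SessionStore(factory=list)
alice_token = store.create()
bob_token = store.create()
store.get(alice_token).append("hello from Alice")
print(store.get(alice_token), store.get(bob_token))
```

Passing factory=ShortTermMemory instead of list would give every session its own buffer, so one user's history can never leak into another's prompt.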