
Agent Communication and State Management in Multi-Agent Systems

#communication #state-management #shared-memory #observability #tracing #langfuse #opentelemetry

The Communication Problem in Multi-Agent Systems

A single agent is impressive. It reasons, plans, calls tools, and produces output. But a single agent is ultimately bounded by its context window, its tool set, and its compute budget. Multi-agent systems (MAS) exist to transcend those limits — but only if the agents can actually communicate.

Without communication, a collection of agents is not a system. It is a set of parallel chatbots, each working in isolation, duplicating effort, contradicting each other, and producing results that cannot be synthesized into a coherent whole. Communication is the infrastructure that transforms a collection into a system.

Three specific problems must be solved before communication can be called effective:

Sharing findings. Agent A discovers that a particular API rate-limits at 100 requests per minute. Agent B, working in parallel, needs to know this before it hits the same wall. Without a mechanism to share this finding, B will discover it again the hard way — burning time, tokens, and potentially quota.

Avoiding duplicate work. In a research pipeline with five agents crawling different sources, it is almost certain that two agents will encounter the same document. Without a shared registry of “already visited” resources, the system will waste significant capacity on redundant processing.

Coordinating on shared state. When multiple agents contribute to the same output — a codebase, a report, a database — they must agree on who owns what, who has the latest version, and how to merge conflicting contributions. This is the hardest problem, and it is the one most systems get wrong first.

This article covers the four primary communication patterns used in production MAS, a generalizable design principle for separating capability from knowledge, the core challenges of state management, and the observability tools that make these systems debuggable at scale.


Four Communication Patterns

No single communication pattern is optimal for all situations. The right choice depends on coupling requirements, consistency needs, and scale targets. Here are the four patterns that cover the vast majority of production MAS designs.

1. Direct Message Passing

In direct message passing, Agent A explicitly sends a structured message to Agent B. The message contains the input B needs, the context A wants to share, and optionally a return address for B’s response. This can be synchronous (A blocks until B replies) or asynchronous (A fires and continues, B replies later via a callback or queue).

Direct message passing is the most explicit pattern. Every communication is an intentional act, visible in logs and easy to trace. When something goes wrong, the causal chain is readable: “A sent this to B, B returned that.”

The trade-off is coupling. Agent A must know B’s address and interface. If B’s input schema changes, A must be updated. If B is replaced with a different implementation, A needs to know. In a two-agent system this is manageable. In a twenty-agent system it becomes a maintenance burden.

Direct message passing works best for fixed, known agent topologies — a supervisor agent dispatching tasks to a fixed pool of worker agents, or a pipeline where each stage has a single upstream and a single downstream.
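The mechanics above can be sketched in a few lines. This is a minimal synchronous sketch, assuming a hypothetical Message schema, Agent class, and registry — none of it is any particular framework's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    sender: str                      # who sent it
    recipient: str                   # who should handle it
    payload: dict                    # the input the recipient needs
    reply_to: Optional[str] = None   # optional return address

class Agent:
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler       # the agent's reasoning step, stubbed here

    def send(self, registry, msg):
        # Synchronous variant: the sender blocks until the recipient replies.
        recipient = registry[msg.recipient]
        result = recipient.handler(msg.payload)
        return Message(sender=recipient.name, recipient=msg.sender, payload=result)

# A supervisor dispatching to a fixed, known worker: the coupling is explicit.
worker = Agent("researcher", lambda p: {"summary": f"notes on {p['topic']}"})
supervisor = Agent("supervisor", lambda p: p)
registry = {"researcher": worker, "supervisor": supervisor}

reply = supervisor.send(registry, Message("supervisor", "researcher",
                                          {"topic": "rate limits"}))
print(reply.payload["summary"])  # → notes on rate limits
```

Note how the supervisor must name "researcher" explicitly — the coupling that makes the pattern traceable is visible right in the call site.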

2. Shared Memory Store

In shared memory, agents do not communicate with each other directly. Instead, they read from and write to a common knowledge base — a vector database, a SQL database, a document store, or a file system directory. Agent A writes its findings. Agent B, at some later point, queries the store and retrieves what it needs.

This pattern achieves loose coupling. Agent A does not need to know that Agent B exists. B does not need to know that A wrote the finding. They are connected only through the store, which means agents can be added, removed, or replaced without updating every other agent in the system.

The problem is consistency. If A and B both attempt to update the same record simultaneously, the result depends on the implementation: one write may silently overwrite the other, or both writes may succeed and produce a corrupted merged state. For read-heavy workloads — research pipelines, knowledge accumulation — this risk is low. For write-heavy workloads where multiple agents modify the same artifact, consistency must be explicitly managed.

Agent A ──write──▶ Vector DB ◀──query── Agent C
Agent B ──write──▶           ◀──query── Agent D

The shared vector DB becomes the system’s long-term memory. Agents query it semantically, retrieving chunks relevant to their current task without needing to know which agent wrote those chunks or when.
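A dict-backed sketch of this pattern, using word overlap as a toy stand-in for the embedding search a real vector database would provide. The store interface and agent names are illustrative:

```python
class SharedStore:
    def __init__(self):
        self.records = []  # each record: {"agent": ..., "text": ...}

    def write(self, agent, text):
        self.records.append({"agent": agent, "text": text})

    def query(self, question, top_k=2):
        # Rank records by word overlap with the question (toy relevance score);
        # a real store would rank by embedding similarity instead.
        q = set(question.lower().split())
        scored = sorted(
            self.records,
            key=lambda r: len(q & set(r["text"].lower().split())),
            reverse=True,
        )
        return [r["text"] for r in scored[:top_k]]

store = SharedStore()
store.write("agent_a", "The payments API rate-limits at 100 requests per minute")
store.write("agent_b", "Auth tokens expire after 24 hours")

# Agent C retrieves relevant findings without knowing who wrote them or when.
hits = store.query("what are the API rate limits?", top_k=1)
print(hits[0])
```

Agent C never references agent_a — the store is the only coupling point, which is exactly what lets agents be added or replaced freely.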

3. Event Bus (Publish/Subscribe)

In the event bus pattern, agents do not communicate with other agents at all — they communicate with a bus. An agent publishes an event (“research complete for topic X”, “test suite failed on file Y”, “document saved to path Z”) and moves on. Other agents that have subscribed to that event type receive it and react accordingly.

This is the most decoupled pattern. No agent needs to know which other agents exist. New agents can join the system by subscribing to relevant event types, without any change to existing agents. The event bus itself handles routing, buffering, and delivery.

The trade-off is traceability. In message passing, you can follow a chain: A called B, B called C. In an event-driven system, causality is implicit in the event sequence, and reconstructing it requires a log of all events with timestamps. Without good observability tooling, debugging event-driven MAS feels like reading tea leaves.

Event bus patterns work best for large agent networks where the set of agents is dynamic, for event-driven workflows where agents react to external triggers, and for systems where decoupling is more important than visibility.
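A minimal in-process sketch of the pattern. In production the bus would be Redis pub/sub, Kafka, or similar; the topic names and handler signatures here are assumptions:

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)
        self.log = []  # timestamped event log is what makes causality reconstructable

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append((event_type, payload))
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []

# The writer agent subscribes without knowing who will publish.
bus.subscribe("research.complete", lambda p: received.append(p["topic"]))

# The researcher publishes without knowing who is listening.
bus.publish("research.complete", {"topic": "rate limits"})
print(received)  # → ['rate limits']
```

The `log` list is the traceability concession: without it, nothing in the system records that the publish caused the subscriber to fire.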

4. Shared File System

The shared file system is the simplest pattern that works across separate processes and separate machines. Agents write their outputs to files in a shared directory. Other agents poll that directory or watch for file changes. When a file appears, the watching agent reads it and processes it.

This pattern requires no special infrastructure beyond a file system — local, NFS, or object storage like S3 all work. It trivially crosses machine boundaries, which makes it the natural choice for distributed agents running on different hardware. An agent on machine A writes a result file. An agent on machine B picks it up seconds later.

The limitations are equally obvious: polling is inefficient, there is no built-in notification mechanism, and large files create transfer bottlenecks. Shared file systems are best for situations where simplicity and cross-machine compatibility matter more than latency.
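A sketch of file-based handoff, assuming a shared directory and a hypothetical filename convention. Writing to a temporary name and renaming means a polling reader never sees a partially written file:

```python
import json
import tempfile
from pathlib import Path

def write_result(outbox: Path, task_id: str, result: dict):
    # Write to a temp name first, then rename: rename is atomic on a single
    # filesystem, so readers see either no file or a complete one.
    tmp = outbox / f".{task_id}.tmp"
    tmp.write_text(json.dumps(result))
    tmp.rename(outbox / f"{task_id}.json")

def poll_results(outbox: Path):
    # A consuming agent polls for completed files and processes each once.
    for path in sorted(outbox.glob("*.json")):
        yield json.loads(path.read_text())
        path.unlink()  # mark as consumed

outbox = Path(tempfile.mkdtemp())  # stand-in for an NFS mount or S3 prefix
write_result(outbox, "task-001", {"status": "done", "summary": "found 3 sources"})
results = list(poll_results(outbox))
print(results[0]["status"])  # → done
```

The write-then-rename trick is the one piece of real engineering this pattern requires; everything else is plain file I/O.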

Pattern Comparison

| Pattern | Coupling | Consistency | Scalability | Best For |
|---|---|---|---|---|
| Direct Message Passing | Tight | Strong | Limited | Fixed topologies, supervisor-worker |
| Shared Memory Store | Loose | Moderate | High | Knowledge accumulation, retrieval pipelines |
| Event Bus | None | Weak | Very High | Dynamic networks, event-driven workflows |
| Shared File System | None | Weak | Moderate | Distributed agents, cross-machine handoff |

Tools vs Skills: The OpenClaw Model

One of the most instructive real-world MAS design patterns comes from OpenClaw, an autonomous multi-agent framework that makes a strict architectural separation between what an agent can physically do and what an agent knows how to do. This separation — Tools versus Skills — is a generalizable principle that applies far beyond any single framework.

Why Separate Capability from Knowledge?

The common mistake in early MAS design is bundling system access and procedural knowledge into a single agent configuration. The agent is given both the instructions for completing a task and the system permissions required to carry it out, packaged together.

The problem surfaces when an agent needs to be shared across contexts. A research agent that can read and write files is useful for document analysis. The same agent given access to a production database is dangerous. If knowledge and capability are bundled, the only way to restrict the agent in the database context is to create an entirely new agent configuration — duplicating all the knowledge and changing only the permissions. This does not scale.

There is also a security dimension. If an agent’s instructions are somehow leaked or manipulated (prompt injection, a compromised upstream agent), and those instructions are bundled with system-level permissions, the blast radius of the compromise is maximized. Separating knowledge from capability limits what an attacker can achieve with a compromised instruction set.

The Two-Tier Tool Model

OpenClaw defines tools in two tiers:

Core Capabilities (8 tools): These are the agent’s basic interaction rights with its environment. They include read, write, edit, apply_patch, exec, process, web_search, and web_fetch. These tools represent what the agent can physically do at the system level — read a file, write a file, execute a command, search the web. An agent without these tools can reason and plan but cannot act. It is a brain without limbs.

Advanced Capabilities (18 tools): These include browser automation, image processing, node orchestration, sessions_spawn, subagents, cron scheduling, and the lobster workflow runtime. These tools transform the agent from a passive task executor into an active orchestrator. With sessions_spawn, the agent can create new agent instances. With subagents, it can delegate subtasks. With cron, it can schedule its own future actions. An agent with advanced capabilities is qualitatively different from one with only core capabilities — it can modify the agent network itself.

The significance of this tiering is that permissions can be granted at a fine-grained level. An agent can be given read access without write access, making it safe to deploy in read-only audit contexts. An agent can be given web_search without exec, letting it research but not execute code.

Skills as Textbooks

If Tools define what an agent can do, Skills define how to do it. In OpenClaw, a Skill is a markdown file (SKILL.md) containing step-by-step instructions for combining tools to accomplish a specific workflow. OpenClaw ships with 53 official bundled Skills covering domains like Obsidian notes management, GitHub repository operations, Google Workspace (Docs, Sheets, Gmail), email, and calendar.

A Skill is not code. It does not grant permissions. It is procedural knowledge — a textbook chapter that the agent reads before starting a task. The agent knows the workflow. The Tools determine whether it can execute each step.

The critical insight is what happens when you combine them asymmetrically. Give an agent the Obsidian notes Skill (which teaches it how to create, link, and organize notes in Obsidian’s vault format) but deny it the write tool. The agent now knows exactly what it should write and where — but it cannot write anything. It can plan the perfect note structure, generate the perfect content, and describe every action it would take. Then it must stop, because it lacks the permission to act.

This is not a failure mode. It is a deliberate security feature. You can deploy the same agent in a review mode (read-only) and an action mode (read-write) simply by toggling tool permissions. The knowledge does not change. The instructions do not change. Only the capability boundary changes.

Applying the Principle Beyond OpenClaw

The Tools/Skills pattern is not proprietary to OpenClaw. It is a generalizable MAS design principle that can be applied to any agent system.

In your own MAS, “Tools” are the APIs, system calls, and external services the agent has authenticated access to. “Skills” are the prompts, instructions, and workflow documents the agent receives as context.

Apply the principle: define action rights separately from workflow knowledge. Grant agents the minimum tool set required for their specific role. Attach Skills (workflow instructions) as needed without expanding permissions. The result is fine-grained security control and reusable knowledge modules that can be shared across agents with different permission profiles.
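The asymmetric combination described above reduces to a permission check that is independent of the skill text. The Agent class, tool names, and skill format below are illustrative, not OpenClaw's actual implementation:

```python
class Agent:
    def __init__(self, granted_tools: set, skills: list):
        self.granted_tools = granted_tools   # capability: what it may do
        self.skills = skills                 # knowledge: how to do it

    def execute_step(self, tool: str, action: str) -> str:
        if tool not in self.granted_tools:
            # The agent knows exactly what to do but lacks permission to act.
            return f"PLANNED (blocked: no '{tool}' permission): {action}"
        return f"EXECUTED via {tool}: {action}"

# The same skill text is reused across both permission profiles.
obsidian_skill = "SKILL.md: 1) read existing notes 2) write a new linked note"

# Review mode: read-only grant. Action mode would add "write" and nothing else.
reviewer = Agent(granted_tools={"read"}, skills=[obsidian_skill])
print(reviewer.execute_step("read", "scan vault for related notes"))
print(reviewer.execute_step("write", "create note 'MAS patterns.md'"))
```

Toggling between review mode and action mode changes only the `granted_tools` set; the skill, and therefore the agent's behavior plan, is untouched.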


State Management: Keeping Agents in Sync

Communication solves the routing problem — how information gets from one agent to another. State management solves the coherence problem — ensuring that all agents share a consistent, recoverable view of the task’s progress.

Challenge 1: Consistency

The consistency problem arises when two agents simultaneously update the same shared artifact. Consider a code review system: Agent A (Writer) is updating a function, Agent B (Reviewer) is reading the same function to analyze it. If A writes while B is reading, B may see a partially updated state — neither the old version nor the new version, but an incoherent intermediate.

Several strategies address this:

Optimistic locking lets agents read and update freely, but each update includes the version number the agent read. If the version in the store has changed by the time the update arrives, the update is rejected and the agent must retry with the new version.
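A sketch of the version-check mechanics; the store API is an assumption, and any backend with compare-and-set semantics works the same way:

```python
class VersionedStore:
    def __init__(self, value):
        self.value, self.version = value, 1

    def read(self):
        return self.value, self.version

    def write(self, new_value, expected_version):
        if expected_version != self.version:
            return False  # someone wrote in between; caller must re-read and retry
        self.value, self.version = new_value, self.version + 1
        return True

store = VersionedStore("draft v1")
_, v = store.read()           # Agent A reads at version 1

store.write("draft v2 (B)", v)  # Agent B writes first; version is now 2

assert not store.write("draft v2 (A)", v)  # A's stale write is rejected
_, v = store.read()                        # A re-reads at version 2...
store.write("draft v3 (A merges B)", v)    # ...and retries with B's work included
print(store.value)  # → draft v3 (A merges B)
```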

Last-write-wins is the simplest strategy: the most recent write is kept, older writes are discarded. This works when writes are independent and overwrites are acceptable. It fails when two agents are merging contributions into a shared document.

Append-only event log is the most robust pattern for multi-agent systems. Instead of agents updating a central record, they append events to an immutable log (“Agent A modified function X at time T, new content: …”). The current state is derived by replaying the log. Concurrent appends are safe because they do not conflict — two appends to the end of a log simply both land, in some order. The log is also a perfect audit trail.

In practice, most production MAS use append-only logs for their shared state, especially in compliance-sensitive domains where auditability matters.
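The append-only pattern reduces to a log plus a replay function. The event fields below are illustrative; in production the log would be a Kafka topic, an append-only table, or similar:

```python
import time

log = []  # immutable, append-only; the list stands in for a durable log

def append(agent, target, content):
    # Concurrent appends never conflict: both simply land, in some order.
    log.append({"ts": time.time(), "agent": agent,
                "target": target, "content": content})

def replay(target):
    # Current state of a target = effect of the last event that touched it.
    state = None
    for event in log:
        if event["target"] == target:
            state = event["content"]
    return state

append("agent_a", "report.md", "intro drafted")
append("agent_b", "report.md", "intro drafted + methods section")
print(replay("report.md"))  # → intro drafted + methods section
```

Nothing is ever overwritten: the full log records who changed what and when, which is the audit trail compliance-sensitive domains need.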

Challenge 2: Context Window Limits

Each agent has a finite context window. A long-running task that involves dozens of documents, hundreds of tool calls, and thousands of tokens of intermediate reasoning cannot fit entirely in any single agent’s context. This creates a fundamental tension: agents need to know about prior work to avoid duplicating it, but they cannot hold all prior work in memory.

The solution is external memory with selective retrieval. Rather than passing all prior work to each agent in its context window, the system maintains an external store — typically a vector database — where agents write their findings as they work. When a new agent starts or an existing agent needs to recall something, it queries the store with a semantic search: “What do we know about rate limits on the OpenAI API?” The store returns the relevant chunks, not the entire history.

This pattern bridges agents across time. An agent that ran an hour ago can communicate with an agent running now, without either agent needing to be aware of the other. The knowledge persists in the store; the agents are transient.

Challenge 3: Failure Recovery

In a long-running 20-step agentic workflow, the probability that at least one step will encounter an error approaches certainty. Network timeouts, API rate limits, model refusals, and hardware failures are all possible. Without failure recovery, a crash at step 15 means restarting from step 1 — re-running 14 completed steps at full cost.

The solution is checkpointing: persisting the task state after each completed step so that recovery can resume from the most recent checkpoint rather than the beginning.

LangGraph implements this natively through its state persistence layer. Each node in a LangGraph graph can save its output state to a persistent store (SQLite, Redis, or a custom backend). If the graph execution fails at any node, resuming the graph picks up from the last saved checkpoint. For long-running multi-agent pipelines, this is not a nice-to-have feature — it is a prerequisite for production reliability.
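LangGraph provides this through its own persistence API; the framework-agnostic sketch below shows the underlying idea with a hypothetical step pipeline and a JSON file as the checkpoint store:

```python
import json
import tempfile
from pathlib import Path

def run_pipeline(steps, checkpoint_file: Path):
    state = {"completed": []}
    if checkpoint_file.exists():
        # Resume: load the last checkpoint instead of starting from scratch.
        state = json.loads(checkpoint_file.read_text())
    for name, fn in steps:
        if name in state["completed"]:
            continue  # finished before the crash; skip, don't re-pay for it
        state[name] = fn(state)
        state["completed"].append(name)
        checkpoint_file.write_text(json.dumps(state))  # persist after each step
    return state

ckpt = Path(tempfile.mkdtemp()) / "state.json"
steps = [("plan", lambda s: "3 subtasks"),
         ("research", lambda s: "findings"),
         ("write", lambda s: "final report")]
result = run_pipeline(steps, ckpt)
print(result["completed"])  # → ['plan', 'research', 'write']
```

If the process dies after "research", rerunning `run_pipeline` with the same checkpoint file skips "plan" and "research" and executes only "write".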

State Pattern Reference

| Pattern | Concurrency Safety | Auditability | Recovery Support | Typical Use Case |
|---|---|---|---|---|
| Append-only event log | High | Full | Yes (replay from log) | Financial, compliance, research |
| Vector store for findings | Moderate | Partial | Partial | Research pipelines, knowledge accumulation |
| Checkpointed graph state | High | Partial | Native (LangGraph) | Multi-step agentic workflows |
| Shared file handoff | Low | Low | Manual | Distributed agents, simple pipelines |

Observability: Seeing Inside Your Multi-Agent System

A single agent’s behavior is one thread of reasoning. Even when it makes mistakes, the trace is readable: you can see what the model was prompted with, what tool calls it made, and what it returned. Debugging is inconvenient but tractable.

A multi-agent system creates a forest of concurrent threads. Five agents running simultaneously produce five interleaved traces, with causal dependencies that cross thread boundaries. Agent 3 made a decision based on output from Agent 1, which was in turn responding to a query from Agent 5. Without tooling that can surface these cross-agent dependencies, debugging is not just inconvenient — it is guesswork.

The three questions that observability must answer for MAS:

  1. Which agent made this decision, and what did it see when it made it?
  2. Where is this system spending its token budget? Which agent is responsible for 80% of the cost?
  3. When something went wrong, which agent produced the bad output, and what was its input?

Distributed Tracing with Langfuse

Langfuse has emerged as the most framework-agnostic observability backend for multi-agent systems. It works with LangGraph, CrewAI, AutoGen, and custom-built agent systems, providing a unified view across all of them.

The core concept is traces and spans. A trace represents a single end-to-end request through your system — from the user’s initial query to the final response. Within that trace, each agent operation is a span: a unit of work with a start time, end time, input, output, and metadata. Related spans are grouped into a trace, and the hierarchical relationship between spans makes the causal chain explicit.
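The trace/span model can be sketched as a small data structure. The field names approximate the concepts; the actual Langfuse and OpenTelemetry SDKs differ in detail:

```python
import time
import uuid

class Span:
    def __init__(self, trace_id, name, parent=None):
        self.trace_id = trace_id           # groups spans into one request
        self.name = name
        self.parent = parent               # parent span ID: the causal link
        self.span_id = uuid.uuid4().hex[:8]
        self.start = time.time()
        self.output = None
        self.end = None

    def finish(self, output):
        self.output, self.end = output, time.time()

trace_id = uuid.uuid4().hex  # one ID for the whole end-to-end request
root = Span(trace_id, "user_request")
planner = Span(trace_id, "planner", parent=root.span_id)
planner.finish("3 subtasks")
researcher = Span(trace_id, "researcher", parent=planner.span_id)
researcher.finish("findings")

# The parent links make the causal chain explicit and queryable.
for s in (root, planner, researcher):
    print(s.name, "parent:", s.parent)
```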

With Langfuse instrumented, you can answer questions like: “The user asked X. Agent 1 (Planner) broke it into three subtasks. Agent 2 (Researcher) handled subtask 1 and returned Y. Agent 3 (Writer) received Y and produced Z. The final answer was assembled from Z plus Agent 4’s output.” Every step is logged, timestamped, and queryable.

Langfuse also enables cost attribution. Since each span includes the token count for the underlying model call, you can see exactly how much each agent costs per request. If Agent 2 is consuming 70% of your token budget, you know where to optimize.

Framework-Specific Integration

LangGraph integrates observability through its CallbackHandler system. Adding a Langfuse callback to a LangGraph execution produces a decision tree visualization of the graph’s execution: which nodes fired, in which order, what state was passed between them. The .score_current_trace() method allows human feedback to be attached directly to a trace — when a reviewer marks a response as incorrect, that signal flows back to the specific agent that produced it.

CrewAI integrates with the OpenInference SDK, which automatically generates OpenTelemetry (OTel) spans when crew.kickoff() is called. No manual instrumentation of individual agents is required — the CrewAI runtime handles span creation for each agent’s turn, each tool call, and each inter-agent delegation. These spans can be exported to any OTel-compatible backend, including Langfuse.

AutoGen integrates with OpenLit for distributed tracing. A useful feature of this integration is the shouldExportSpan lambda, which allows you to filter the spans that are sent to the backend. AutoGen’s SingleThreadedAgentRuntime generates considerable internal telemetry noise — infrastructure spans for message routing, queue management, and runtime bookkeeping. The filter lambda lets you pass only the spans that correspond to actual agent interactions, keeping your trace backend clean and your dashboards readable.

What Good Observability Looks Like in Practice

When observability is working correctly, a post-mortem on a failed MAS request looks like this:

The trace shows that the user’s query arrived at 14:31:22. The Planner agent processed it in 8 seconds and dispatched three subtasks. Subtask 1 completed normally at 14:31:55. Subtask 2 ran from 14:31:55 to 14:33:12 — the span shows the Researcher agent received a malformed query from the Planner, and its output was a confused summary of a different topic. The Writer agent received this malformed output at 14:33:13, recognized the inconsistency, and produced a best-effort response that did not fully answer the user’s question.

Without tracing, this failure looks like “the system gave a bad answer.” With tracing, it looks like “the Planner’s query serialization for subtask 2 had a bug that corrupted the Researcher’s input.” The first framing makes fixing the problem impossible. The second makes it straightforward.

Good MAS observability also means watching trends over time, not just debugging individual failures. Which agents are getting slower as your content volume grows? Which agents produce outputs that consistently get low human feedback scores? Which inter-agent handoffs produce the most errors? These questions require a persistent trace store and query tooling — which is exactly what Langfuse and similar backends provide.


Frequently Asked Questions

How do agents avoid overwriting each other’s work?

The most reliable approach is the append-only event log pattern: instead of agents updating a central record, they append immutable events describing their actions. Concurrent appends are always safe because they do not conflict — they simply both land in the log. If you need a mutable shared artifact, use optimistic locking with version numbers, so that an agent that reads version 5 and tries to write version 6 will fail if another agent has already written version 6, forcing a retry with the current state.

What is the simplest shared memory solution for a small MAS?

For a small system — two to five agents — a shared SQLite database or a simple JSON file in a known location handles most needs. SQLite provides concurrent read access and serialized writes without requiring a separate server process. Agents can read findings freely and append new rows without complex locking. For semantic retrieval across accumulated knowledge, a local vector store like ChromaDB adds semantic search on top of the same basic pattern without external infrastructure.

Can agents in a MAS use different LLM providers?

Yes, and this is often a deliberate design choice. A supervisor agent might use a capable frontier model for high-level planning, while worker agents use faster, cheaper models for narrow, well-defined subtasks. Communication patterns (message passing, shared memory) work independently of which model backs each agent — the interface between agents is the message format or the shared store schema, not the model API. The only constraint is that the outputs each agent produces must conform to whatever schema downstream agents or the shared store expect.

How do I trace a single request through five or more agents?

Start with a trace ID generated at the entry point of your system — the moment the user’s request arrives. Pass this trace ID as metadata through every agent invocation, every message, and every write to shared state. Configure your observability backend (Langfuse, or any OTel-compatible backend) to index all spans by trace ID. When you need to investigate a specific request, query by its trace ID and you will retrieve every span from every agent that touched it, in order, with timing and I/O. This single-trace-ID discipline is the foundation of debuggable multi-agent systems.
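The discipline described above can be sketched end to end. The agent functions and in-memory span store are illustrative stand-ins for real instrumented calls:

```python
import uuid

collected_spans = []  # stand-in for Langfuse or any OTel-compatible backend

def call_agent(name, payload, trace_id):
    # Every invocation records a span carrying the trace ID it was handed.
    collected_spans.append({"trace_id": trace_id, "agent": name})
    return f"{name} handled: {payload}"

def handle_request(user_query):
    trace_id = uuid.uuid4().hex  # minted once, at the system boundary
    plan = call_agent("planner", user_query, trace_id)
    findings = call_agent("researcher", plan, trace_id)
    answer = call_agent("writer", findings, trace_id)
    return answer, trace_id

answer, tid = handle_request("compare the four patterns")

# Investigating this request = filtering the backend by its trace ID.
spans = [s for s in collected_spans if s["trace_id"] == tid]
print([s["agent"] for s in spans])  # → ['planner', 'researcher', 'writer']
```

The one rule that makes this work: `trace_id` is created exactly once and is a required argument everywhere, so no agent call can silently drop out of the trace.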


Next Steps

With communication patterns and state management in place, the next layer of MAS architecture is orchestration — the strategies that govern how agents are assigned work, in what order, and under what conditions. See Multi-Agent Orchestration Patterns for a deep dive into supervisor, hierarchical, and emergent coordination models.

For a concrete production example of the Tools/Skills separation described in this article, OpenClaw Multi-Agent System shows how these principles are implemented end-to-end in a real framework.

If you are building with CrewAI and want to implement the shared memory patterns discussed here, CrewAI Memory and Knowledge covers CrewAI’s built-in memory abstractions and how to connect them to external knowledge stores.