
Multi-Agent Architecture Topologies: Centralized vs Distributed

#multi-agent #architecture #topology #centralized #distributed #dag #chatdev #agentnet

How you wire your agents together determines everything — how your system scales, what happens when one component fails, how easy it is to debug, and how well it holds up against adversarial inputs. Topology is not just a design detail; it is the load-bearing structure of a multi-agent system (MAS).

This article maps the two foundational topologies — centralized hierarchical and distributed autonomous — with real examples from production-grade systems like ChatDev, MetaGPT, and AgentNet. By the end, you will know which architecture matches your task profile and what trade-offs you are accepting in each case.


Two Fundamental Topologies

Every multi-agent system sits somewhere on a spectrum from fully centralized to fully distributed. The position you choose determines four key system properties:

| Topology | Control Flow | Scalability | Fault Tolerance | Complexity |
|---|---|---|---|---|
| Centralized Hierarchical | Top-down, coordinator-directed | Moderate (coordinator is the ceiling) | Low (single point of failure) | Low to moderate |
| Distributed Autonomous | Peer-to-peer, self-organizing | High (horizontal scale-out) | High (emergent resilience) | High |
| Hybrid | Coordinator routes, workers operate autonomously | High | Moderate | Moderate |

The choice is never purely technical. Task structure, team skills, compliance requirements, and operational maturity all feed into it. But the architecture must match the task, not the other way around.


Centralized Hierarchical Architecture

In a centralized topology, one agent — typically called a coordinator, orchestrator, or router — receives the goal, decomposes it into sub-tasks, assigns each sub-task to a specialist agent, and aggregates results. All information flows through the coordinator. It maintains the complete view of task state at every point in execution.

The defining properties are:

  • Determinism. The coordinator controls sequencing, so execution order is predictable and reproducible.
  • Traceability. Because every decision routes through one node, audit logs are straightforward.
  • Ease of debugging. Follow the coordinator’s decision log and you can reconstruct exactly what happened and why.
  • Bottleneck risk. Every sub-task, result, and message passes through a single node. If that node is slow or wrong, everything downstream suffers.

The Coordinator Pattern

The simplest mental model: a Coordinator at the top of the graph, arrows pointing down to four to six specialist agents, no direct links between specialists.

                  [Coordinator]
                 /     |     |     \
        [Agent A] [Agent B] [Agent C] [Agent D]

The coordinator receives the user goal, breaks it into task_A ... task_D, dispatches each to the appropriate specialist, collects the outputs, and either assembles the final result or loops back with correction instructions.

This pattern is what LangGraph’s supervisor mode implements: a supervisor node that routes messages to worker nodes and receives their outputs before deciding the next step.
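The coordinator loop can be sketched in a few lines of plain Python. The specialist functions, the hard-coded plan, and the aggregation step below are illustrative stand-ins for LLM-backed agents, not LangGraph's actual API:

```python
# Minimal coordinator-pattern sketch. Specialist functions and the
# decomposition logic are stand-ins for real LLM-backed agents.

def agent_a(task):
    return f"A handled: {task}"

def agent_b(task):
    return f"B handled: {task}"

SPECIALISTS = {"research": agent_a, "summarize": agent_b}

def coordinator(goal):
    # 1. Decompose the goal into (specialist, sub-task) pairs.
    plan = [
        ("research", f"gather facts for '{goal}'"),
        ("summarize", f"condense findings for '{goal}'"),
    ]
    # 2. Dispatch each sub-task in order and collect the outputs.
    results = [SPECIALISTS[name](task) for name, task in plan]
    # 3. Aggregate into a final result.
    return " | ".join(results)
```

All state lives in the coordinator: the plan, the dispatch order, and the aggregation are visible in one place, which is exactly why this topology is easy to trace and debug.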

ChatDev: The Software Factory Model

ChatDev maps each stage of the software development lifecycle to a distinct agent role: CEO, CTO, Programmer, Reviewer, QA Tester. A central coordinator assigns work in waterfall sequence — requirements to design to implementation to testing to delivery — with each agent passing its output to the next in the chain.

This sequential-centralized model has a specific set of strengths:

  • Clarity. Every agent knows exactly what role it plays and what it is expected to produce.
  • Predictability. For a well-defined software task with a clear specification, the output is consistent across runs.
  • Low coordination overhead. The sequential handoff means agents do not need to negotiate or discover each other dynamically.

The weaknesses are equally clear:

  • Brittleness. When requirements change mid-task, the fixed role structure has no mechanism to adapt. The waterfall sequence does not accommodate feedback loops.
  • Coordinator bottleneck. All routing and sequencing happens at the top. If the coordinator misinterprets the goal, every downstream agent inherits that error.
  • Scaling ceiling. Adding more agents to a sequential pipeline adds latency, not capacity.

ChatDev excels at repetitive software generation tasks with known inputs and well-specified outputs. It performs poorly on exploratory or ambiguous tasks where the requirements evolve during execution.

MetaGPT: SOP as Agent Instructions

MetaGPT advances the centralized model in two important ways. First, it encodes Standard Operating Procedures (SOPs) as structured inter-agent messages rather than implicit role descriptions. Each agent receives not just a task but a formal process definition — inputs, expected outputs, constraints, and success criteria.

Second, MetaGPT introduces a metacognitive self-correction layer. Agents can evaluate their own reasoning outputs, identify inconsistencies, and revise before passing results downstream. This vertical self-improvement loop reduces error propagation without requiring a separate review agent.

The practical result is that MetaGPT handles more complex software projects than ChatDev. When process knowledge can be formalized — when you can write down the steps an expert would follow — MetaGPT’s SOP encoding converts that knowledge into agent behavior reliably.

The centralized constraint still applies: the coordinator distributes SOPs and manages task state. MetaGPT does not escape the coordinator bottleneck, but it reduces the coordinator’s error rate by building quality checks into the agents themselves.

Specialized Task Patterns

Two more centralized patterns are worth naming for the specific problems they solve.

TRANSAGENT uses four specialist agents, each responsible for a narrow alignment subtask in code translation: one aligns source program behavior, one identifies execution divergences, one narrows the error space, and one proposes corrections. Because each agent owns exactly one subtask, the system can localize errors precisely — a mistake in the corrector does not contaminate the error-narrowing analysis.

CodeCoR applies iterative feedback within a centralized pipeline. Each output cycle feeds back as structured text input to the repair agent. The pipeline does not proceed linearly to completion; it loops until the output passes acceptance criteria. This is useful when acceptance criteria are checkable programmatically, such as passing a test suite or matching a schema.
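The loop-until-acceptance cycle can be sketched as follows. The generator, repair agent, and acceptance check are illustrative stubs in the spirit of CodeCoR's feedback loop, not its actual implementation:

```python
# Loop-until-acceptance sketch: generate a candidate, check it against
# a programmatic acceptance criterion, and feed failures to a repair step.
# All three agents here are stand-ins for LLM-backed components.

def generate(spec):
    return "def add(a, b): return a - b"   # deliberately wrong first draft

def accepts(code):
    # Programmatic acceptance check: run the candidate against a test.
    scope = {}
    exec(code, scope)
    return scope["add"](2, 3) == 5

def repair(code, spec):
    return "def add(a, b): return a + b"   # stand-in for an LLM repair agent

def pipeline(spec, max_rounds=3):
    candidate = generate(spec)
    for _ in range(max_rounds):
        if accepts(candidate):
            return candidate
        candidate = repair(candidate, spec)
    raise RuntimeError("no candidate passed acceptance criteria")
```

The key design choice is the bounded loop: iteration stops at `max_rounds` so a candidate that never converges fails loudly instead of burning calls forever.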

Both patterns share the centralized property: a fixed structure, well-defined roles, deterministic execution order. They fit tasks where acceptance criteria are clear and the process can be specified in advance.

Centralized Architecture Trade-offs

| Aspect | Centralized |
|---|---|
| Coordinator bottleneck | High risk — all traffic flows through one point |
| Debugging | Easy — follow the coordinator’s decision log |
| Changing requirements | Hard — rigid role structure does not accommodate mid-task pivots |
| Fault tolerance | Low — coordinator failure halts the entire system |
| Cost model | Predictable — fixed number of calls per task |
| Best for | Defined workflows, compliance requirements, repetitive structured tasks |

Distributed Autonomous Architecture

In a distributed topology, there is no central coordinator. Each agent is autonomous: it decides which other agents to consult, which tools to invoke, how to route its own outputs, and when its portion of the task is complete. Agents organize themselves through peer-to-peer communication, shared memory, or graph-based connectivity.

The key properties invert relative to the centralized model:

  • Resilience. No single node failure brings down the system. Other agents continue operating and can redistribute work.
  • Flexibility. Agents adapt dynamically to changing task demands without restructuring the entire pipeline.
  • Emergent behavior. System-level capabilities arise from the interactions of individual agents, not from top-level orchestration.
  • Debugging difficulty. There is no single audit log. Tracing a failure requires reconstructing the distributed interaction history.

The Mesh Pattern

In a fully distributed system, agent topology looks like a mesh rather than a tree:

[Agent A] ——— [Agent B]
    \               \
  [Agent C] ——— [Agent D]
        \          /
       [Agent E]

Any agent can communicate with any other directly. Routing is emergent — agents find the path to the information or capability they need by querying the network, not by following a prescribed flow.
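Emergent routing can be illustrated with a small breadth-first capability search over peer links. The agents, skills, and link structure below are invented for the example; there is no central routing table, only local neighbor lists:

```python
# Emergent routing sketch: an agent searches its peer network breadth-first
# for a capability instead of consulting a central coordinator.
# Agents, skills, and links are illustrative.

PEERS = {
    "A": {"skills": {"search"},    "links": ["B", "C"]},
    "B": {"skills": {"summarize"}, "links": ["A", "D"]},
    "C": {"skills": {"translate"}, "links": ["A", "D", "E"]},
    "D": {"skills": {"code"},      "links": ["B", "C", "E"]},
    "E": {"skills": {"review"},    "links": ["C", "D"]},
}

def find_capable(start, skill):
    # Breadth-first search over peer links, starting from the requester.
    frontier = [start]
    visited = {start}
    while frontier:
        node = frontier.pop(0)
        if skill in PEERS[node]["skills"]:
            return node
        for peer in PEERS[node]["links"]:
            if peer not in visited:
                visited.add(peer)
                frontier.append(peer)
    return None   # no agent in the reachable network has the skill
```

Note that each hop only inspects local neighbor lists; the path to a capability is discovered at query time, which is what makes routing in a mesh emergent rather than prescribed.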

The mesh is the theoretical extreme. Most practical distributed systems use a directed structure to constrain the interaction space while preserving the resilience properties.

AgentNet: DAG-Based Decentralized MAS

AgentNet, presented at NeurIPS 2025, is the most fully realized distributed multi-agent architecture in the current research landscape. It replaces the central coordinator with a Directed Acyclic Graph (DAG) of agent relationships.

The key design decisions in AgentNet:

DAG structure. The directed acyclic graph prevents circular dependencies while preserving the flexible connectivity of a mesh. Agents can have multiple upstream and downstream connections, but execution cannot loop back. This gives the system the flexibility of a network without the risk of infinite loops.
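The acyclicity guarantee can be made concrete with Python's standard-library `graphlib`: a valid execution order is a topological sort of the agent graph, and introducing a cycle raises an error at scheduling time. The agent names and edges below are illustrative, not AgentNet's actual interface:

```python
# Executing agents in a DAG: topological order guarantees each agent runs
# only after all of its upstream dependencies, and a cycle is rejected
# before execution starts. Agent names and edges are illustrative.
from graphlib import TopologicalSorter

# Each agent maps to the set of agents it depends on (its upstream edges).
graph = {
    "planner":  set(),
    "coder":    {"planner"},
    "tester":   {"coder"},
    "reviewer": {"coder", "tester"},
}

# static_order() raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(graph).static_order())
```

Running agents in `order` is what rules out infinite loops by construction: downstream agents can consume upstream outputs, but execution can never flow backwards.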

RAG-based distributed memory. Each agent maintains its own Retrieval-Augmented Generation memory layer. Rather than sharing a central knowledge store, agents retrieve context from their local memory and exchange structured information with connected agents. This eliminates the memory bottleneck that limits centralized systems under high agent counts.

Auto-adjusting connectivity. Agents form and break connections based on real-time task demand. An agent facing a novel subtask it cannot handle alone will query connected agents for relevant capabilities and dynamically add a connection if a match is found. Connections that are no longer useful are pruned. The graph topology is not static — it evolves with the task.

Privacy-preserving cooperation. AgentNet is designed for inter-organization MAS where each participating organization controls its own agents and data. No central orchestrator sees all agent states. Cooperation happens through structured message passing with defined interfaces, not shared state. This makes AgentNet viable for multi-party scenarios — competitive organizations sharing a workflow without sharing proprietary information.

Emergent collective intelligence. Without a coordinator directing outcomes, the system converges on solutions through the aggregate behavior of its agents. No single agent plans the full solution; the plan emerges from the interactions.

Distributed Architecture Trade-offs

| Aspect | Distributed |
|---|---|
| Coordinator bottleneck | None — fully decentralized |
| Debugging | Hard — no single audit log; requires distributed tracing tooling |
| Changing requirements | Flexible — agents adapt connectivity dynamically |
| Fault tolerance | High — individual agent failure does not halt the system |
| Cost model | Variable — agent interactions are emergent and harder to predict |
| Best for | Open-ended research tasks, dynamic environments, privacy-sensitive multi-party workflows |

Security Considerations in Distributed MAS

Distributed architectures expand the attack surface significantly. In a centralized system, securing the coordinator and the coordinator-to-agent channels covers most of the threat model. In a distributed system, every agent-to-agent channel is a potential attack vector.

The threat categories unique to or amplified in distributed MAS:

Wormhole attack. A compromised agent is inserted into the network — either by exploiting a vulnerability in agent provisioning or by injecting malicious content into an agent’s tool results. The compromised agent relays manipulated information through the network, influencing downstream agents without triggering obvious failure signals. Because distributed systems rely on peer trust, a single compromised node can corrupt a large portion of the network’s working knowledge.

Denial of Service in MAS. In a centralized system, a DoS attack targets the coordinator. In a distributed system, flooding one agent with tasks starves its connected neighbors of compute resources. Because resource coupling between agents is implicit rather than declared, the starvation effect can propagate across the network topology in ways that are hard to anticipate.

Prompt injection via agent conversation context. An adversarial payload embedded in a tool result, external document, or API response is processed by Agent A and included in the message Agent A sends to Agent B. Agent B now has the malicious content in its context and may act on it as if it were legitimate instruction. This attack chain becomes more dangerous in distributed systems because the message propagation path is longer and less predictable than in a centralized pipeline.

Defense strategies:

  • Runtime permission evaluation. Each agent evaluates whether a requested action is within its defined permission scope before executing. Permissions are checked at runtime, not assumed from initialization state.
  • Sandboxed tool execution. Tools run in isolated environments. A compromised tool result cannot access the agent’s full context or communicate with other agents directly.
  • Message authentication between agents. Cryptographic signing of inter-agent messages allows each recipient to verify that the message originated from a known agent and has not been tampered with in transit.
  • Scope-limited agent roles. Even in a distributed system, constraining what each agent can do limits the blast radius of a compromise.
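The message-authentication defense can be sketched with the standard-library `hmac` module. A real deployment would use per-agent asymmetric keys with rotation; the shared secret below only keeps the example self-contained:

```python
# Sketch of inter-agent message authentication using an HMAC over a
# canonical JSON encoding. The shared secret is illustrative; production
# systems would use per-agent keys with rotation.
import hashlib
import hmac
import json

SECRET = b"shared-agent-secret"  # illustrative only, never hard-code keys

def sign(message: dict) -> dict:
    payload = json.dumps(message, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"body": message, "sig": tag}

def verify(envelope: dict) -> bool:
    payload = json.dumps(envelope["body"], sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison resists timing attacks on the tag.
    return hmac.compare_digest(expected, envelope["sig"])
```

A recipient that verifies every envelope before acting on it cannot be fed tampered instructions in transit, which directly narrows the wormhole and injection paths described above.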

Security investment is not optional in distributed MAS. The flexibility that makes distributed systems powerful also makes them harder to secure systematically.


Choosing Your Topology

Use this decision tree as a starting point, not a final answer:

Is the task well-defined with fixed steps?
  → YES → Centralized hierarchical (ChatDev / MetaGPT pattern)
  → NO  → Is the task open-ended with dynamic sub-tasks?
    → YES → Distributed (AgentNet pattern)
    → NO  → Is human-in-the-loop required at checkpoints?
      → YES → Centralized with approval gates (LangGraph supervisor)
      → NO  → Hybrid: coordinator routes to autonomous specialist clusters

Three questions to ask before committing to a topology:

1. How well-defined is the task? If you can write a complete SOP for the task — inputs, steps, outputs, acceptance criteria — centralized will serve you well. If the task requires the system to discover what needs to be done as it proceeds, centralized will constrain you.

2. What are your fault tolerance requirements? A centralized system fails completely when its coordinator crashes, no matter how modest the load. A distributed system degrades gracefully. If uptime matters more than predictability, distribute.

3. Who owns the data? In single-organization deployments, shared state through a coordinator is acceptable. In multi-party workflows where data sovereignty is a requirement, distributed architectures like AgentNet are the only viable option.


The Hybrid Middle Ground

Most production systems that handle real-world complexity use neither pure centralized nor pure distributed architecture. They use a hybrid: a lightweight coordinator that handles top-level routing, with semi-autonomous specialist clusters handling sub-task execution independently.

LangGraph popularized this pattern. A supervisor node receives the goal and routes to worker nodes. Each worker node operates autonomously within its scope — it can loop, call tools, retrieve context, and revise its output without checking back with the supervisor at every step. The supervisor re-enters the picture when a worker signals completion or requests guidance.
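The supervisor-and-autonomous-worker shape can be sketched as two nested loops. The worker's internal revision logic and acceptance check below are illustrative stand-ins, not LangGraph's API:

```python
# Hybrid pattern sketch: the supervisor routes once per sub-task, while
# each worker loops autonomously within its own scope and only surfaces
# its final output. Revision and acceptance logic are stand-ins.

def worker(task, max_revisions=3):
    draft = f"draft of {task}"
    for _ in range(max_revisions):
        if "ok" in draft:          # stand-in acceptance check
            break
        draft = draft + " ok"      # stand-in self-revision step
    return draft

def supervisor(goal):
    # Top-level routing stays centralized and traceable.
    subtasks = [f"{goal}: part 1", f"{goal}: part 2"]
    # Workers iterate internally; the supervisor sees only final outputs.
    return [worker(t) for t in subtasks]
```

The division of labor is the point: the supervisor's loop is short and auditable, while the messy iteration happens inside each worker without inflating the supervisor's decision log.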

The hybrid gives you:

  • Predictable top-level routing. The supervisor maintains the system-wide task state, making the overall execution traceable.
  • Flexible sub-task execution. Workers handle the messy, iterative parts of their sub-tasks without burdening the supervisor with every intermediate step.
  • Configurable fault tolerance. If a worker fails, the supervisor can route the sub-task to a backup or request human intervention without losing the entire task state.

For most teams building production AI agent systems in 2026, the hybrid is the right starting point. Begin with a clear supervisor and well-scoped worker clusters, then push more autonomy into the workers as you build confidence in their behavior.


Frequently Asked Questions

Can a centralized MAS scale to thousands of agents?

Technically possible, but impractical without significant engineering investment. The coordinator becomes a throughput bottleneck as the agent count grows — every agent’s output must route through a single node before the next step can proceed. Systems like ChatDev and MetaGPT work well at the scale of tens of agents but show throughput degradation beyond that.

To scale a centralized system, teams typically shard the coordinator into multiple domain-specific coordinators (one per functional area) with a lightweight top-level router. This is effectively a hybrid architecture. For true horizontal scale, distributed topologies handle agent counts in the hundreds to thousands more naturally because there is no single choke point.

Is AgentNet / distributed MAS production-ready in 2026?

AgentNet demonstrated strong results at NeurIPS 2025, but the pattern is still maturing in production deployments. The core challenges are operational: distributed tracing tooling for MAS is less mature than centralized orchestration observability, and the emergent behavior of distributed systems is harder to test exhaustively than deterministic pipelines.

Teams with strong distributed systems engineering backgrounds and explicit requirements for fault tolerance or data sovereignty should evaluate it seriously. Teams earlier in their AI agent journey will generally find the hybrid pattern more tractable. Expect the tooling ecosystem around distributed MAS to mature significantly over the next 12 to 18 months.

How does topology affect cost?

Centralized architectures have more predictable cost profiles. Because the coordinator controls sequencing, you can reason statically about the number of LLM calls per task and bound the cost per execution.

Distributed architectures have variable cost. Agents form and break connections dynamically, and the number of inter-agent messages per task depends on runtime behavior rather than a fixed plan. Emergent interactions can produce more LLM calls than a centralized equivalent, especially in early iterations before the system’s behavior is well-understood.

Budget accordingly: centralized for cost-sensitive or cost-predictable workloads, distributed for workloads where flexibility and resilience justify higher and variable per-task cost.
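The two cost profiles can be compared with a back-of-envelope model. All counts and the peer-query factor below are assumptions for the sketch, not benchmarks of any real system:

```python
# Back-of-envelope cost model contrasting the two topologies.
# Call counts and the peer-query factor are illustrative assumptions.

def centralized_calls(n_subtasks, review_rounds=1):
    # One planning call + one call per sub-task per round + one aggregation.
    # Statically boundable before the task runs.
    return 1 + n_subtasks * review_rounds + 1

def distributed_calls(n_subtasks, avg_peer_queries=2.5):
    # Each sub-task triggers a variable number of peer consultations,
    # so the total is an estimate, not a bound.
    return round(n_subtasks * (1 + avg_peer_queries))
```

The centralized figure is a hard upper bound you can compute before execution; the distributed figure is only an expectation, which is exactly the budgeting difference described above.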

What is the right topology for a coding assistant?

It depends on the scope. For a coding assistant that handles well-defined tasks — generate a function, explain this code, write a test for this class — a centralized pattern with a router and two to three specialist agents (generator, reviewer, tester) is clean, predictable, and cheap.

For a coding assistant that handles open-ended tasks — build a feature from a vague spec, debug an unfamiliar codebase, architect a new service — a hybrid works better. A supervisor routes the high-level task, and specialist clusters (codebase explorer, implementation agent, testing agent, documentation agent) work semi-autonomously within their scope, surfacing results to the supervisor when they need coordination.

Systems like AutoGen’s group chat mode implement exactly this: a lightweight coordinator managing conversation flow between specialist agents that each operate within their defined capabilities. See Mastering AutoGen Group Chat for Collaborative AI Workflows for a detailed implementation guide.


Next Steps

The topology you choose sets the stage for everything else in your multi-agent system. Once you have the right structure, the next layer is orchestration: how tasks flow between agents, how outputs are validated, and how the system recovers from partial failures.
