A Survey of LLM-based Autonomous Agents: Paper Explained

The landmark paper “A Survey on Large Language Model based Autonomous Agents” (Wang et al., 2023) is one of the most cited references in the AI agent field, and for good reason. It synthesizes hundreds of research papers into a single coherent framework, giving developers a mental model for how autonomous agents are actually built. This article explains the survey paper in plain English, breaks down its core architecture taxonomy, and shows you how to map those concepts to real code.

If you’ve been reading about ReAct, Tree of Thought, or other advanced reasoning patterns and wondered how they all fit together, this paper is the map you’ve been missing.

Why This Survey Paper Matters

Before the survey appeared in August 2023, the AI agent landscape was a collection of disconnected experiments: AutoGPT, BabyAGI, HuggingGPT, and dozens of others, each solving the problem its own way. The survey paper’s contribution was taxonomic: it looked at all of these systems and asked what they have in common.

The answer turned out to be remarkably consistent. Nearly every LLM-based autonomous agent can be decomposed into four subsystems:

Profile — Who is the agent?
Memory — What does the agent remember?
Planning — How does the agent decide what to do?
Action — What can the agent actually do?

The paper itself is organized around three themes: construction, application, and evaluation. The “construction” part defines the agent’s architecture, which is where the four modules above come from. The “application” part covers deployment domains (social simulation, software engineering, scientific research). The “evaluation” part discusses how agents are assessed, including benchmarks.

For working developers, the construction framework is what matters most: it is the blueprint that most production agent systems follow in some form.

The Four-Module Architecture

Let’s go deep on each module, using the paper’s definitions.

Profile Module

The profile module defines the agent’s identity and role. This is implemented through the system prompt: you specify the agent’s name, occupation, personality, constraints, and goals.

The paper identifies three profiling strategies:

Handcrafted profiles — manually written personas (most common in production)
LLM-generated profiles — the LLM generates its own profile from a seed description
Dataset-aligned profiles — profiles derived from real-world data (e.g., census data for social simulations)

For most developers, handcrafted profiles are the right starting point. The gstack Gears and Personas approach is a practical implementation of exactly this concept — assigning structured roles to agents to improve output quality and reduce hallucination.
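As a rough illustration of a handcrafted profile, the sketch below stores the profile fields in a small data class and renders them into a system prompt. The `AgentProfile` class, its field names, and the example persona are assumptions made for this article, not something the paper prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class AgentProfile:
    """A handcrafted profile: identity fields rendered into a system prompt."""
    name: str
    role: str
    goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def to_system_prompt(self) -> str:
        lines = [f"You are {self.name}, {self.role}."]
        if self.goals:
            lines.append("Goals:")
            lines.extend(f"- {goal}" for goal in self.goals)
        if self.constraints:
            lines.append("Constraints:")
            lines.extend(f"- {rule}" for rule in self.constraints)
        return "\n".join(lines)


# Hypothetical persona, for illustration only
profile = AgentProfile(
    name="Ada",
    role="a research assistant specializing in AI papers",
    goals=["Summarize findings clearly for developers"],
    constraints=["If uncertain, say so rather than guessing"],
)
print(profile.to_system_prompt())
```

Keeping the profile as structured data rather than one long string makes it easier to version, test, and swap personas between agents.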

Memory Module

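Memory is how an agent retains information across turns and tasks. The paper discusses memory in terms of structure (short-term only versus hybrid short- and long-term), format, and operations (reading, writing, reflection), borrowing its short-term/long-term split from cognitive science. A common practical breakdown along those lines has five types: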

Type	Description	Implementation
Sensory Memory	Raw input in current context	Prompt window
Short-Term Memory	Working context for current task	In-context messages
Long-Term Memory	Persistent knowledge across sessions	Vector database / external storage
Episodic Memory	Past experiences and interactions	Logged conversation history
Semantic Memory	General world knowledge	LLM weights + RAG

The most important architectural decision you’ll make is how to implement long-term memory. Retrieval over external storage, in the style of Retrieval-Augmented Generation (RAG), is the standard solution among the systems the paper surveys. If you’re not familiar with RAG, the What Is RAG? guide is essential reading before building production agents.
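To make the retrieval idea concrete, here is a minimal sketch of a long-term memory with write and read operations. It assumes a toy word-count "embedding" and cosine similarity so that it runs with the standard library only; a real system would swap in an embedding model and a vector database. The names `LongTermMemory`, `embed`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LongTermMemory:
    def __init__(self) -> None:
        self.records: list[tuple[str, Counter]] = []

    def write(self, text: str) -> None:
        """Store the text together with its vector."""
        self.records.append((text, embed(text)))

    def read(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = LongTermMemory()
memory.write("User prefers summaries under 100 words.")
memory.write("The ReAct paper interleaves reasoning and acting.")
memory.write("User's project uses a Postgres database.")

top = memory.read("Which paper interleaves reasoning with acting?", k=1)
print(top)  # ['The ReAct paper interleaves reasoning and acting.']
```

The retrieved texts would then be injected into the prompt before the next model call, which is the same pattern RAG uses for documents.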

Planning Module

Planning is how an agent decomposes a complex goal into executable steps. The survey identifies two major planning paradigms:

Planning without feedback: the agent generates a full plan upfront (like a waterfall process). Examples: Chain-of-Thought, Tree of Thoughts.
Planning with feedback: the agent plans, acts, observes results, and replans. Examples: ReAct, Reflexion, Voyager.

The paper argues that planning without feedback struggles on complex, long-horizon tasks because real-world environments are unpredictable: a plan that looks correct in theory frequently breaks on contact with reality. Planning with feedback addresses this by revising the plan as results come in.
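The difference between the two paradigms can be sketched without any LLM at all. The toy below assumes a hypothetical environment in which the "fetch" step fails once before succeeding; a simple retry stands in for the replanning an LLM would do after observing a failure. All function names are illustrative.

```python
def execute(step: str, state: dict) -> bool:
    """Hypothetical environment: 'fetch' fails on its first attempt only."""
    if step == "fetch" and not state.get("fetch_failed_once"):
        state["fetch_failed_once"] = True
        return False
    state.setdefault("done", []).append(step)
    return True


def run_without_feedback(plan: list[str]) -> list[str]:
    """Fix the whole plan upfront and stop at the first failure."""
    state: dict = {}
    for step in plan:
        if not execute(step, state):
            break  # no observation is fed back, so there is no recovery
    return state.get("done", [])


def run_with_feedback(plan: list[str], max_steps: int = 10) -> list[str]:
    """Choose the next step only after observing the previous result."""
    state: dict = {}
    remaining = list(plan)
    for _ in range(max_steps):
        if not remaining:
            break
        if execute(remaining[0], state):
            remaining.pop(0)  # observed success: move on
        # observed failure: keep the step and try again on the next pass
    return state.get("done", [])


plan = ["search", "fetch", "summarize"]
print(run_without_feedback(plan))  # ['search']
print(run_with_feedback(plan))     # ['search', 'fetch', 'summarize']
```

The feedback variant costs extra iterations, which mirrors the trade-off in real agents: more model calls in exchange for the ability to recover.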

Action Module

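The action module translates the agent’s decisions into real-world effects. The paper analyzes actions along several dimensions (their goal, how they are produced, the available action space, and their impact). For practical purposes, they fall into three groups: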

Task-completion actions — calling tools, writing files, executing code
Communication actions — sending messages to humans or other agents
Memory manipulation actions — storing or retrieving from memory systems

The action space is what distinguishes a chatbot from an autonomous agent. A chatbot only communicates; an agent can do things.
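One common way to define an action space in code is a registry that maps action names to functions. The sketch below is an assumption-laden illustration: the `action` decorator, the `dispatch` helper, and the two example actions are hypothetical and not taken from the paper or any particular framework.

```python
from typing import Callable

# Registry of everything the agent is allowed to do, keyed by action name
ACTIONS: dict[str, Callable[..., str]] = {}


def action(name: str):
    """Register a function as an action the agent may invoke by name."""
    def register(fn):
        ACTIONS[name] = fn
        return fn
    return register


@action("write_note")
def write_note(text: str) -> str:
    # Task-completion action (stubbed: nothing is written to disk)
    return f"saved note ({len(text)} chars)"


@action("send_message")
def send_message(to: str, body: str) -> str:
    # Communication action (stubbed: nothing is actually sent)
    return f"message to {to}: {body}"


def dispatch(name: str, arguments: dict) -> str:
    """Run a registered action; report unknown names or bad arguments as text."""
    fn = ACTIONS.get(name)
    if fn is None:
        return f"error: unknown action '{name}'"
    try:
        return fn(**arguments)
    except TypeError as exc:
        return f"error: bad arguments for '{name}': {exc}"


print(dispatch("send_message", {"to": "alice", "body": "done"}))
print(dispatch("delete_repo", {}))
```

Returning errors as text rather than raising lets the model observe the failure and choose a different action on the next step.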

Architecture Overview

Here’s how the four modules interact in a typical agent loop:

flowchart TD
    Input([User Goal]) --> Profile[Profile Module\nRole + Constraints]
    Profile --> Planning[Planning Module\nDecompose into Steps]
    Planning --> Memory[(Memory Module\nShort + Long Term)]
    Memory --> Planning
    Planning --> Action[Action Module\nTool Calls / Output]
    Action --> Observe[Observe Result]
    Observe --> Planning
    Observe --> Output([Final Response])

    style Profile fill:#E1F5EE,stroke:#085041,color:#085041
    style Memory fill:#FAEEDA,stroke:#633806,color:#633806
    style Planning fill:#E1F5EE,stroke:#085041,color:#085041
    style Action fill:#FCEBEB,stroke:#501313,color:#501313

This loop is sometimes called the Observe-Plan-Act cycle or the ReAct loop. The key insight from the survey is that all the seemingly different agent frameworks are variations of this same core cycle — they differ in how they implement each module, not in the overall structure.

Implementing the Framework in Code

Let’s build a minimal agent that demonstrates all four modules. This is a simplified implementation using the Anthropic Python SDK; the search tool returns canned data, so the only external dependency is the API itself.

pip install anthropic

import anthropic
import json
from datetime import datetime

# ── Profile Module ──────────────────────────────────────────────
SYSTEM_PROMPT = """You are a research assistant agent specializing in AI papers.

Your capabilities:
- Search for information using the search tool
- Summarize findings clearly for developers
- Always cite sources when possible

Constraints:
- Only answer questions related to AI/ML research
- If uncertain, say so rather than hallucinating
"""

# ── Memory Module ───────────────────────────────────────────────
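class AgentMemory:
    def __init__(self, max_short_term: int = 10):
        self.short_term: list[dict] = []   # in-context conversation
        self.long_term: list[dict] = []    # archive (in-process only in this sketch)
        self.max_short_term = max_short_term

    def add(self, role: str, content) -> None:
        # content is a plain string or a list of content blocks (tool_use / tool_result)
        self.short_term.append({"role": role, "content": content})
        while len(self.short_term) > self.max_short_term and self._archive_oldest_exchange():
            pass

    def _archive_oldest_exchange(self) -> bool:
        # Cut only at the next plain-text user message so a tool_result is never
        # separated from its tool_use, and the window always starts with a user turn.
        for i in range(1, len(self.short_term)):
            msg = self.short_term[i]
            if msg["role"] == "user" and isinstance(msg["content"], str):
                # Simplified archive; use a vector DB in production
                self.long_term.extend(self.short_term[:i])
                del self.short_term[:i]
                return True
        return False  # no safe cut point yet (e.g. one long tool loop)

    def get_context(self) -> list[dict]:
        return self.short_term.copy()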

# ── Action Module — Tool Definitions ───────────────────────────
tools = [
    {
        "name": "search_papers",
        "description": "Search for AI research papers by topic or title.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query for finding papers"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "get_current_date",
        "description": "Returns today's date. Use this when asked about recency.",
        "input_schema": {
            "type": "object",
            "properties": {}
        }
    }
]

def execute_tool(tool_name: str, tool_input: dict) -> str:
    """Action execution layer — maps tool names to real implementations."""
    if tool_name == "search_papers":
        # In production, integrate with Semantic Scholar or arXiv API
        query = tool_input.get("query", "")
        return json.dumps({
            "results": [
                {
                    "title": "A Survey on Large Language Model based Autonomous Agents",
                    "authors": "Wang et al.",
                    "year": 2023,
                    "summary": "Comprehensive taxonomy of LLM agent architectures covering profile, memory, planning, and action modules."
                }
            ],
            "query": query
        })
    elif tool_name == "get_current_date":
        return datetime.now().strftime("%Y-%m-%d")
    return "Tool not found"

# ── Planning Module — The Agent Loop ───────────────────────────
class AutonomousAgent:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.memory = AgentMemory()
        self.model = "claude-opus-4-6"

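    def run(self, user_input: str, max_steps: int = 10) -> str:
        """Main agent loop: Plan → Act → Observe → Replan."""
        self.memory.add("user", user_input)

        steps = 0
        while True:
            # Safety cap: stop instead of looping (and spending tokens) forever
            steps += 1
            if steps > max_steps:
                return f"Stopped after {max_steps} steps without a final answer."

            # Planning step: ask the LLM what to do next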
            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                system=SYSTEM_PROMPT,
                tools=tools,
                messages=self.memory.get_context()
            )

            # Check if the agent wants to use a tool (Act)
            if response.stop_reason == "tool_use":
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        print(f"  [Agent] Using tool: {block.name}({block.input})")
                        result = execute_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result
                        })

                # Add assistant response + tool results to memory (Observe)
                self.memory.add("assistant", response.content)
                self.memory.add("user", tool_results)

            elif response.stop_reason == "end_turn":
                # Agent has finished planning and acting — extract final text
                final_text = next(
                    (block.text for block in response.content if hasattr(block, "text")),
                    "No response generated."
                )
                self.memory.add("assistant", final_text)
                return final_text

            else:
                return f"Unexpected stop reason: {response.stop_reason}"

# ── Entry Point ─────────────────────────────────────────────────
if __name__ == "__main__":
    agent = AutonomousAgent()

    questions = [
        "What are the four main modules in the LLM agent survey paper?",
        "When did you last search? What date is it today?"
    ]

    for question in questions:
        print(f"\nUser: {question}")
        answer = agent.run(question)
        print(f"Agent: {answer}")

Save the code above as agent_survey.py, then run it:

export ANTHROPIC_API_KEY=your_key_here
python agent_survey.py

This code demonstrates all four modules from the survey:

Profile → SYSTEM_PROMPT
Memory → AgentMemory class with short-term/long-term separation
Planning → the while True loop with LLM-driven decision making
Action → execute_tool() with the tool dispatch logic

Key Findings and Limitations

The survey is honest about what LLM agents cannot do well yet. The most important limitations the authors identify:

Hallucination compounds through planning. Each step in a multi-step plan is an opportunity for the LLM to hallucinate. In a 10-step plan, small errors accumulate into completely wrong outcomes. This is why grounding through tools (like the search_papers tool above) is critical — don’t trust the LLM’s internal “knowledge” for facts.
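A back-of-the-envelope calculation shows how quickly errors compound. The sketch assumes each step is independently correct with the same probability, which is a simplification, and the 95% figure is an illustrative assumption rather than a number from the paper.

```python
def plan_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step is correct, assuming independent steps."""
    return per_step_accuracy ** steps


for steps in (1, 5, 10, 20):
    p = plan_success_probability(0.95, steps)
    print(f"{steps:>2} steps -> {p:.2f}")
# Prints 0.95, 0.77, 0.60 and 0.36 for 1, 5, 10 and 20 steps
```

Under these assumptions, a step that is right 95% of the time yields a 10-step plan that is fully correct only about 60% of the time.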

Long-horizon tasks remain unsolved. Most agent benchmarks test tasks completable in 5–20 steps. Real-world tasks (refactoring a codebase, running a research project) require hundreds of steps. No current architecture reliably handles this. Multi-agent approaches such as AutoGen’s group chat system (see Getting Started with AutoGen) are attempts to divide this complexity across specialized agents.

Evaluation is immature. The paper notes that most agent benchmarks test toy environments that don’t reflect production complexity. If your agent scores well on WebArena or ALFWorld, that’s a good sign — but it doesn’t guarantee production reliability.

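Token costs scale non-linearly. Each iteration of a ReAct loop resends the accumulated conversation, so total input tokens grow roughly with the square of the number of steps. A loop that runs 20 iterations with a large context window can consume on the order of 100x the tokens of a single-turn response. Design your memory module to aggressively prune irrelevant context.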
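The arithmetic behind that growth can be sketched in a few lines. The model below assumes every call resends the full history and that each step adds a fixed number of tokens; the specific sizes (1,000 tokens of base prompt, 500 tokens added per step) are made-up illustrative values.

```python
def total_input_tokens(base: int, per_step: int, steps: int) -> int:
    """Sum of prompt sizes when each call resends the whole history."""
    return sum(base + per_step * i for i in range(steps))


single = total_input_tokens(base=1000, per_step=500, steps=1)
loop = total_input_tokens(base=1000, per_step=500, steps=20)
print(single, loop, round(loop / single))  # 1000 115000 115
```

Capping or summarizing the history turns the per-call cost into a constant, which brings the total back to roughly linear in the number of steps.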

Frequently Asked Questions

What is the main contribution of the LLM agent survey paper?

The paper’s primary contribution is a unified taxonomy. Before it was published, every agent system used different terminology and framing. The survey introduced the Profile-Memory-Planning-Action framework, which became the standard vocabulary for discussing agent architectures. It also catalogued over 150 existing agent systems, categorized their capabilities, and identified open research challenges — making it the go-to reference for anyone entering the field.

Is the survey paper still relevant given how fast AI is moving?

Yes, more than ever. The core architectural insight — that agents need profile, memory, planning, and action modules — has held up across GPT-4o, Claude 3.5+, and Gemini 1.5. What changes is the quality of each module as models improve. The framework remains the correct lens; only the implementations evolve. If anything, newer frameworks like LangGraph and CrewAI have converged more tightly on the paper’s taxonomy than earlier systems had.

What’s the difference between planning with and without feedback?

Planning without feedback (like Chain-of-Thought) generates a complete plan in one shot and executes it linearly. It’s fast and cheap but brittle — if step 3 fails, the whole plan breaks.

Planning with feedback (like ReAct) generates one step at a time, executes it, observes the result, and then generates the next step. It’s slower and more expensive but considerably more robust. For tasks involving real-world tools or APIs, feedback-based planning is usually the safer default.

How does long-term memory work in practice?

In the paper’s framework, long-term memory is external to the LLM — stored in a database (usually a vector database for semantic search). When the agent starts a new task, it queries long-term memory to retrieve relevant past experiences and injects them into the context window. This is architecturally identical to RAG: the “documents” are previous agent experiences rather than external text. Projects like Letta (formerly MemGPT) are specifically designed to make this memory architecture manageable at scale.

Should I read the full paper or is this explanation enough?

For getting started and building agents, this explanation covers what you need. Read the full paper if you’re: (a) doing research that needs to cite prior work, (b) building a novel agent system and want to position your contribution accurately, or (c) evaluating agent architectures for a production system and need the full comparison table of existing systems. The paper is freely available on arXiv and well-written; the introduction and the section on agent construction are the highest-value parts for practitioners.