
ReAct: Reasoning and Acting — The Paper Behind Agent Frameworks

#react #reasoning #acting #agents #llm #paper #langchain #tool-use

Paper Overview

“ReAct: Synergizing Reasoning and Acting in Language Models”
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). Princeton University / Google Brain. Published at the International Conference on Learning Representations (ICLR 2023).

arxiv.org/abs/2210.03629

Why it matters: ReAct is the reasoning framework underlying most production AI agent systems. LangChain’s create_react_agent, LlamaIndex’s ReActAgent, and AutoGPT all implement variations of ReAct. Understanding the paper explains why agents reason the way they do.


The Problem: Reasoning Without Acting, Acting Without Reasoning

In 2022, two separate lines of research existed:

  1. Chain-of-Thought (CoT) prompting — LLMs reason step by step but only operate on their internal knowledge. They can’t look things up or take actions.

  2. Action-only agents (e.g., MRKL-style systems, WebGPT) — LLMs call external tools but with minimal explicit reasoning. They often fail on complex multi-step tasks.

The authors’ hypothesis: interleaving reasoning traces with actions creates a synergy better than either alone.


The ReAct Framework

ReAct generates both reasoning traces and task-specific actions in an interleaved manner:

Thought: [reasoning about the current state and what to do next]
Action: [tool_name][input]
Observation: [result from the tool]
... (repeat)
Thought: I now have enough information to answer
Final Answer: [answer]
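The loop above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `fake_llm`, the `search` tool, and the canned strings are stand-ins for a real model and a real retriever.

```python
import re

# Toy sketch of the ReAct loop: alternate model output with tool calls
# until the model emits "Final Answer:". All names here are illustrative.

def search(query: str) -> str:
    """Stand-in tool: returns a canned observation string."""
    return f"Result for '{query}'"

TOOLS = {"Search": search}

def fake_llm(transcript: str) -> str:
    """Stand-in for a real model: emits one Thought (+ Action) per call."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: Search[Apple Remote]"
    return "Thought: I now have enough information.\nFinal Answer: Front Row"

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if match:
            tool_name, tool_input = match.groups()
            observation = TOOLS[tool_name](tool_input)
            transcript += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted"
```

The essential structure is just this: generate, parse for an action, execute it, append the observation, repeat.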

Key insight: The Thought step serves a dual purpose:

  1. Reasoning — helps the LLM decompose problems, track progress, handle exceptions
  2. Grounding — makes the action choice interpretable (and debuggable)

The Observation grounds the LLM’s next thought in real-world information, preventing hallucination.


A Concrete Example from the Paper

Task: “Aside from the Apple Remote, what other devices can control the program Apple Remote was originally designed to work with?”

Without ReAct (pure CoT):

Apple Remote was originally designed for Front Row.
Front Row can be controlled by: keyboard, mouse, Apple Remote.
Other devices: keyboard, mouse.

WRONG — the model hallucinated. It doesn’t know the real answer.

With ReAct:

Thought: I need to find what Apple Remote was originally designed to work with.
Action: Search[Apple Remote]
Observation: The Apple Remote is a remote control device. It was initially designed
             to work with the Front Row media application on Mac computers.

Thought: Now I need to find what other devices can control Front Row.
Action: Search[Front Row controls]
Observation: Front Row can be controlled using an Apple Remote or the keyboard.

Thought: So besides Apple Remote, keyboards can control Front Row.
Action: Finish[keyboard]

CORRECT — grounded reasoning with real retrieved information.


What the Paper Tested

The authors evaluated ReAct on three benchmarks:

1. HotpotQA (Multi-hop question answering)

Questions requiring 2-3 Wikipedia lookups and reasoning across them.

| Method | EM Score |
| --- | --- |
| Act only (no reasoning) | 25.7 |
| CoT only (no tools) | 29.4 |
| ReAct | 27.4 |
| ReAct + CoT (best-of-n) | 35.1 |

2. FEVER (Fact verification)

Verify whether a claim is Supported, Refuted, or Not Enough Info using Wikipedia.

| Method | Accuracy |
| --- | --- |
| Act only | 58.9% |
| CoT only | 56.3% |
| ReAct | 60.9% |

3. ALFWorld (Text-based game)

Interactive household tasks: “find a soapbar and clean it”

| Method | Success Rate |
| --- | --- |
| Act only | 45% |
| ReAct | 71% |

The biggest gains were in environments requiring multi-step planning with real-world state.


Why Interleaving Works: The Paper’s Analysis

The authors identify three key failure modes that ReAct addresses:

1. Hallucination — pure CoT models “remember” facts that may be wrong. ReAct grounds each reasoning step in retrieved information.

2. Dead loops — action-only agents repeat the same search when stuck. Explicit reasoning helps the agent recognize the loop and try a different approach.

3. Error propagation — without reasoning, a wrong action leads silently to a wrong answer. ReAct’s Thought traces make errors visible and allow self-correction.


How Modern Frameworks Implement ReAct

LangChain

from langchain.agents import create_react_agent, AgentExecutor
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain.tools import Tool

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LangChain provides a standard ReAct prompt template
react_prompt = hub.pull("hwchase17/react")
# Prompt includes: "Thought:", "Action:", "Action Input:", "Observation:" format

tools = [
    Tool(
        name="search",
        func=lambda q: f"Search result for: {q}",
        description="Search the web for current information",
    )
]

agent = create_react_agent(llm, tools, react_prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "What is the current price of Bitcoin?"})
# Prints the full Thought/Action/Observation chain

LlamaIndex

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers. Returns the product."""
    return a * b

tool = FunctionTool.from_defaults(fn=multiply)

agent = ReActAgent.from_tools(
    [tool],
    llm=OpenAI(model="gpt-4o-mini"),
    verbose=True,  # shows Thought/Action/Observation
    max_iterations=10,
)

response = agent.chat("What is 7 multiplied by 6?")
print(response)  # final answer after the Thought/Action/Observation loop

Both implementations generate the Thought → Action → Observation loop described in the paper, but abstract the parsing and tool dispatch from the developer.


Limitations Identified in the Paper

Context length — As reasoning chains grow, they consume more of the context window. Long multi-step tasks can hit limits.

Hallucination still possible — ReAct reduces but doesn’t eliminate hallucination. The reasoning trace itself can be wrong.

Not always better than CoT — For tasks where the LLM has accurate internal knowledge, the added latency of tool calls isn’t worth it. ReAct shines when external information is needed.

Depends on good search — The quality of tool results directly impacts reasoning quality. Bad retrieval → bad answers, even with perfect reasoning.
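A common mitigation for the context-length limitation is to trim or summarize older steps before each model call. A naive sketch, assuming the transcript is a list of strings with the question first (real systems often summarize dropped steps rather than discard them):

```python
# Naive sketch of one context-length mitigation: keep the original
# question plus only the most recent steps, replacing older ones with
# a placeholder line. The step format here is an assumption.

def trim_transcript(steps: list[str], keep_last: int = 4) -> list[str]:
    """steps[0] is the question; keep it plus the last `keep_last` steps."""
    if len(steps) <= keep_last + 1:
        return steps
    dropped = len(steps) - keep_last - 1
    return [steps[0], f"[... {dropped} earlier steps omitted ...]"] + steps[-keep_last:]
```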


What Came After ReAct

ReAct was influential enough to spawn several extensions:

| Paper | Extension |
| --- | --- |
| Reflexion (2023) | Adds verbal self-reflection to improve from past mistakes |
| Plan-and-Solve (2023) | Explicit planning before acting |
| Tree of Thoughts (2023) | Branching thought exploration |
| OpenHands/SWE-Agent (2024) | ReAct applied to software engineering tasks |

All of these build on the core ReAct insight: explicit reasoning + external grounding = more reliable agents.


Frequently Asked Questions

Is ReAct still the best approach for agents?

For general-purpose agents, largely yes: most production systems use ReAct or a derivative. For software engineering tasks, however, code-generation agents (like OpenHands) that write and execute code directly often outperform pure ReAct.

What’s the difference between ReAct and function calling?

ReAct is a prompting strategy — the reasoning format is in the prompt and the LLM generates “Action: tool_name” as text. Function calling (OpenAI, Anthropic) is a native model capability where the LLM returns structured JSON for tool calls. Modern frameworks use function calling under the hood but still follow ReAct-style reasoning.
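The contrast can be made concrete. Below, the same tool call is shown as ReAct-style text (which a framework must parse) and as the structured payload a function-calling model returns; both shapes are illustrative simplifications of real provider APIs.

```python
import json

# The same tool call in two shapes. With text-based ReAct, the framework
# must parse the action out; with function calling, the name and
# arguments arrive as structured fields.

react_style = (
    "Thought: I need the current price.\n"
    "Action: search\n"
    "Action Input: bitcoin price"
)

function_call_style = {
    "name": "search",
    "arguments": json.dumps({"query": "bitcoin price"}),
}

# Nothing to parse: just decode the arguments field.
args = json.loads(function_call_style["arguments"])
```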

How do I make ReAct agents more reliable?

  1. Write clear tool descriptions (the LLM decides which tool to use based on these)
  2. Set max_iterations to prevent infinite loops
  3. Use temperature=0 for tool selection
  4. Add explicit error handling instructions in the system prompt
  5. Use GPT-4o or Claude over gpt-4o-mini for complex multi-step tasks

Can I see the Thought traces in production?

Yes — most frameworks have a verbose=True flag. In production, log these traces for debugging without showing them to end users. The reasoning chain is often the most useful debugging artifact when agents fail.
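One way to capture traces without surfacing them to users is to route each step through a standard logger. The hook below is hypothetical; real frameworks expose similar mechanisms (e.g. LangChain's callback API) rather than this exact function.

```python
import io
import logging

# Sketch of logging Thought/Action/Observation traces in production.
# `log_step` is a hypothetical hook you would call once per agent step.

logger = logging.getLogger("agent.trace")

def log_step(step_type: str, content: str) -> None:
    """Call once per Thought / Action / Observation in the agent loop."""
    logger.info("%s: %s", step_type, content)

# Demo: capture the trace in a string buffer standing in for a log sink.
buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(buffer))
logger.setLevel(logging.INFO)
log_step("Thought", "I need to find the price first.")
```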

