Paper Overview
“ReAct: Synergizing Reasoning and Acting in Language Models” — Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2022). Princeton University / Google Brain. Published at the International Conference on Learning Representations (ICLR 2023).
Why it matters: ReAct is the reasoning framework underlying most production AI agent systems. LangChain’s create_react_agent, LlamaIndex’s ReActAgent, and AutoGPT all implement variations of ReAct. Understanding the paper explains why agents reason the way they do.
The Problem: Reasoning Without Acting, Acting Without Reasoning
In 2022, two separate lines of research existed:
- Chain-of-Thought (CoT) prompting — LLMs reason step by step but operate only on their internal knowledge. They can’t look things up or take actions.
- Action-only agents (e.g., MRKL, WebGPT) — LLMs call external tools but with minimal explicit reasoning. They often fail on complex multi-step tasks.
The authors’ hypothesis: interleaving reasoning traces with actions creates a synergy better than either alone.
The ReAct Framework
ReAct generates both reasoning traces and task-specific actions in an interleaved manner:
```
Thought: [reasoning about the current state and what to do next]
Action: tool_name[input]
Observation: [result from the tool]
... (repeat)
Thought: I now have enough information to answer
Final Answer: [answer]
```
Key insight: The Thought step serves a dual purpose:
- Reasoning — helps the LLM decompose problems, track progress, handle exceptions
- Grounding — makes the action choice interpretable (and debuggable)
The Observation grounds the LLM’s next thought in real-world information, preventing hallucination.
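The loop above is simple enough to sketch in a few lines. This is a minimal, illustrative implementation — the tool registry is a stub and `llm_step` stands in for a real LLM call; none of these names come from the paper:

```python
import re

# Toy tool registry standing in for real search APIs (names are illustrative).
TOOLS = {
    "Search": lambda q: f"(stub result) article text about {q}",
}

def parse_action(step: str):
    """Pull the tool name and input out of a line like 'Action: Search[Apple Remote]'."""
    match = re.search(r"Action:\s*(\w+)\[(.*?)\]", step)
    if match is None:
        raise ValueError(f"no Action line found in: {step!r}")
    return match.group(1), match.group(2)

def react_loop(llm_step, max_iterations=5):
    """Run the Thought/Action/Observation loop until Finish[...] or the cap.

    llm_step(transcript) stands in for an LLM call: given the transcript so
    far, it returns the next 'Thought: ...' + 'Action: tool[input]' chunk.
    """
    transcript = ""
    for _ in range(max_iterations):
        step = llm_step(transcript)
        tool, arg = parse_action(step)
        if tool == "Finish":
            return arg  # the agent decided it has the answer
        observation = TOOLS[tool](arg)
        # Append the step and its observation so the next thought is grounded.
        transcript += f"{step}\nObservation: {observation}\n"
    return None  # hit the iteration cap without a final answer

# Scripted stand-in for the model, replaying the paper's Apple Remote example:
script = iter([
    "Thought: Find what the Apple Remote was designed for.\nAction: Search[Apple Remote]",
    "Thought: Now find what else controls Front Row.\nAction: Search[Front Row]",
    "Thought: Keyboards can also control Front Row.\nAction: Finish[keyboard]",
])
answer = react_loop(lambda transcript: next(script))  # -> "keyboard"
```

Real frameworks add prompt templating, retries, and error handling around this same skeleton.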
A Concrete Example from the Paper
Task: “Aside from the Apple Remote, what other devices can control the program Apple Remote was originally designed to work with?”
Without ReAct (pure CoT):
```
Apple Remote was originally designed for Front Row.
Front Row can be controlled by: keyboard, mouse, Apple Remote.
Other devices: keyboard, mouse.
```
WRONG — the model hallucinated. It doesn’t know the real answer.
With ReAct:
```
Thought: I need to find what Apple Remote was originally designed to work with.
Action: Search[Apple Remote]
Observation: The Apple Remote is a remote control device. It was initially designed
to work with the Front Row media application on Mac computers.
Thought: Now I need to find what other devices can control Front Row.
Action: Search[Front Row controls]
Observation: Front Row can be controlled using an Apple Remote or the keyboard.
Thought: So besides Apple Remote, keyboards can control Front Row.
Action: Finish[keyboard]
```
CORRECT — grounded reasoning with real retrieved information.
What the Paper Tested
The authors evaluated ReAct on three benchmarks:
1. HotpotQA (Multi-hop question answering)
Questions requiring 2-3 Wikipedia lookups and reasoning across them.
| Method | EM Score |
|---|---|
| Act only (no reasoning) | 25.7 |
| CoT only (no tools) | 29.4 |
| ReAct | 27.4 |
| ReAct → CoT-SC (combined) | 35.1 |
2. FEVER (Fact verification)
Verify whether a claim is Supported, Refuted, or Not Enough Info using Wikipedia.
| Method | Accuracy |
|---|---|
| Act only | 58.9% |
| CoT only | 56.3% |
| ReAct | 60.9% |
3. ALFWorld (Text-based game)
Interactive household tasks: “find a soapbar and clean it”
| Method | Success Rate |
|---|---|
| Act only | 45% |
| ReAct | 71% |
The biggest gains were in environments requiring multi-step planning with real-world state.
Why Interleaving Works: The Paper’s Analysis
The authors identify three key failure modes that ReAct addresses:
1. Hallucination — pure CoT models “remember” facts that may be wrong. ReAct grounds each reasoning step in retrieved information.
2. Dead loops — action-only agents repeat the same search when stuck. Explicit reasoning helps the agent recognize the loop and try a different approach.
3. Error propagation — without reasoning, a wrong action leads silently to a wrong answer. ReAct’s Thought traces make errors visible and allow self-correction.
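The dead-loop failure mode can also be guarded against mechanically. A hypothetical helper (not from the paper) that a framework could use to detect a stuck agent:

```python
def is_dead_loop(actions, window=3):
    """Return True when the last `window` actions are identical, e.g. the
    same search repeated three times. A framework can then inject a hint
    such as 'that search already failed; try different keywords' into the
    prompt, rather than letting the agent burn its iteration budget."""
    recent = actions[-window:]
    return len(recent) == window and len(set(recent)) == 1
```

For example, `is_dead_loop(["Search[x]", "Search[x]", "Search[x]"])` returns `True`, while a trace that varies its queries does not trip the check.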
How Modern Frameworks Implement ReAct
LangChain
```python
from langchain.agents import create_react_agent, AgentExecutor
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain.tools import Tool

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LangChain provides a standard ReAct prompt template.
# It includes the "Thought:", "Action:", "Action Input:", "Observation:" format.
react_prompt = hub.pull("hwchase17/react")

tools = [
    Tool(
        name="search",
        func=lambda q: f"Search result for: {q}",
        description="Search the web for current information",
    )
]

agent = create_react_agent(llm, tools, react_prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "What is the current price of Bitcoin?"})
# Prints the full Thought/Action/Observation chain
```
LlamaIndex
```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers. Returns the product."""
    return a * b

tool = FunctionTool.from_defaults(fn=multiply)

agent = ReActAgent.from_tools(
    [tool],
    llm=OpenAI(model="gpt-4o-mini"),
    verbose=True,  # shows Thought/Action/Observation
    max_iterations=10,
)

response = agent.chat("What is 12.5 times 8?")
```
Both implementations generate the Thought → Action → Observation loop described in the paper, but abstract the parsing and tool dispatch from the developer.
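For reference, the `hwchase17/react` hub prompt roughly follows the shape below. This is a paraphrase from the published template, not a verbatim copy — pull the real one with `hub.pull` in production; the `{...}` placeholders are filled in by LangChain at runtime:

```python
# Approximate contents of the "hwchase17/react" prompt (paraphrased).
REACT_TEMPLATE = """\
Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought: {agent_scratchpad}"""
```

Note how closely this mirrors the paper's trace format: the prompt teaches the model the loop, and the framework's output parser enforces it.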
Limitations Identified in the Paper
Context length — As reasoning chains grow, they consume more of the context window. Long multi-step tasks can hit limits.
Hallucination still possible — ReAct reduces but doesn’t eliminate hallucination. The reasoning trace itself can be wrong.
Not always better than CoT — For tasks where the LLM has accurate internal knowledge, the added latency of tool calls isn’t worth it. ReAct shines when external information is needed.
Depends on good search — The quality of tool results directly impacts reasoning quality. Bad retrieval → bad answers, even with perfect reasoning.
What Came After ReAct
ReAct was influential enough to spawn several extensions:
| Paper | Extension |
|---|---|
| Reflexion (2023) | Adds verbal self-reflection to improve from past mistakes |
| Plan-and-Solve (2023) | Explicit planning before acting |
| Tree of Thoughts (2023) | Branching thought exploration |
| OpenHands/SWE-Agent (2024) | ReAct applied to software engineering tasks |
All of these build on the core ReAct insight: explicit reasoning + external grounding = more reliable agents.
Frequently Asked Questions
Is ReAct still the best approach for agents?
For general-purpose tool-using agents: yes, most production systems use ReAct or a derivative. On software engineering tasks, however, agents that write and execute code directly (such as OpenHands) often outperform pure ReAct.
What’s the difference between ReAct and function calling?
ReAct is a prompting strategy — the reasoning format is in the prompt and the LLM generates “Action: tool_name” as text. Function calling (OpenAI, Anthropic) is a native model capability where the LLM returns structured JSON for tool calls. Modern frameworks use function calling under the hood but still follow ReAct-style reasoning.
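The difference is easiest to see in the raw payloads. A sketch — the dict below follows the general shape of an OpenAI tool call, simplified for illustration:

```python
import json

# ReAct-style: the action lives in free text that the framework must
# recover with a regex or output parser.
react_style = (
    "Thought: I need current data.\n"
    "Action: search\n"
    "Action Input: price of Bitcoin"
)

# Function-calling style: the model emits structured arguments directly,
# so no text parsing is needed (simplified shape of an OpenAI tool call).
function_call_style = {
    "name": "search",
    "arguments": json.dumps({"query": "price of Bitcoin"}),
}

query = json.loads(function_call_style["arguments"])["query"]
```

Parsing free text is brittle (a missing bracket breaks the agent), which is a major reason frameworks moved their dispatch layer to native function calling while keeping the ReAct reasoning loop.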
How do I make ReAct agents more reliable?
- Write clear tool descriptions (the LLM decides which tool to use based on these)
- Set max_iterations to prevent infinite loops
- Use temperature=0 for tool selection
- Add explicit error handling instructions in the system prompt
- Use GPT-4o or Claude over gpt-4o-mini for complex multi-step tasks
Can I see the Thought traces in production?
Yes — most frameworks have a verbose=True flag. In production, log these traces for debugging without showing them to end users. The reasoning chain is often the most useful debugging artifact when agents fail.
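A minimal production pattern (helper and logger names are illustrative, not from any framework): record each step server-side at debug level so the trace is available when an agent fails, without surfacing it to users.

```python
import logging

logger = logging.getLogger("react_agent")

def log_step(step_no: int, thought: str, action: str, observation: str) -> None:
    """Record one ReAct step for post-hoc debugging (server-side only)."""
    logger.debug("step=%d thought=%r action=%r observation=%r",
                 step_no, thought, action, observation)
```

Wiring this into whatever callback or verbose hook your framework exposes gives you replayable traces without changing the user-facing output.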
Next Steps
- Chain of Thought Prompting Explained — The CoT foundation ReAct builds on
- What Is an AI Agent? — Conceptual overview
- LangChain Agents and Tools — Build a ReAct agent in LangChain