Paper Overview
“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Google Brain. Advances in Neural Information Processing Systems (NeurIPS 2022).
Why it matters: Chain-of-Thought (CoT) is the most widely used prompting technique in AI development today. Every agent that “thinks step by step” uses CoT. It’s the bedrock on which ReAct, Tree of Thoughts, and almost all advanced prompting methods are built.
The Problem: LLMs Fail at Multi-Step Reasoning
Before this paper, it was well-known that LLMs struggled with problems requiring multiple reasoning steps:
Question: "The cafeteria had 23 apples. If they used 20 to make lunch
and bought 6 more, how many apples do they have?"
Standard prompt answer: 27 ← WRONG
Correct answer: 9 (23 − 20 + 6)
For more complex arithmetic, commonsense reasoning, and symbolic manipulation, large LLMs would fail even though they “knew” the individual steps. The issue: they couldn’t chain intermediate steps together.
The Key Finding: Just Show the Reasoning
The paper’s central contribution is almost embarrassingly simple:
If you include a reasoning chain in your few-shot examples, the model learns to produce reasoning chains too.
# Standard few-shot prompting (no CoT):
"""
Q: Roger has 5 tennis balls. He buys 2 cans. Each can has 3 balls.
How many balls does he have now?
A: 11
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many do they have?
A: ???
"""
# Chain-of-Thought prompting:
"""
Q: Roger has 5 tennis balls. He buys 2 cans. Each can has 3 balls.
How many balls does he have now?
A: Roger starts with 5 balls. 2 cans × 3 balls = 6 balls from the cans.
5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many do they have?
A: ???
"""
With CoT examples in the prompt, the model generates its own intermediate reasoning steps. This dramatically improves accuracy.
Results: Scale is Critical
One of the paper’s most important findings: CoT only helps at scale.
Results on GSM8K (grade school math word problems):
| Model | Standard Prompting | CoT Prompting |
|---|---|---|
| GPT-2 (1.5B) | 1.0% | 1.1% |
| GPT-3 6.7B | 1.4% | 1.3% |
| GPT-3 175B | 15.6% | 46.9% |
| PaLM 540B | 17.9% | 56.9% |
The inflection point is around 100B parameters. Below that, CoT actually hurts — the model generates plausible-sounding but wrong reasoning. Above it, CoT unlocks reasoning capabilities that weren’t accessible through standard prompting.
This finding has important implications: CoT is not a free lunch for small models. If you’re using a 7B parameter local model, CoT may not help — or may even be counterproductive.
Two Forms of CoT: Few-Shot and Zero-Shot
Few-Shot CoT (Original Paper)
The original technique: provide hand-crafted examples with reasoning:
prompt_template = """Solve math problems step by step.

Q: There are 15 trees in the grove. Grove workers will plant trees today.
After they are done, there will be 21 trees. How many trees did the workers plant?
A: We start with 15 trees. After planting, there are 21. So 21 - 15 = 6 trees were planted. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the lot?
A: There are 3 cars originally. 2 more arrive. 3 + 2 = 5 cars. The answer is 5.

Q: {question}
A:"""
Zero-Shot CoT (“Let’s think step by step”)
A follow-up paper (Kojima et al., 2022) showed that simply appending “Let’s think step by step” works almost as well as few-shot CoT — without needing hand-crafted examples:
from openai import OpenAI

def ask_with_cot(question: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"{question}\n\nLet's think step by step."
            }
        ]
    )
    return response.choices[0].message.content

result = ask_with_cot(
    "A bat and ball cost $1.10 in total. The bat costs $1 more than the ball. "
    "How much does the ball cost?"
)
# Without CoT: $0.10 (wrong — the intuitive but incorrect answer)
# With CoT: "The ball costs x. The bat costs x + 1. Together: 2x + 1 = 1.10, so x = 0.05. The ball costs $0.05." (correct)
The “Let’s think step by step” phrase works because it’s pattern-matched from training data where this phrasing precedes careful reasoning.
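Kojima et al. actually use two passes: the first prompt elicits the reasoning, and a second prompt replays that reasoning with an answer-extraction cue ("Therefore, the answer (arabic numerals) is"). A sketch of the two prompt builders (the function names are mine):

```python
def reasoning_prompt(question: str) -> str:
    # Pass 1: trigger the chain of thought.
    return f"Q: {question}\nA: Let's think step by step."

def extraction_prompt(question: str, reasoning: str) -> str:
    # Pass 2: replay the chain and cue the model to state just the answer.
    return (reasoning_prompt(question) + " " + reasoning.strip()
            + "\nTherefore, the answer (arabic numerals) is")
```

In use, you send `reasoning_prompt` to the model, take the completion as `reasoning`, then send `extraction_prompt`; the second completion is the bare answer, which avoids the parsing fragility of scraping it out of free-form text.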
Why CoT Works: The Authors’ Explanation
The paper offers several hypotheses:
- Computation budget — sequential reasoning steps let the model spend more computation (more tokens) before committing to an answer; the number of forward passes scales with output length.
- Interpretability of intermediate steps — each reasoning step can be read and verified, so wrong steps become visible rather than hidden inside a single opaque answer.
- Training data alignment — math textbooks, coding tutorials, and logical arguments in the training data naturally contain intermediate steps, and CoT exemplars trigger these learned patterns.
Limitations
- Faulty reasoning — CoT chains can contain logical errors that lead to wrong answers; the model sounds confident while the logic is flawed.
- Hallucinated reasoning — the model may generate a plausible-sounding chain that "looks right" but rests on incorrect facts.
- Task-specific — CoT helps most with arithmetic, symbolic reasoning, and commonsense reasoning. For simple factual recall it often doesn't help and wastes tokens.
- Small model performance — as noted above, CoT degrades performance in sub-100B models.
Extensions That Built on CoT
| Technique | Innovation |
|---|---|
| Self-Consistency (Wang et al., 2022) | Sample multiple CoT paths, take majority vote |
| Least-to-Most Prompting (Zhou et al., 2022) | Decompose into sub-problems, solve sequentially |
| Tree of Thoughts (Yao et al., 2023) | Explore multiple reasoning branches, backtrack |
| ReAct (Yao et al., 2022) | Combine CoT with tool use |
| Program of Thoughts (Chen et al., 2022) | Generate Python code instead of natural language reasoning |
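To give a flavor of the last row: instead of prose, a Program-of-Thoughts model emits executable code whose result is the answer, so the arithmetic is done by the interpreter rather than in text. For the bat-and-ball problem, a generated snippet might look like this (hand-written here for illustration, not actual model output):

```python
# Constraints: bat + ball = 1.10 and bat = ball + 1.00
# Substitute: ball + 1.00 + ball = 1.10, so 2 * ball = 0.10
ball = (1.10 - 1.00) / 2
bat = ball + 1.00
answer = ball  # running the code replaces "doing arithmetic in text"
```

Executing the program yields 0.05, sidestepping the token-level arithmetic mistakes that plain CoT can still make.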
Practical Takeaways for Developers
# 1. For complex tasks, always use CoT
system = """When solving problems:
1. Identify what information you have
2. List the steps needed
3. Work through each step explicitly
4. Verify the answer makes sense"""
# 2. For structured output, request thinking first
prompt = """
Question: {question}
First, think through this step by step, then provide your answer.
Thinking:
[your reasoning here]
Answer:
[final answer here]
"""
# 3. For very complex reasoning, use a "scratchpad" approach
prompt = """
{question}

Work through the problem inside <scratchpad> tags first,
then give your final answer after the closing tag.
"""
# 4. Self-consistency: for high-stakes decisions, sample multiple times
from collections import Counter

def self_consistent_answer(question: str, n: int = 5) -> str:
    # Sample n independent CoT completions (requires temperature > 0)
    answers = [ask_with_cot(question) for _ in range(n)]
    # Extract final answers and take the majority
    # (implementation depends on answer format)
    final_answers = [a.split("Answer:")[-1].strip() for a in answers]
    return Counter(final_answers).most_common(1)[0][0]
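The voting step can be exercised without any API calls by mocking the sampled completions (the strings below are made up for illustration):

```python
from collections import Counter

# Three hypothetical CoT samples for the bat-and-ball question;
# one of them takes the intuitive-but-wrong path.
samples = [
    "2x + 1 = 1.10, so x = 0.05. Answer: $0.05",
    "The bat is $1, so the ball is the rest. Answer: $0.10",
    "x + (x + 1) = 1.10 gives x = 0.05. Answer: $0.05",
]
finals = [s.split("Answer:")[-1].strip() for s in samples]
majority = Counter(finals).most_common(1)[0][0]  # "$0.05"
```

The correct answer wins 2-to-1 even though one sample went wrong, which is exactly the effect self-consistency relies on: independent reasoning paths tend to agree on the right answer more often than on any particular wrong one.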
Frequently Asked Questions
Does CoT work with every model?
Reliably with GPT-4 class models (and Claude, Gemini equivalents). Partial improvement with GPT-3.5. Unreliable with smaller models. Always test on your specific model.
Should I always add “Let’s think step by step”?
For complex multi-step reasoning: yes. For simple lookups or factual questions: no — it adds tokens without benefit. For creative tasks: sometimes, when you want structured output.
Is CoT the same as “scratchpad” prompting?
Very similar. “Scratchpad” typically refers to a designated space for reasoning before the final answer. CoT is the broader technique of generating intermediate reasoning. Both encourage the same behavior.
What is “self-consistency” and when should I use it?
Self-consistency means running the same CoT prompt multiple times (with temperature > 0) and taking the majority answer. It improves accuracy by 10–20% on math benchmarks at the cost of N× API calls. Use for high-stakes decisions where latency and cost allow.
How does CoT relate to “thinking” in Claude 3.7 Sonnet and o1-series models?
Claude’s extended thinking and OpenAI’s o1/o3 models generate reasoning tokens (a form of CoT) internally before producing the response. The reasoning is done at inference time by the model itself, not prompted — but the underlying mechanism is the same: chain-of-thought reasoning.
Next Steps
- ReAct Paper Explained — How CoT became agentic
- Prompt Engineering for AI Agents — Apply CoT in production
- What Is an AI Agent? — The broader picture