
Toolformer Explained: Teaching LLMs to Use Tools

#toolformer #tool-use #self-supervised #paper #meta #function-calling #agents

Paper Overview

“Toolformer: Language Models Can Teach Themselves to Use Tools”
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T. (2023). Meta AI Research. Advances in Neural Information Processing Systems (NeurIPS 2023).

arxiv.org/abs/2302.04761

Why it matters: Before Toolformer, teaching an LLM to use tools required human-annotated examples of when and how to call each tool. Toolformer showed that LLMs can generate their own training data for tool use — a self-supervised approach that scales to any tool without manual labeling. This work directly influenced the development of function calling in GPT-4, Claude, and similar APIs.


The Core Problem

In 2022, LLMs were already impressive at language tasks but had clear gaps:

  • They couldn’t do arithmetic reliably
  • Their knowledge had a cutoff date
  • They couldn’t look up specific facts
  • They couldn’t translate accurately for rare languages

The obvious solution: give the model access to a calculator, a search engine, a calendar, and a translation API. But how do you teach the model when to use these tools?

The naive approach: human annotators label thousands of examples — “here, in this sentence, the model should call the calculator.” This is expensive and doesn’t scale.

Toolformer’s insight: LLMs are good enough at generating text that they can create their own training data for tool use.


The Self-Supervised Training Pipeline

Step 1: Sample Candidate Tool Insertions

Given a dataset of text, the model is prompted to suggest where API calls could be inserted to help predict what comes next:

Original: "... I bought 3 pencils at $0.75 each and paid with a $5 bill.
            My change was $..."

Model generates candidates:
"... I bought 3 pencils at $0.75 each and paid with a $5 bill.
 My change was [Calculator(5 - 3 * 0.75) →] $..."

The notation [API_name(input) →] is inserted where the model thinks an API call would help.
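This bracket notation is easy to handle mechanically. A minimal sketch of extracting candidate calls from annotated text (the regex and helper name are illustrative, not from the paper):

```python
import re

# Matches the candidate notation "[Tool(input) →]" before execution.
# The tool name, parentheses, and "→" marker mirror the paper's format;
# the regex itself is an illustrative simplification.
CANDIDATE_RE = re.compile(r"\[(\w+)\((.*?)\)\s*→\]")

def parse_candidate(text: str):
    """Extract (tool_name, tool_input, span) for each candidate call."""
    return [(m.group(1), m.group(2), m.span()) for m in CANDIDATE_RE.finditer(text)]

calls = parse_candidate("My change was [Calculator(5 - 3 * 0.75) →] $...")
print(calls[0][:2])
# → ('Calculator', '5 - 3 * 0.75')
```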

Step 2: Execute the API Calls

For each candidate insertion, execute the actual API and get the result:

[Calculator(5 - 3 * 0.75) →] executed → 2.75

Text becomes:
"... My change was [Calculator(5 - 3 * 0.75) → 2.75] $..."
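Executing a candidate and splicing the result back into the text can be sketched as follows. The helper names are illustrative, and the tiny AST-based evaluator stands in for a real Calculator API:

```python
import ast
import operator

# Minimal safe arithmetic evaluator standing in for the Calculator API.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str):
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def execute_call(text: str) -> str:
    """Replace '[Calculator(expr) →]' with '[Calculator(expr) → result]'."""
    start = text.index("[Calculator(")
    end = text.index(") →]", start)
    expr = text[start + len("[Calculator("):end]
    result = calc(expr)
    return text[:end + len(") →")] + f" {result}]" + text[end + len(") →]"):]

print(execute_call("My change was [Calculator(5 - 3 * 0.75) →] $..."))
# → My change was [Calculator(5 - 3 * 0.75) → 2.75] $...
```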

Step 3: Filter: Does the API Call Actually Help?

This is the key insight. The model predicts the next tokens in two ways:

  1. Without the API call
  2. With the API call result included

If the API result reduces perplexity (makes the continuation more predictable), the example is kept. If not, it’s discarded.

# Pseudocode for filtering (the paper uses a weighted cross-entropy loss)
def should_keep_api_call(text_before: str, api_call: str, api_result: str,
                         text_after: str) -> bool:
    # Loss = how "surprised" the model is by the continuation text_after
    loss_plain = lm_loss(text_before, text_after)                 # no call at all
    loss_call_only = lm_loss(text_before + api_call, text_after)  # call, no result
    loss_with_result = lm_loss(text_before + api_call + api_result, text_after)

    # Keep only if the call *with its result* beats the better of the two
    # baselines by at least the filtering threshold tau_f (a hyperparameter)
    tau_f = 1.0
    return min(loss_plain, loss_call_only) - loss_with_result >= tau_f

This filter ensures the training data only contains useful API calls — not gratuitous ones.
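With the model's loss stubbed out, the filter's behavior can be demonstrated concretely. In the paper, the baseline for comparison is the better of two options: no call at all, or the call without its result. The loss numbers below are invented for illustration:

```python
def filter_calls(candidates, tau: float = 1.0):
    """Keep a candidate only if including its result lowers the continuation
    loss by at least tau versus the best no-result baseline."""
    kept = []
    for name, loss_plain, loss_call_only, loss_with_result in candidates:
        if min(loss_plain, loss_call_only) - loss_with_result >= tau:
            kept.append(name)
    return kept

# Invented losses: the calculator result helps a lot; the calendar call doesn't.
candidates = [
    ("Calculator(5 - 3 * 0.75)", 4.2, 4.0, 1.5),  # kept: 4.0 - 1.5 = 2.5 >= 1.0
    ("Calendar()",               3.1, 3.2, 2.9),  # dropped: 3.1 - 2.9 = 0.2 < 1.0
]
print(filter_calls(candidates))
# → ['Calculator(5 - 3 * 0.75)']
```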

Step 4: Fine-tune on Filtered Dataset

The model is fine-tuned on the resulting dataset, where useful API calls are integrated naturally into text. The model learns:

  • Which situations call for which tools
  • How to format API calls correctly
  • When NOT to use tools

Tools Used in the Paper

The paper demonstrates Toolformer on 5 tools:

| Tool | API Call Format | Purpose |
| --- | --- | --- |
| Calculator | [Calculator(3 + 5) → 8] | Arithmetic |
| Wikipedia Search | [Wikipedia(GPT-4) → ...] | Factual lookup |
| Machine Translation | [MT(Hola Mundo, es) → Hello World] | Translation |
| Calendar | [Calendar() → Tuesday, April 8] | Current date |
| Question Answering | [QA(capital of France) → Paris] | Direct QA |

During inference, when the model generates an API call token sequence, execution is triggered, the result is inserted, and generation continues.
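That inference loop can be sketched with the model stubbed out. A real implementation would decode token by token and pause at the "→" marker; here the stub emits prewritten chunks, and the tool registry is illustrative:

```python
import re

# Illustrative tool registry; a real system would wire in actual APIs.
TOOLS = {"Calculator": lambda expr: str(eval(expr))}  # eval is unsafe; demo only

CALL_RE = re.compile(r"\[(\w+)\((.*?)\)\s*→\]")

def generate_with_tools(model_step, prompt: str) -> str:
    """Run generation; whenever the model emits '[Tool(input) →]',
    execute the tool, splice in the result, and continue generating."""
    text = prompt
    while True:
        chunk = model_step(text)
        if chunk is None:  # model finished
            return text
        text += chunk
        m = CALL_RE.search(text)
        if m:
            result = TOOLS[m.group(1)](m.group(2))
            # Close the call as '[Tool(input) → result]' and keep going.
            text = text[:m.end() - 1] + " " + result + "]" + text[m.end():]

# Stubbed "model": emits a tool call, then continues using the spliced result.
script = iter(["My change was [Calculator(5 - 3 * 0.75) →]", " $2.75.", None])
out = generate_with_tools(lambda text: next(script), "")
print(out)
# → My change was [Calculator(5 - 3 * 0.75) → 2.75] $2.75.
```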


Key Results

Toolformer outperforms much larger models on several benchmarks:

| Task | GPT-3 (175B) | OPT (66B) | Toolformer (6.7B) |
| --- | --- | --- | --- |
| Math QA (ASDiv) | 14.0% | 4.5% | 40.4% |
| SVAMP (arithmetic) | 12.0% | 4.0% | 29.4% |
| MSNBC (date reasoning) | 27.4% | 4.3% | 51.8% |
| WikiQA (QA) | 65.2% | 46.5% | 68.1% |

Toolformer (6.7B parameters) outperforms GPT-3 (175B parameters) on arithmetic tasks — a 26x smaller model winning because it can use a calculator, while GPT-3 has to do the arithmetic in its weights.


What Toolformer Got Right

Self-supervised data generation — No human annotation needed. This scales to any tool.

Deciding when to use tools — Unlike brute-force approaches (always call search), Toolformer learns to call tools only when beneficial. On language modeling tasks without factual gaps, it behaves like a standard LM.

Single-call tools — The architecture is elegant for non-interactive tools (call once, get result).


Limitations and What Changed

Single tool calls only — Toolformer was designed for single API calls embedded in text generation. It doesn’t support iterative tool use (call a tool, observe the result, call another tool based on that result). This is exactly what ReAct adds.

No multi-turn reasoning — The tool call format [API(...) → result] is inline. It doesn’t support the Thought/Action/Observation loop needed for complex agent tasks.

Small scale — The paper used GPT-J, a 6.7B-parameter model. Whether the approach scales to frontier models with many tools was left open.

Training required — Unlike prompting-based approaches (ReAct), Toolformer requires fine-tuning the model. This limits flexibility.


Toolformer’s Influence on Function Calling

Modern LLM APIs implement tool use differently from Toolformer, but the core insight carries through:

# OpenAI function calling — influenced by Toolformer's tool use paradigm
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Math expression to evaluate, e.g. '5 - 3 * 0.75'"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "I bought 3 pencils at $0.75 each and paid with $5. What's my change?"}
    ],
    tools=tools,
    tool_choice="auto",  # model decides when to call the tool
)

# If the model calls the tool:
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.arguments)
# → {"expression": "5 - 3 * 0.75"}

GPT-4's training likely involved Toolformer-inspired techniques in which the model learned to recognize when to call external functions.
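Completing the round trip means executing the requested tool locally and returning the result as a tool message. A sketch of that dispatch step — the message shape follows the OpenAI chat format, the local calculator is a stand-in, and `SimpleNamespace` here only mimics the SDK's tool-call object:

```python
import json
from types import SimpleNamespace

def run_tool_call(tool_call) -> dict:
    """Execute a requested function call locally and build the reply message."""
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "calculator":
        result = eval(args["expression"])  # demo only; use a safe parser in practice
    else:
        raise ValueError(f"unknown tool: {tool_call.function.name}")
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result),
    }

# Stand-in for the SDK's tool_call object (attribute shape mirrors the SDK).
demo_call = SimpleNamespace(
    id="call_demo",
    function=SimpleNamespace(name="calculator",
                             arguments='{"expression": "5 - 3 * 0.75"}'),
)
print(run_tool_call(demo_call))
# → {'role': 'tool', 'tool_call_id': 'call_demo', 'content': '2.75'}
```

Appending this message to the conversation and calling `chat.completions.create` again would then produce the final natural-language answer.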


Frequently Asked Questions

Is Toolformer’s approach still used?

The self-supervised data generation technique is likely used internally by labs, though details are not disclosed. Function calling in GPT-4 and Claude was presumably trained on similar data — demonstrations of when tool calls are appropriate. The exact technique varies by model.

What’s the difference between Toolformer and ReAct?

Toolformer trains the model to use tools (fine-tuning required). ReAct prompts the model to use tools (no training). Toolformer is inline (tool calls embedded in text flow). ReAct is interactive (observe result, reason, act again). Modern systems often use both: a model trained with tool use (Toolformer-style) guided by ReAct-style prompting.

Can I use Toolformer with open-source models?

Yes. The technique works on any sufficiently capable language model. Community implementations (such as ToolLLaMA) apply similar techniques to Llama and other open models.

Why doesn’t Toolformer support multi-step tool use?

The paper’s formulation puts API calls inline with text generation. Supporting iterative tool use would require a more complex generation loop — which is what frameworks like LangChain and AutoGen implement at the orchestration layer, not the model level.
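That orchestration-layer loop can be sketched in a few lines. The policy stub, action tuples, and observation format below are illustrative, not taken from any specific framework:

```python
def agent_loop(policy, tools: dict, task: str, max_steps: int = 5):
    """Iterative tool use: act, observe, feed the observation back, repeat."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = policy(history)  # ("ToolName", input) or ("final", answer)
        if action[0] == "final":
            return action[1]
        observation = tools[action[0]](action[1])
        history.append(("observation", observation))
    raise RuntimeError("no final answer within step budget")

# Stubbed policy: one calculator call, then answer with the observed result.
def policy(history):
    last = history[-1]
    if last[0] == "observation":
        return ("final", f"Your change is ${last[1]}.")
    return ("Calculator", "5 - 3 * 0.75")

tools = {"Calculator": lambda e: str(eval(e))}  # eval is unsafe; demo only
print(agent_loop(policy, tools, "Change from $5 for 3 pencils at $0.75?"))
# → Your change is $2.75.
```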

