## What Is the Data Interpreter?
The MetaGPT Data Interpreter is a specialized agent within MetaGPT designed specifically for data analysis tasks. Unlike the general software company simulation, the Data Interpreter:
- Breaks complex data tasks into executable steps
- Writes and runs Python code autonomously in a feedback loop
- Produces visualizations, statistical summaries, and reports
- Iterates on errors until the analysis is complete
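The behaviors above amount to a write-run-reflect loop. Here is a minimal, purely illustrative sketch of that pattern (the function and names are hypothetical, not MetaGPT's actual internals):

```python
# Hypothetical sketch of a reflection loop: run generated code, observe
# failures, and move on to a "corrected" attempt. Not MetaGPT's real API.

def run_with_reflection(code_attempts, max_retries=3):
    """Execute candidate code snippets in order, advancing to the next
    attempt when one raises, mimicking self-correction on errors."""
    last_error = None
    for attempt, code in enumerate(code_attempts[:max_retries], start=1):
        try:
            namespace = {}
            exec(code, namespace)          # run the generated code
            return namespace.get("result"), attempt
        except Exception as exc:           # observe the error, then retry
            last_error = exc
    raise RuntimeError(f"all attempts failed: {last_error}")

# The first attempt has a bug (NameError); the "reflected" second succeeds.
attempts = [
    "result = undefined_variable + 1",
    "result = sum(range(10))",
]
value, tries = run_with_reflection(attempts)
print(value, tries)  # 45 2
```

In the real agent, the next attempt comes from the LLM re-reading the traceback rather than from a pre-written list.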
It was benchmarked on InfiAgent-DABench, where it outperforms GPT-4 with Code Interpreter on many tasks.
## Quick Start
```python
import asyncio
from metagpt.roles.di.data_interpreter import DataInterpreter

async def analyze():
    interpreter = DataInterpreter()
    await interpreter.run(
        "Load the Iris dataset, perform EDA, create visualizations, "
        "and train a classifier. Report accuracy."
    )

asyncio.run(analyze())
```
The Data Interpreter will:
- Write Python code to load and explore the dataset
- Execute it, observe output/errors
- Iterate until complete
- Save results and visualizations to the workspace
## Analyzing a Local Dataset
```python
import asyncio
from metagpt.roles.di.data_interpreter import DataInterpreter

async def analyze_sales_data():
    interpreter = DataInterpreter()
    requirement = """
    Analyze the sales data in 'sales_2025.csv':
    1. Load and display basic statistics (shape, dtypes, missing values)
    2. Show top 10 products by revenue
    3. Plot monthly revenue trend as a line chart
    4. Identify the best and worst performing regions
    5. Calculate year-over-year growth if multiple years present
    6. Save all charts as PNG files
    """
    await interpreter.run(requirement)

asyncio.run(analyze_sales_data())
```
Make sure the CSV file is in the working directory (or provide an absolute path in the requirement string).
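If you want to try the example without real data, a small stand-in file is enough. This hypothetical generator creates a `sales_2025.csv` with illustrative columns (the schema is an assumption, not one the Data Interpreter requires):

```python
# Generate a tiny, deterministic sales_2025.csv so the example above can be
# run end to end. Column names here are illustrative only.
import csv
import random

random.seed(0)  # deterministic revenue figures
rows = [
    {"date": f"2025-{month:02d}-15", "product": product, "region": region,
     "revenue": round(random.uniform(100, 1000), 2)}
    for month in range(1, 13)
    for product in ("widget", "gadget")
    for region in ("north", "south")
]

with open("sales_2025.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "product", "region", "revenue"])
    writer.writeheader()
    writer.writerows(rows)

print(len(rows))  # 48 rows: 12 months x 2 products x 2 regions
```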
## Configuring the Data Interpreter
```python
from metagpt.roles.di.data_interpreter import DataInterpreter

# Customize the interpreter
interpreter = DataInterpreter(
    use_reflection=True,  # enables self-correction on errors
    tools=["<all>"],      # available tools (default: all)
)

# The interpreter can use these tool categories:
# - "data_preprocess": pandas, numpy operations
# - "visualization": matplotlib, seaborn, plotly
# - "feature_engineering": sklearn transformers
# - "model_train": sklearn, xgboost models
# - "model_evaluate": metrics, confusion matrix
# - "web_search": look up formulas or documentation
```
## Step-by-Step Task Decomposition
The Data Interpreter breaks your request into tasks with code. You can see this in action:
```python
import asyncio
from metagpt.roles.di.data_interpreter import DataInterpreter

async def main():
    interpreter = DataInterpreter(use_reflection=True)
    # Watch the task plan being built
    result = await interpreter.run(
        "Predict house prices using the Boston Housing dataset. "
        "Compare Linear Regression, Random Forest, and XGBoost. "
        "Report RMSE and feature importance for the best model."
    )
    # The interpreter's plan will look like:
    # Task 1: Load dataset
    # Task 2: EDA and preprocessing
    # Task 3: Feature engineering
    # Task 4: Train models and compare
    # Task 5: Visualize results

asyncio.run(main())
```
## Real-World Example: Financial Analysis
```python
import asyncio
from metagpt.roles.di.data_interpreter import DataInterpreter

async def financial_analysis():
    interpreter = DataInterpreter()
    await interpreter.run("""
    Download AAPL, GOOGL, MSFT stock data for the last 2 years using yfinance.
    Perform the following analysis:
    1. Calculate and plot cumulative returns for each stock
    2. Compute correlation matrix between stocks
    3. Calculate Sharpe ratio for each (assume risk-free rate 4.5%)
    4. Identify the 5 best and worst trading days for each stock
    5. Build a simple portfolio with equal weights, plot vs S&P 500
    6. Generate a comprehensive report as a markdown file
    """)

asyncio.run(financial_analysis())
```
The Data Interpreter automatically handles `pip install yfinance` if needed (it detects import errors and installs missing packages).
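The detect-and-install pattern itself is simple. Here is a minimal sketch of the idea (the `ensure_package` helper is hypothetical, not MetaGPT's actual code):

```python
# Sketch of import-error-driven installation: try the import, and on failure
# shell out to pip before retrying. Illustrative only.
import importlib
import subprocess
import sys

def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        return importlib.import_module(module_name)

# A stdlib module resolves immediately, so no install is triggered here.
json_mod = ensure_package("json")
```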
## Machine Learning Pipeline
```python
import asyncio
from metagpt.roles.di.data_interpreter import DataInterpreter

async def ml_pipeline():
    interpreter = DataInterpreter()
    await interpreter.run("""
    Build a customer churn prediction model:
    Dataset: 'churn_data.csv' with columns: customer_id, tenure, monthly_charges,
    total_charges, contract_type, churn (target)
    Steps:
    1. Load data, handle missing values, encode categoricals
    2. Perform feature engineering (e.g., charges per month ratio)
    3. Split 80/20 train/test with stratification
    4. Train and tune: LogisticRegression, RandomForest, XGBoost
    5. Select best model by F1 score on validation set
    6. Plot ROC curve and confusion matrix for best model
    7. Output feature importance and top 5 churn indicators
    8. Save the trained model as 'churn_model.pkl'
    """)

asyncio.run(ml_pipeline())
```
## Using Custom Tools
The Data Interpreter supports custom tools you register:
```python
from metagpt.tools.tool_registry import register_tool

@register_tool(include_functions=["analyze_text"])
class TextAnalysisTool:
    """Custom text analysis utilities."""

    def analyze_text(self, text: str) -> dict:
        """
        Analyze text sentiment and extract keywords.

        Args:
            text: Input text to analyze

        Returns:
            Dict with 'sentiment' (positive/negative/neutral) and 'keywords' (list)
        """
        words = text.lower().split()
        positive_words = {"good", "great", "excellent", "amazing", "best"}
        negative_words = {"bad", "poor", "terrible", "worst", "awful"}
        pos = sum(1 for w in words if w in positive_words)
        neg = sum(1 for w in words if w in negative_words)
        sentiment = "positive" if pos > neg else "negative" if neg > pos else "neutral"
        keywords = [w for w in set(words) if len(w) > 4][:10]
        return {"sentiment": sentiment, "keywords": keywords}
```
Now the Data Interpreter can call `analyze_text()` in its generated code.
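Before handing a tool to the agent, it is worth sanity-checking it directly. The class is repeated here without the registry decorator so the snippet stands alone:

```python
# Direct call to the tool's method, outside the interpreter. Same logic as
# the registered class above, duplicated so this snippet is self-contained.
class TextAnalysisTool:
    def analyze_text(self, text: str) -> dict:
        words = text.lower().split()
        positive_words = {"good", "great", "excellent", "amazing", "best"}
        negative_words = {"bad", "poor", "terrible", "worst", "awful"}
        pos = sum(1 for w in words if w in positive_words)
        neg = sum(1 for w in words if w in negative_words)
        sentiment = ("positive" if pos > neg
                     else "negative" if neg > pos else "neutral")
        keywords = [w for w in set(words) if len(w) > 4][:10]
        return {"sentiment": sentiment, "keywords": keywords}

result = TextAnalysisTool().analyze_text("a great and excellent product")
print(result["sentiment"])  # positive
```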
## Output Artifacts
After running, check the workspace for generated files:
```
workspace/
└── data_interpreter_YYYYMMDD_HHMMSS/
    ├── code/
    │   ├── task1_load_data.py
    │   ├── task2_eda.py
    │   └── task3_model.py
    ├── charts/
    │   ├── monthly_revenue.png
    │   ├── correlation_matrix.png
    │   └── roc_curve.png
    └── report.md
```
## Integration with Notebooks
Export Data Interpreter output to a Jupyter notebook:
```python
import asyncio
import glob

import nbformat
from metagpt.roles.di.data_interpreter import DataInterpreter

async def to_notebook():
    interpreter = DataInterpreter()
    await interpreter.run(
        "Analyze the Titanic survival dataset with full EDA and a survival predictor."
    )
    # Assemble the generated code files into a .ipynb with nbformat
    nb = nbformat.v4.new_notebook()
    code_files = sorted(glob.glob("./workspace/**/code/*.py", recursive=True))
    for filepath in code_files:
        with open(filepath) as f:
            nb.cells.append(nbformat.v4.new_code_cell(f.read()))
    with open("analysis.ipynb", "w") as f:
        nbformat.write(nb, f)

asyncio.run(to_notebook())
```
## Frequently Asked Questions
### How is the Data Interpreter different from ChatGPT’s Code Interpreter?
ChatGPT’s Code Interpreter executes code inside a single chat session without an explicit task plan. MetaGPT’s Data Interpreter:
- Breaks tasks into planned sub-tasks
- Self-corrects when code fails (reflection loop)
- Maintains state across multiple iterations
- Can use external tools and web search
- Generates complete, reproducible code files
### What datasets can it work with?
Any format pandas reads: CSV, Excel, JSON, Parquet, SQL databases. For SQL, provide a connection string in the requirement: “Connect to postgresql://localhost/db and analyze the orders table.”
### How do I prevent the interpreter from using the internet?
Set tools to exclude web search:
```python
interpreter = DataInterpreter(tools=["data_preprocess", "visualization", "model_train"])
```
### Can it handle very large datasets?
For files > 1GB, the interpreter should use chunked loading. Include this in your requirement: “The dataset is 5GB — use chunked loading with pandas.” The interpreter will adapt its approach.
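The chunked approach the interpreter falls back on looks roughly like this pandas pattern (the file name and column names are illustrative; a tiny stand-in file is generated so the sketch is runnable):

```python
import pandas as pd

# Write a tiny CSV stand-in (a real file would be gigabytes).
pd.DataFrame({"region": ["north", "south"] * 50,
              "revenue": range(100)}).to_csv("big_sales.csv", index=False)

# Stream the file in fixed-size chunks and aggregate incrementally,
# so the full dataset never has to fit in memory at once.
total = 0.0
for chunk in pd.read_csv("big_sales.csv", chunksize=25):
    total += chunk["revenue"].sum()

print(total)  # 4950.0
```

The same pattern extends to groupby-style aggregation by accumulating partial results per chunk and combining them at the end.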
### How much does a typical analysis cost?
A moderate data analysis task (5–10 steps): $0.10–$0.50 with gpt-4o-mini, $0.50–$2.00 with gpt-4o. Complex ML pipelines: $1–$5.
## Next Steps
- MetaGPT Custom Roles and Actions — Build custom agents for your domain
- MetaGPT Use Cases and Examples — See more real-world applications
- LlamaIndex Agents and Tools — Compare with LlamaIndex’s data-oriented agents