
Agent Evaluation and Benchmarks: How to Measure Multi-Agent Performance

#evaluation #benchmarks #swe-bench #gaia #hal #adp #agent-data-protocol #testing

You launched your multi-agent system, watched it work on test inputs, and it looked great. Then it failed a real task — not catastrophically, but quietly, producing a subtly wrong output that a downstream process silently consumed. By the time you noticed, three other tasks had already built on that wrong foundation.

This is the core problem with agent evaluation. Unlike a text classifier where you feed in an input and check an output, evaluating an agent means tracking a chain of decisions across time, tools, and state changes — where the cost of a small early error compounds with every subsequent step.

This article covers how agent evaluation differs from classic LLM evaluation, how the two main evaluation categories serve different purposes, how modern grading methods have evolved, which benchmarks the research community trusts in 2026, and how to design your own evaluation suite for a production system.


Why Agent Evaluation Is Different

The most important thing to understand about agent evaluation is that it is not a scaled-up version of LLM evaluation. It is a structurally different problem.

Three properties separate agent evaluation from standard model evaluation:

1. Multi-turn sequential decisions — errors compound exponentially.

A single LLM call produces one output. A multi-agent workflow produces a chain of decisions, where each step’s input depends on the previous step’s output. The compounding effect is severe. If each step in a ten-step workflow succeeds 90% of the time, the probability that the entire workflow completes correctly is roughly 35%. A system that looks highly reliable at the step level can be dramatically unreliable at the task level.

This means that a 5% improvement in per-step accuracy can produce a 20-30% improvement in end-to-end task success — and small regressions can cause dramatic end-to-end failures that are not obvious from step-level metrics.
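
The compounding arithmetic above is easy to verify directly. This sketch assumes independent steps with identical per-step success probability, which is a simplification of real workflows:

```python
def workflow_success(step_success: float, steps: int) -> float:
    """Probability a sequential workflow completes correctly, assuming
    independent steps that each succeed with the same probability."""
    return step_success ** steps

# 90% per-step accuracy over ten steps: roughly 35% end-to-end
p_baseline = workflow_success(0.90, 10)

# A 5-point per-step improvement compounds into a large end-to-end gain
p_improved = workflow_success(0.95, 10)
```

Real steps are not independent — an early error often makes later steps harder — so the true end-to-end rate can be even lower than this model predicts.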

2. State-changing actions — you cannot “undo” for testing.

When a text model produces an answer you do not like, you simply try again. When an agent deletes a file, pushes a git commit, posts a message, or executes a database query, you cannot undo those actions to re-test from the same state. This makes evaluation fundamentally different: you need isolated sandboxed environments that can be reset between test runs. The cost of running evaluations is higher, and the infrastructure requirements are more demanding.

3. Non-determinism at scale — same prompt, different path.

LLMs are stochastic. A single agent with a single task can produce wildly different execution paths across runs, especially in long workflows where early choices branch into different sub-strategies. Evaluating an agent once and reporting pass/fail is nearly meaningless. You need multiple runs per task, and your evaluation infrastructure must aggregate over that variance.
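
One minimal way to aggregate over run-to-run variance is to score each task as the mean over repeated runs rather than a single pass/fail. The task names and run counts below are illustrative:

```python
from statistics import mean

def aggregate_runs(results: dict[str, list[bool]]) -> dict[str, float]:
    """Per-task pass rate averaged over repeated runs of the same task,
    plus an overall mean -- single-run pass/fail is too noisy to report."""
    per_task = {task: mean(runs) for task, runs in results.items()}
    per_task["__overall__"] = mean(per_task.values())
    return per_task

# Hypothetical: three tasks, each run five times
runs = {
    "refactor-module": [True, True, False, True, True],
    "fix-flaky-test":  [True, False, False, True, False],
    "write-migration": [True, True, True, True, True],
}
scores = aggregate_runs(runs)
```

Five runs per task is a common floor; long-horizon tasks with heavy branching may need more before the per-task rate stabilizes.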

The table below contrasts the two evaluation paradigms directly:

| Dimension | LLM Evaluation | Agent Evaluation |
| --- | --- | --- |
| Input | Static prompt | Initial task description |
| Output | Single text response | Multi-step trajectory + final output |
| Success criteria | Quality of one response | Correct end state after N actions |
| Error type | Single-step failure | Compounding failure across steps |
| Measurement unit | Per-response score | Per-task + per-trajectory score |
| Environment | Stateless | Stateful, requires reset between runs |
| Variance | Low (single call) | High (long chains, branching) |

The Two Categories of Agent Evaluation

Anthropic’s internal research on agent systems distinguishes two fundamentally different purposes for evaluation, each requiring a different design philosophy.

Capability Evaluations

Capability evaluations answer the question: where does this agent still fail?

The design goal is to find the system’s current frontier — the boundary between tasks it handles reliably and tasks it cannot yet solve. A good capability evaluation suite is deliberately challenging. High failure rates on capability evals are not a sign that something went wrong; they are the signal the evaluation is supposed to produce. Tasks where the agent consistently fails tell you exactly where to focus development effort.

Capability evals should be structured as a representative sample of tasks that are at the edge of what the agent can currently do. They should span a range of difficulty levels, task types, and environmental conditions. The output you want is a capability map: “this agent reliably handles X, struggles with Y, and consistently fails at Z.”

Capability evals are run periodically — monthly or after significant model or architecture changes — to track improvement over time.

Regression Evaluations

Regression evaluations answer the opposite question: does this agent still do the things it did before?

The design goal is to protect existing functionality. A regression eval is not supposed to be hard. It is supposed to test behaviors the system already handles correctly, and it is supposed to pass near 100% of the time. If a regression eval fails after a system update, that is a rollback trigger — something that was working is now broken.

Regression evals run automatically on every deployment. They are the automated quality gate between a code change and production traffic.
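
As a deployment gate, the check itself can be trivially simple — the value is in the discipline of running it on every release. A minimal sketch, with an illustrative threshold:

```python
def regression_gate(pass_count: int, total: int, threshold: float = 0.98) -> bool:
    """Block a deployment when the fixed regression suite dips below a
    near-100% pass rate. The 0.98 threshold is illustrative, not prescriptive."""
    return total > 0 and pass_count / total >= threshold

regression_gate(100, 100)  # passes: safe to deploy
regression_gate(96, 100)   # fails: rollback trigger
```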

The critical discipline here is keeping regression test cases fixed. As the agent improves and formerly-hard tasks become easy, you should move those tasks into the regression suite and add new, harder tasks to the capability suite. But the regression suite itself should not be updated to match the agent’s current state — it should continue to test the same fixed behaviors over time.

| Property | Capability Evaluation | Regression Evaluation |
| --- | --- | --- |
| Purpose | Find the frontier of failure | Protect existing functionality |
| Expected pass rate | Variable — failure is informative | Near 100% — failure triggers rollback |
| Run frequency | Monthly or post-major-update | Every deployment |
| Task set | Representative of current limits | Fixed, tests known-working behaviors |
| Difficulty | Hard tasks at the frontier | Tasks the agent already handles |
| Output | Capability map | Pass/fail deployment gate |

Grading Methods: From Simple to Semantic

Once you have defined what tasks to test and what constitutes a pass, you need a method to grade the agent’s output. Grading methods have evolved considerably, and each has appropriate use cases.

Code-Based Graders

Code-based graders check agent outputs programmatically. The simplest version is string matching: does the agent’s final output contain a specific keyword? A more sophisticated version compares structured outputs exactly: does the JSON returned match the expected JSON schema? Does the generated code compile and produce the expected result when executed?

Code-based graders are fast, deterministic, and cheap. They produce no false positives — if they pass, the output genuinely matched the expected structure.

Their weakness is brittleness. If an agent produces the correct answer in a slightly different format — a date written as “April 8, 2026” instead of “2026-04-08”, for example — a string-matching grader will mark it as a failure even though the agent was correct. This makes code-based graders poorly suited for open-ended tasks where correctness is a matter of meaning rather than structure.

Code-based graders are the right choice for tasks with guaranteed output formats: structured JSON, SQL queries, code that must compile and produce specific output, API calls with exact expected parameters.
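
A code-based grader for structured output can be only a few lines. This sketch grades a JSON output by exact comparison, and also shows the brittleness described above — a semantically equivalent answer in a different type or format fails:

```python
import json

def grade_json_output(raw_output: str, expected: dict) -> bool:
    """Code-based grader: parse the agent's final output as JSON and
    compare it to the expected structure exactly. Fast, deterministic,
    no false positives -- but brittle to equivalent variations."""
    try:
        return json.loads(raw_output) == expected
    except json.JSONDecodeError:
        return False

grade_json_output('{"status": "done", "count": 3}', {"status": "done", "count": 3})
# Brittle by design: "3" as a string fails even though the agent was "right"
grade_json_output('{"status": "done", "count": "3"}', {"status": "done", "count": 3})
```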

Model-Based Graders

Model-based graders use an LLM to evaluate the agent’s output semantically. Instead of checking whether the output matches an exact pattern, you send the task description, the agent’s output, and an evaluation rubric to a judge model, and ask it to assess whether the agent’s response is correct.

This approach handles natural language variation naturally. It can assess intent, not just structure. “Did the agent correctly identify all three security vulnerabilities in this code?” is a question a model-based grader can answer even if the agent described each vulnerability in slightly different terms than the expected answer.

The cost is one additional LLM call per evaluation step, and the possibility of grader error — an LLM judge can be wrong, especially on edge cases. Model-based graders should be calibrated: run them alongside human graders on a subset of tasks to establish how often the model judge agrees with a human expert.

Model-based graders are the right choice for open-ended tasks where correctness is semantic: research quality, code review accuracy, multi-step reasoning, agent decision explanations.
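
The model call itself is provider-specific, but the scaffolding around a judge is generic: assemble task, output, and rubric into a prompt, then parse a constrained verdict. A sketch, with the prompt format and PASS/FAIL convention as assumptions:

```python
def build_judge_prompt(task: str, output: str, rubric: str) -> str:
    """Assemble the prompt for a judge model. The actual LLM call is
    provider-specific and omitted here."""
    return (
        f"Task:\n{task}\n\nAgent output:\n{output}\n\nRubric:\n{rubric}\n\n"
        "Answer with PASS or FAIL on the first line, then a one-sentence reason."
    )

def parse_verdict(judge_response: str) -> bool:
    """Treat anything other than an explicit leading PASS as a failure,
    so a confused judge fails closed rather than open."""
    lines = judge_response.strip().splitlines()
    return bool(lines) and lines[0].strip().upper() == "PASS"
```

Constraining the judge to a machine-parseable first line keeps grading deterministic on the parsing side, even though the judgment itself is not.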

Human Graders

Human graders — subject matter experts manually reviewing agent outputs — are the ground truth of evaluation. They are the most accurate method and also the most expensive.

In practice, human grading is used in three scenarios: as the baseline for calibrating automated graders, for ambiguous cases that code-based and model-based graders cannot reliably handle, and for high-stakes decisions where the cost of a grader error is high.

A production-scale evaluation operation typically uses automated graders for the bulk of assessment and applies human review to a spot-check sample — often 5 to 10 percent of evaluations — to catch grader drift and maintain calibration over time.
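
Selecting that spot-check sample is worth doing reproducibly, so that reviewers and grader-drift analysis see the same slice. A minimal sketch using a seeded sample:

```python
import random

def spot_check_sample(task_ids: list[str], fraction: float = 0.10,
                      seed: int = 0) -> list[str]:
    """Reproducible sample of graded evaluations routed to human review
    (5-10% is typical). Fixed seed keeps the sample stable across reruns."""
    k = max(1, round(len(task_ids) * fraction))
    return random.Random(seed).sample(task_ids, k)
```

In practice you may also want stratified sampling — oversampling high-stakes or recently-changed task categories rather than sampling uniformly.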


The Major Benchmarks in 2026

The agent research community has converged on several standard benchmarks for measuring agent capability. Understanding what each benchmark actually tests — and what it does not — is essential for interpreting published results.

SWE-bench Verified

SWE-bench tasks an agent with resolving real GitHub issues in real open-source repositories. The agent receives the issue text, access to the repository, and must produce a code patch that makes failing tests pass without breaking existing tests.

This benchmark is demanding precisely because it requires the full stack of software engineering cognition: reading unfamiliar code, understanding a bug report, navigating a large codebase, writing a correct patch, and verifying it against a test suite — all without human guidance during the task.

Progress on SWE-bench has been dramatic. Top systems in 2026 achieve 75-80%+ solve rates, up from roughly 20% in 2024. One striking finding from 2026 research is that a surprisingly simple harness — approximately 100 lines of Python scaffolding wrapping current frontier models — achieves performance competitive with much more complex purpose-built systems. This suggests that raw model capability improvement is driving a large share of the benchmark gains, not just harness sophistication.

SWE-bench is the standard benchmark for comparing coding agent frameworks and evaluating base model capability on software engineering tasks.

GAIA

GAIA (General AI Assistant) tests agents on tasks that a human with access to Google Search can solve in roughly five minutes — retrieving a specific fact from a webpage, doing a multi-step calculation, extracting structured information from a document, combining web search with numerical reasoning.

These tasks sound simple, but agents find them surprisingly difficult. The benchmark’s design insight is that “easy for a human” is not the same as “easy for an agent.” Agents often fail at tasks that require chaining together simple sub-steps in the right order, or that require knowing when to stop searching and commit to an answer.

GAIA is the standard benchmark for measuring general-purpose agent reliability on real-world assistant tasks.

AgentBench

AgentBench tests agents across a diverse set of execution environments: operating system tasks, web browser navigation, database query and modification, knowledge graph traversal, and household simulation. The multi-environment design surfaces whether an agent’s capability is general or environment-specific.

A system that scores well on coding tasks may perform poorly on web navigation tasks even when both require planning and tool use. AgentBench makes those capability gaps visible across domains.

HAL — Holistic Agent Leaderboard

HAL, developed at Princeton, is the benchmark that researchers in 2026 actually trust for evaluating production deployment readiness. Its key innovation is going beyond task solve rate to measure four dimensions simultaneously: reliability, robustness, cost-efficiency, and safety.

The reliability dimension measures whether the system performs consistently across repeated runs of the same task — not just whether it can solve a task once. The robustness dimension measures how performance degrades under distribution shift: slightly rephrased instructions, different environments, unexpected input formats.

The cost-efficiency dimension is HAL’s most practically important contribution. A system that solves 80% of benchmark tasks at a cost of $50 per task is not necessarily more useful than one that solves 75% of tasks for $2 per task. HAL’s leaderboard makes cost visible alongside accuracy, which is essential for evaluating whether a system is actually deployable in production at realistic scale.

The safety dimension measures whether the system refuses or flags appropriately harmful or ambiguous requests, rather than blindly complying.

| Benchmark | Domain | What It Tests | Key Metric |
| --- | --- | --- | --- |
| SWE-bench Verified | Software engineering | Real GitHub issue resolution | % tasks solved |
| GAIA | General assistance | Multi-step real-world tasks | % tasks solved by difficulty level |
| AgentBench | Multi-environment | Cross-domain robustness | Per-environment score |
| HAL | Production readiness | Reliability, cost, safety holistically | Multi-dimensional dashboard |

Designing Your Own Evaluation Suite

Public benchmarks tell you how your system compares to others on standard tasks. They do not tell you how it performs on your specific use case. For production systems, you need an evaluation suite designed around your actual task distribution.

Step 1: Define Your Task Distribution

List 20 to 50 representative tasks that your agent must handle in production. The list should span three difficulty tiers: easy tasks your agent should pass 100% of the time, medium tasks it should pass around 80% of the time, and hard tasks at the frontier of its current capability where a 50% pass rate is acceptable.

The list must also include edge cases: empty input, malformed input, ambiguous instructions, tasks that require the agent to appropriately refuse or ask for clarification rather than proceeding with a wrong assumption. Edge cases are where agents fail most often in production, and they are the category most commonly omitted from evaluation suites.
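
A task list with explicit tiers and target pass rates can be captured in a simple structure. The fields, task IDs, and descriptions below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalTask:
    task_id: str
    description: str
    tier: str                 # "easy" | "medium" | "hard"
    target_pass_rate: float   # expectation, e.g. 1.0 / 0.8 / 0.5 by tier
    is_edge_case: bool = False

# Hypothetical slice of a 20-50 task suite
SUITE = [
    EvalTask("t-001", "Summarize a well-formed support ticket", "easy", 1.00),
    EvalTask("t-014", "Resolve a ticket referencing two prior tickets", "medium", 0.80),
    EvalTask("t-031", "Triage a ticket with contradictory instructions", "hard", 0.50),
    EvalTask("t-040", "Handle an empty ticket body without guessing", "easy", 1.00,
             is_edge_case=True),
]
```

Making the target pass rate explicit per task lets later tooling distinguish "hard task failing as expected" from "easy task regressing".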

Step 2: Choose Your Grading Method

Match your grading method to the output type of each task. Structured output tasks — producing JSON in a specific schema, running code that must produce specific outputs, filling in specific form fields — should use code-based graders. Tasks involving semantic quality — assessing the correctness of a research summary, evaluating whether a code review caught the right issues — should use model-based graders with an explicit rubric. Any task where mistakes are high-stakes should include a human spot-check layer on top of automated grading.

Calibrate your model-based graders before relying on them. Run 50-100 tasks through both the model grader and a human expert, measure the agreement rate, and document where the model grader tends to fail. A grader with an 85% agreement rate is useful; one with 60% agreement rate is a liability.
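
The calibration measurement itself is a simple agreement rate between paired verdicts:

```python
def agreement_rate(model_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of calibration tasks where the model judge and a human
    expert agree. Run on 50-100 paired tasks before trusting the judge."""
    if len(model_verdicts) != len(human_verdicts) or not model_verdicts:
        raise ValueError("verdict lists must be equal-length and non-empty")
    matches = sum(m == h for m, h in zip(model_verdicts, human_verdicts))
    return matches / len(model_verdicts)
```

Beyond the headline rate, it is worth inspecting the disagreements themselves — a judge that is systematically lenient on one task type is a different problem from one that is randomly noisy.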

Step 3: Separate Capability from Regression

Once you have an initial task list, divide it into two pools. The regression pool contains tasks the agent already handles correctly — these will run on every deployment as your automated quality gate. The capability pool contains tasks at the frontier — these run periodically to track improvement over time.

As the agent improves and tasks move from “failing” to “reliably passing,” graduate those tasks from the capability pool into the regression pool. Never update the regression pool’s expected behavior to match a regression — if the agent now produces different output for a previously-passing task, that is still a regression, even if the new output looks reasonable.
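
The graduation rule can be sketched as a one-way move between pools, keyed on observed pass rate. The 0.95 threshold is an assumption, not a standard:

```python
def graduate(capability_pool: dict[str, float], regression_pool: set[str],
             reliability_threshold: float = 0.95) -> None:
    """Move tasks whose observed pass rate has reached the threshold out
    of the capability pool and into the fixed regression pool. Movement
    is one-way: nothing is ever removed from the regression pool."""
    ready = [t for t, rate in capability_pool.items() if rate >= reliability_threshold]
    for task_id in ready:
        del capability_pool[task_id]
        regression_pool.add(task_id)
```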

Step 4: Track Cost and Latency

Every evaluation run should record per-task token cost and wall-clock latency alongside accuracy scores. This data is essential for production decision-making. An agent that achieves a 5% accuracy improvement at 3x the cost may not be worth deploying, depending on the use case. HAL’s multi-dimensional leaderboard approach should inform your own evaluation infrastructure — cost-efficiency is a first-class metric, not an afterthought.
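
A per-task record that carries cost and latency alongside the verdict is enough to start. Field names and per-million-token prices below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    task_id: str
    passed: bool
    tokens_in: int
    tokens_out: int
    latency_s: float

    def cost_usd(self, in_per_mtok: float, out_per_mtok: float) -> float:
        """Token cost at per-million-token prices supplied by the caller."""
        return (self.tokens_in * in_per_mtok
                + self.tokens_out * out_per_mtok) / 1_000_000

record = EvalRecord("t-001", passed=True, tokens_in=12_000,
                    tokens_out=800, latency_s=41.5)
cost = record.cost_usd(3.0, 15.0)
```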


The Harness Entanglement Problem

A finding from 2026 Stanford research deserves particular attention because it complicates how we interpret all benchmark results: when an AI system was used to improve its own evaluation harness, the improved harness led the same model to significantly outperform Claude Code on TerminalBench 2.

The implication is profound. Benchmark scores do not reflect only the model’s capability — they reflect the joint quality of the model and the evaluation harness. A better harness can make the same model look dramatically more capable on the same benchmark.

This creates a responsible reporting obligation. When you publish benchmark results or compare systems, you must document the harness configuration, not just the model. Results produced with a heavily optimized harness are not directly comparable to results from a simple harness, even on the same benchmark.

A useful framework for thinking about this is to decompose each benchmark result into three components: Control (the harness — scaffolding, retry logic, tool availability, prompt formatting), Agency (the model — the AI being evaluated), and Runtime (the execution environment — compute, API latency, tool implementations). All three components contribute to the observed score. Reporting only the model while leaving the harness and runtime implicit makes results misleading.

Practically, this means that when comparing two agent systems using a benchmark, you should hold the harness and runtime constant and vary only the model. When you improve the harness for one system, rerun the comparison with the improved harness for all systems before reporting results.


Frequently Asked Questions

How many test cases do I need for a valid eval suite?

There is no universal minimum, but the practical floor for statistical validity is around 50 tasks per evaluation category. Fewer than 50 tasks produces pass rates with wide confidence intervals — a system that solves 60% of 20 tasks might have a true success rate anywhere from 38% to 79%. For regression evaluations, 50-100 tasks is typically sufficient if they cover the main behavioral dimensions of your system. For capability evaluations, more tasks at the frontier give you better resolution on where exactly the system’s limits lie. Start with 50 and expand as you discover gaps.
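
That uncertainty range can be computed with a Wilson score interval, one standard choice for binomial proportions at small n:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed pass rate -- shows how
    wide the uncertainty is for small evaluation suites."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_interval(12, 20)   # 60% of 20 tasks -> roughly (0.39, 0.78)
lo2, hi2 = wilson_interval(60, 100)  # same rate at n=100 is much tighter
```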

Can I use SWE-bench to evaluate agents on my own codebase?

SWE-bench is designed around a fixed set of open-source repositories, so you cannot directly apply it to private or custom codebases. However, the benchmark methodology — identifying real bugs in real code, verifying fixes with existing test suites — transfers directly. You can construct a private benchmark in the same style using your own repository’s issue history: select resolved issues where a test was added to verify the fix, use those test additions as your grading criterion, and present the pre-fix state of the repository as the agent’s starting point. This gives you a SWE-bench-style eval grounded in your actual codebase.
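
The selection step in that recipe is a filter over resolved-issue history. A sketch with a hypothetical issue record — in practice you would populate it from your tracker and git history:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResolvedIssue:
    issue_id: str
    fix_commit: str
    added_test_paths: tuple[str, ...]   # test files added alongside the fix

def select_eval_issues(history: list[ResolvedIssue]) -> list[ResolvedIssue]:
    """Keep only issues whose fix added a test: that test becomes the
    grading criterion, and the pre-fix commit is the agent's start state."""
    return [issue for issue in history if issue.added_test_paths]
```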

How do I prevent overfitting to a benchmark?

Overfitting to benchmarks is a real risk. Systems tuned heavily on SWE-bench training data may show strong SWE-bench numbers while being brittle on similar-but-different real tasks. The best protection is a combination of held-out test sets (never use benchmark test data for training), diversity in your evaluation suite (multiple benchmarks testing different aspects of capability), and regular comparison to production task performance (if benchmark scores improve but production success rates do not, you are overfitting). Also be wary of harness optimization specific to the benchmark — improvements that come from tuning prompt formatting specifically for a benchmark’s test cases do not generalize.

What is the cheapest way to start evaluating my agent?

Start with 20 to 30 representative tasks from your actual use case, manually labeled with expected outputs. Build simple code-based graders for any task with a structured expected output. For the remaining tasks, run them through a small LLM judge with a brief rubric and spot-check 30% of the judge’s results manually. This setup can be built in a day, costs almost nothing to run, and gives you a meaningful signal about whether system changes are improvements or regressions. Once you have this baseline, add the more sophisticated model-based grading and harness infrastructure iteratively as your system scales.


Next Steps

With a solid understanding of evaluation methodology, two natural directions follow:

Evaluation infrastructure requires a harness — the scaffolding layer that runs your agent, manages its environment, resets state between test runs, and collects results. The design of that harness directly affects your evaluation scores. See What Is Harness Engineering for a full treatment of harness architecture and the engineering decisions it entails.

If your harness uses natural language task specifications rather than code-defined procedures, the design of those specifications becomes a first-class concern. Natural Language Agent Harnesses covers how to write task descriptions that produce reliable, comparable evaluations across runs.

Once your evaluation infrastructure is in place, you will want to use it to compare frameworks. CrewAI vs AutoGen applies evaluation thinking to a concrete framework comparison — walking through how the two systems perform differently under the conditions that your own eval suite will surface.
