
Natural-Language Agent Harnesses (NLAH): The ICLR 2026 Breakthrough

#nlah #iclr-2026 #natural-language #harness #tsinghua #anthropic #research #runtime


Agent harness engineering has a scientific reproducibility problem. When two teams publish competing results, it is almost never clear whether the winning system had a better model, a better harness, or a better combination of both. The Natural-Language Agent Harness (NLAH) paper, accepted at ICLR 2026 from a joint Tsinghua University and Anthropic collaboration, proposes a structural fix: externalize harness control logic into human-readable natural-language contracts, and build a shared runtime that executes them. The result is a harness architecture that is portable, ablatable, and — for the first time — scientifically comparable across systems.

This article covers the NLAH architecture in detail, explains the Intelligent Harness Runtime (IHR) that powers it, examines the context rot problem that motivated its memory design, and synthesizes the model-versus-harness debate that NLAH’s standardized approach is intended to resolve.

The Problem With Code-Based Harnesses

Before NLAH, every major production agent system embedded its harness logic directly inside platform-specific code. Claude Code’s harness logic lives in TypeScript. OpenHands implements its orchestration in Python. Devin’s scaffolding was proprietary and never published at all. Each system made dozens of implicit decisions about retry limits, error recovery, task decomposition, and state management — all buried inside the implementation.

This created three compounding problems that NLAH’s authors identified as blocking scientific progress.

No portability. A harness designed for a Python agent runtime could not be ported to a JavaScript agent without a full rewrite. The control logic was inseparable from the language runtime. Teams building on different stacks had to independently rediscover the same harness patterns — or skip the harness sophistication entirely.

No comparability. When system A outperforms system B on a benchmark, the difference could come from the model, from the harness, or from the interaction between them. With code-based harnesses, there was no systematic way to hold the harness constant while varying the model — or vice versa. Every published result was an opaque bundle.

No ablation surface. Ablation studies — removing one component at a time to understand its contribution — are the standard tool for understanding complex systems. But if a harness is a monolithic Python class with 3,000 lines of orchestration logic, it is not clear what you would even ablate. There are no clean seams. Removing the retry logic also implicitly changes the error propagation behavior. The harness was not a scientific object; it was an engineering artifact.

The NLAH authors argued that the root cause was architectural: the harness control logic had never been separated from the code that implements it. Code encapsulates behavior and makes it opaque. Natural language externalizes behavior and makes it inspectable. The solution was to stop writing harnesses in code and start writing them in the medium humans actually read and edit: plain English.

NLAH: Externalizing the Control Logic

The Core Idea

Consider a concrete example of what the shift looks like. A traditional code-based harness might include a file-writing validation guard expressed as:

if agent_action.type == "write_file":
    if not agent_action.path.startswith(WORKSPACE_ROOT):
        raise SecurityException(f"Path {agent_action.path} is outside workspace")

In NLAH, the same constraint is a Contract written in plain English:

“Before writing any file, the agent must verify that the target path exists within the designated workspace directory. Any attempt to write outside this boundary must be rejected and logged as a constraint violation.”

These two representations encode the same behavior, but they have fundamentally different properties. The Python version is precise but requires a developer to read and modify. The natural-language Contract is readable by any project stakeholder, editable without ML expertise, and portable to any agent system that runs NLAH’s runtime engine — the IHR.

The IHR (Intelligent Harness Runtime) is the engine that bridges these two representations. It reads the Contracts at runtime, interprets them semantically, and enforces them on each agent action before execution. The agent never sees the enforcement code. It only sees its task, its tools, and the results of its actions.

The NLAH Architecture

NLAH defines six components, all specified in natural language. Together, they cover the complete surface area of harness design.

1. Contracts

Contracts are behavioral rules the agent must follow at all times. They are written declaratively — describing what must be true, not how to enforce it. That is the IHR’s job.

Contracts cover security boundaries, quality gates, escalation triggers, and communication norms. A well-designed Contract set for a coding agent might look like:

  • “All generated code must include at least one test case before it is considered complete.”
  • “When a test fails three consecutive times with the same error, escalate to human review rather than retrying.”
  • “External API calls must never include credentials in plain text.”

The critical design property is that Contracts are additive. A base NLAH configuration might have ten Contracts. A specific deployment can add domain-specific Contracts without touching the rest. Ablation studies remove Contracts one at a time and measure the effect on agent performance — which is exactly the scientific workflow that code-based harnesses could not support.
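The ablation workflow this enables can be sketched in a few lines. Everything here is illustrative, not from the paper: Contracts are modeled as plain strings, and `run_benchmark` is a stand-in for a real evaluation harness.

```python
# Sketch of Contract ablation, assuming Contracts are plain strings and
# some benchmark function scores an agent run under a given Contract set.
# All names and the scoring rule are illustrative; NLAH does not prescribe this API.

BASE_CONTRACTS = [
    "All generated code must include at least one test case.",
    "After three identical test failures, escalate to human review.",
    "External API calls must never include plain-text credentials.",
]

def run_benchmark(contracts):
    # Stand-in for a real evaluation: here, the score is simply
    # proportional to the number of active Contracts.
    return len(contracts) / len(BASE_CONTRACTS)

def ablation_study(contracts):
    """Remove one Contract at a time and record the score delta vs. baseline."""
    baseline = run_benchmark(contracts)
    deltas = {}
    for i, contract in enumerate(contracts):
        reduced = contracts[:i] + contracts[i + 1:]
        deltas[contract] = run_benchmark(reduced) - baseline
    return deltas

deltas = ablation_study(BASE_CONTRACTS)
```

In a real study, `run_benchmark` would execute the agent on a task suite under the reduced Contract set; the loop structure is the same.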

2. Roles

Roles define agent persona and responsibility. In multi-agent systems, Roles do more than assign names — they specify decision authority, allowed action types, and communication patterns.

A Role definition might read: “The Reviewer agent may inspect and flag code but may not directly modify files. It communicates findings to the Coder agent as structured comments beginning with REVIEW:. The Coder agent is required to address all REVIEW: comments before requesting final approval.”

This is richer than a system prompt. The Role is part of the harness, not the model’s context. The IHR enforces the behavioral constraints of each Role regardless of what the underlying model generates. If the Reviewer agent attempts a direct file write, the IHR blocks it based on the Role definition — even if the model’s reasoning justified the action.
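Harness-level Role enforcement can be sketched as follows. The `Role` fields and action names are assumptions for illustration; the point is that the check runs outside the model, regardless of what the model generates.

```python
# Sketch of Role enforcement at the harness layer. The Role structure and
# action-type names are illustrative assumptions, not the paper's schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    allowed_actions: frozenset

REVIEWER = Role("Reviewer", frozenset({"read_file", "post_comment"}))
CODER = Role("Coder", frozenset({"read_file", "write_file", "run_tests"}))

def enforce(role, action_type):
    """Block any action outside the Role's declared authority."""
    if action_type not in role.allowed_actions:
        return ("blocked", f"{role.name} may not perform {action_type}")
    return ("allowed", None)
```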

3. Stage Structure

Stage Structure decomposes the overall task into phases with explicit entry and exit conditions. Unlike prompt-based task decomposition — which lives in the context window and can be forgotten — Stage Structure is part of the harness contract.

A coding task might be structured as:

“Stage 1 (Analysis): Complete when the agent has produced a written plan identifying all files to be modified and the rationale for each change. Stage 2 (Implementation): Begin only when Stage 1 is complete. Complete when all tests pass and the agent has committed all changes. Stage 3 (Review): Begin only when Stage 2 is complete.”

The IHR tracks which Stage the agent is currently in and enforces that stage transitions only occur when exit conditions are met. An agent that tries to jump from Stage 1 to Stage 3 will be redirected. This is harness-level task management — it does not depend on the model remembering its own instructions.
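Stage tracking with enforced exit conditions can be sketched as a small state machine. The Stage names follow the example above; the state flags and predicates are illustrative assumptions.

```python
# Sketch of runtime-enforced Stage transitions: forward-only, one Stage at
# a time, and only when the current Stage's exit condition holds. The state
# flags (plan_written, tests_pass, committed) are illustrative.

STAGES = ["Analysis", "Implementation", "Review"]

def can_transition(current, target, state):
    """Allow only the next Stage, and only if the exit condition is met."""
    i, j = STAGES.index(current), STAGES.index(target)
    if j != i + 1:
        return False  # no skipping Stages, no jumping backward
    exit_conditions = {
        "Analysis": state.get("plan_written", False),
        "Implementation": state.get("tests_pass", False)
                          and state.get("committed", False),
    }
    return exit_conditions.get(current, False)
```

An agent proposing a jump from Analysis straight to Review is simply refused by the check, which is the "redirected" behavior described above.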

4. Adapters

Adapters are natural-language interface definitions between the harness and external tools. They describe how each tool should be called, what its outputs mean, and how the agent should interpret edge cases.

An Adapter for a code execution tool might include: “If the execution environment returns a non-zero exit code, treat this as a test failure regardless of whether the error message explicitly mentions a test. If execution times out after 30 seconds, treat this as an environment failure — do not retry more than once.”

Adapters make tool behavior explicit and inspectable. When a tool is updated or replaced, the Adapter is updated — and any agent using the NLAH harness immediately picks up the new behavior without code changes.
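The execution-tool Adapter quoted above can be sketched as an interpretation function over raw tool results. The result fields and category names are illustrative.

```python
# Sketch of an Adapter for a code-execution tool, mapping raw results onto
# harness-level interpretations per the Adapter text above. Field names and
# category strings are illustrative assumptions.

def interpret_execution(exit_code, timed_out, retries_so_far):
    """Apply the Adapter's rules to a raw execution result."""
    if timed_out:
        # Timeout is an environment failure: retry at most once.
        return "retry" if retries_so_far < 1 else "environment_failure"
    if exit_code != 0:
        # Non-zero exit is a test failure regardless of the error message.
        return "test_failure"
    return "success"
```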

5. Scripts

Scripts are reusable behavior patterns — analogous to functions in code, but expressed in natural language. They capture multi-step response sequences that the agent should follow in specific situations.

An error recovery Script might read:

“Error Recovery Procedure: (1) Log the full error message and the action that triggered it. (2) Revert the most recent file modification. (3) Analyze the root cause by inspecting the error, the reverted file, and the agent’s last five reasoning steps. (4) Formulate a corrected approach that avoids the identified root cause. (5) Retry the action using the corrected approach. If this retry also fails, escalate to human review.”

Scripts are invoked by the IHR when matching conditions are detected — for example, when a Contract specifies “on error, execute the Error Recovery Script.” Scripts make complex multi-step recovery behaviors first-class, portable, and debuggable.
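One way to picture a Script at runtime is as an ordered sequence of named steps the runtime walks through, with escalation wired to the final retry. The step names mirror the procedure above; the runner/escalation callables are illustrative assumptions.

```python
# Sketch of a Script as an ordered, named step sequence executed by the
# runtime when a matching condition fires. Step names and the callable
# interface are illustrative, not the paper's schema.

ERROR_RECOVERY_SCRIPT = [
    "log_error",
    "revert_last_modification",
    "analyze_root_cause",
    "formulate_corrected_approach",
    "retry_action",
]

def execute_script(script, step_runner, escalate):
    """Run each step in order; escalate if the final retry step fails."""
    for step in script:
        ok = step_runner(step)
        if step == "retry_action" and not ok:
            escalate("retry failed after recovery procedure")
            return "escalated"
    return "recovered"
```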

6. Failure Taxonomy

The Failure Taxonomy is a structured catalog of known failure modes and their prescribed responses. It is, in effect, a harness-level exception handler written in natural language.

| Failure Category | Trigger Condition | Prescribed Response |
| --- | --- | --- |
| Context overflow | Context window > 85% capacity | Summarize completed steps, persist summary to file, compress working memory, continue from summary |
| Constraint violation | Agent action blocked by Contract | Log violation, notify orchestrator, request alternative approach |
| Tool unavailability | External tool returns connection error | Wait 30 seconds, retry once; if still unavailable, mark task as blocked and surface to human |
| Reasoning loop | Agent repeats the same action three times | Break loop by forcing a plan-revision step before allowing further actions |
| Scope creep | Agent modifies files outside declared scope | Revert modifications, log violation, re-present scope constraints |

The Taxonomy makes failure handling explicit and auditable. When an agent fails in production, the failure mode can be classified against the Taxonomy and the prescribed response can be verified — or the Taxonomy can be extended with a new entry. This is failure handling as a scientific artifact rather than buried exception logic.
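Classification against the Taxonomy can be sketched as a dispatch table: a runtime signal is mapped to a category, and the category to its prescribed response. The categories mirror the table above; the signal fields and response labels are illustrative assumptions.

```python
# Sketch of Failure Taxonomy dispatch. Signal fields, thresholds, and
# response labels are illustrative; only the pattern (classify, then look
# up the prescribed response) is the point.

TAXONOMY = {
    "context_overflow": "summarize_and_compress",
    "constraint_violation": "log_notify_request_alternative",
    "tool_unavailability": "wait_retry_then_block",
    "reasoning_loop": "force_plan_revision",
    "scope_creep": "revert_and_represent_scope",
}

def classify(signal):
    """Map a runtime signal onto a Taxonomy category (None if unknown)."""
    if signal.get("context_usage", 0) > 0.85:
        return "context_overflow"
    if signal.get("repeated_action_count", 0) >= 3:
        return "reasoning_loop"
    if signal.get("blocked_by_contract"):
        return "constraint_violation"
    return None

def prescribed_response(signal):
    # An unclassified failure is itself surfaced for human review,
    # which is how the Taxonomy gets extended.
    return TAXONOMY.get(classify(signal), "extend_taxonomy_via_human_review")
```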

How IHR Works

The Intelligent Harness Runtime is a shared engine that interprets and enforces a set of NLAH components at agent runtime. The same IHR instance can service multiple agents simultaneously, each with its own Contract set and Role definitions.

At each agent decision point, the IHR performs a semantic evaluation:

  1. The agent proposes an action based on its reasoning.
  2. The IHR retrieves all applicable Contracts and Stage constraints for the current agent state.
  3. The IHR evaluates whether the proposed action complies with all active constraints.
  4. If compliant, the action proceeds.
  5. If non-compliant, the IHR either redirects the agent with an explanation or invokes the relevant Script or Failure Taxonomy entry.

The IHR acts as a semantic firewall. It does not inspect the model’s internal reasoning or weights — it evaluates the proposed action against externalized rules. The evaluation itself is performed by an LLM (typically a smaller, faster model than the task agent), which interprets the natural-language Contracts in the context of the proposed action.
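The decision-point loop can be sketched with the evaluator stubbed out. In the paper's design the evaluator is a smaller LLM judging a Contract against a proposed action; here it is a trivial keyword check, and all names are illustrative assumptions.

```python
# Sketch of one IHR decision-point evaluation. `evaluator` is any callable
# taking (contract_text, proposed_action) and returning True if compliant --
# an LLM in the paper's design, a toy keyword check here.

def ihr_step(proposed_action, contracts, evaluator, on_violation):
    """Return 'proceed' only if every active Contract approves the action."""
    for contract in contracts:
        if not evaluator(contract, proposed_action):
            on_violation(contract, proposed_action)
            return "redirect"
    return "proceed"

def stub_evaluator(contract, action):
    # Toy stand-in for semantic evaluation: block writes outside /workspace.
    if "write outside this boundary" in contract.lower():
        return action.get("path", "/workspace/x").startswith("/workspace")
    return True

contracts = [
    "Any attempt to write outside this boundary must be rejected and logged.",
]
```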

This semantic evaluation is both NLAH’s strength and its primary performance cost. The IHR adds latency at each decision point. The ablation results from the ICLR 2026 paper showed that this overhead was acceptable: NL-based harnesses achieved equivalent operational performance to code-based harnesses on both coding benchmarks and computer-use tasks, despite the additional evaluation step.

Context Rot: Why Agents Degrade Over Long Tasks

What Context Rot Is

Context rot is not a bug in any specific model — it is a structural consequence of using the context window as working memory over extended agent runs.

As an agent executes a long task, its context window fills with the accumulated trace of everything that has happened: the original task description, tool call results, reasoning steps, self-corrections, intermediate outputs, and re-statements of constraints. Early in a task, this information is dense with signal. After fifty steps, much of it is redundant noise — repeated error messages, tool outputs that have been superseded by subsequent actions, and earlier reasoning that has been overwritten by later decisions.

The signal-to-noise ratio degrades. Eventually, the model starts producing outputs that contradict its own earlier work. It “forgets” a constraint that was established forty steps ago. It repeats work that was already marked complete. It reasons as if it is at the beginning of the task rather than three-quarters of the way through. This is context rot — not a hallucination in the traditional sense, but a coherence failure caused by memory exhaustion.

Context rot is particularly severe for multi-hour tasks, tasks with many tool calls, and tasks that require tracking many interdependent constraints simultaneously. Coding tasks with dozens of files, computer-use tasks involving hundreds of UI interactions, and research tasks spanning multiple domains are all high-risk scenarios.

NLAH’s File-System Memory Solution

NLAH’s approach to context rot is architectural rather than prompting-based. The key insight is that the context window is a poor memory medium: it is fixed-size, it degrades with age, and its contents cannot be selectively retrieved. The file system, by contrast, is unlimited in size, persistent across agent runs, and supports targeted retrieval.

NLAH’s design externalizes agent state to the file system as a first-class principle. Decision trees, completed step logs, discovered constraints, in-progress work, and reasoning traces are all written to structured files that the agent can retrieve on demand. The context window stays lean — containing only the immediate task, the current stage context, and any recently retrieved file contents.

The memory architecture works as follows:

  • Decision log: every significant agent decision is written to a timestamped log file with the reasoning behind it. When the agent needs to recall a past decision, it retrieves the relevant log entry rather than scanning its context history.
  • Constraint registry: any constraint discovered during task execution (a dependency that cannot be modified, a test that must stay green, a file that is out of scope) is written to a structured constraint file and checked at each relevant action.
  • Stage state: the current stage, its entry conditions, and its completion criteria are persisted. Stage transitions write a checkpoint. If an agent crashes mid-task, it can resume from the last checkpoint rather than starting over.
  • Reasoning trace: the agent’s full reasoning trace — including dead ends and corrections — is written to a human-readable file. This trace is not re-fed to the agent (it would cause context bloat); it is available for human debugging and post-hoc analysis.

The benefit is not merely technical. The reasoning trace as a human-readable file means that any engineer on the team can inspect what the agent was thinking at any point during a long run — without requiring an LLM to interpret it and without access to the agent’s internal state. The harness becomes debuggable by humans.
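The decision-log and constraint-registry pattern can be sketched with append-only JSONL files and targeted retrieval. The file names and record layout are illustrative assumptions; the point is that only matching entries re-enter the context window.

```python
# Sketch of file-system working memory: append-only, timestamped records
# retrieved on demand instead of carried in the context window. File names
# and the record schema are illustrative assumptions.

import json
import tempfile
import time
from pathlib import Path

class TaskMemory:
    def __init__(self, root):
        self.root = Path(root)
        self.decisions = self.root / "decision_log.jsonl"
        self.constraints = self.root / "constraints.jsonl"

    def log_decision(self, decision, reasoning):
        record = {"ts": time.time(), "decision": decision, "reasoning": reasoning}
        with self.decisions.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def add_constraint(self, text):
        with self.constraints.open("a") as f:
            f.write(json.dumps({"constraint": text}) + "\n")

    def recall_decisions(self, keyword):
        """Targeted retrieval: only matching entries re-enter the context."""
        if not self.decisions.exists():
            return []
        records = [json.loads(line) for line in self.decisions.read_text().splitlines()]
        return [r for r in records if keyword in r["decision"]]

memory = TaskMemory(tempfile.mkdtemp())
memory.log_decision("use pytest for new tests", "project already depends on it")
memory.add_constraint("tests in tests/legacy must stay green")
```

Because the files are plain JSONL, the same records that serve as agent memory double as the human-readable audit trail described above.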

Connection to AGENTS.md

NLAH’s file-system memory approach mirrors the AGENTS.md pattern at a deeper level. Both reflect the same core insight: persistent external files are more reliable working memory than context window contents.

AGENTS.md stores the harness’s memory about the codebase and its conventions — what patterns to follow, what to avoid, what tools are available. It is the harness’s long-term knowledge, persisted across agent runs and sessions.

NLAH’s runtime files store the agent’s memory about the current task — what has been completed, what decisions were made, what constraints were discovered. They are the agent’s short-term working memory, but externalized so that it does not degrade with context length.

The two patterns are complementary. A production agent system using both would have AGENTS.md providing stable codebase context and NLAH runtime files providing durable task context. The context window handles only what needs to be processed right now.

For a detailed treatment of guides and sensors as a harness memory pattern, see Harness Guides and Sensors.

The Model vs Harness Debate

The Empirical Question

The central unresolved question in agent systems research is: when an agent succeeds or fails, how much of the outcome is attributable to the model and how much to the harness?

The practical stakes are significant. If the model is the dominant factor, teams should invest in the most capable available LLM, even at high cost. If the harness is the dominant factor, a strong harness on a mid-tier model could match or exceed a weak harness on the best available model — at dramatically lower inference cost. If the interaction between model and harness is the dominant factor, then both must be optimized together, and results from either alone are misleading.

The NLAH paper synthesizes a growing body of empirical evidence on this question. The findings are surprising in their consistency.

The Evidence

Finding 1: Claude Code vs. Basic ReAct — 50.7% Win Rate

One study directly compared Claude Code’s sophisticated multi-layer harness against a simple ReAct (Reason + Act) loop implementation using the same underlying model. The expectation was clear: Claude Code’s harness incorporates years of engineering iteration, specialized tool handling, retry logic, and task management. A simple ReAct loop should be no competition.

The result: Claude Code’s harness achieved a 50.7% win rate over the basic ReAct loop. This is not statistically distinguishable from random chance.

The implication is uncomfortable for teams that have invested heavily in harness sophistication. ReAct’s simplicity — reason about what to do, do it, observe the result, repeat — appears to capture most of the useful behavior that complex harnesses provide, at a fraction of the engineering overhead. Advanced features may add marginal benefit in specific edge cases while adding complexity that introduces its own failure modes.

Finding 2: METR Triframe Paradox — 14.5% Win Rate

METR (an AI safety research organization) developed Triframe: a highly specialized, carefully engineered scaffold designed for a specific category of agent task. The expectation was that specialization would produce superior performance — the scaffold was explicitly tuned for the task domain.

When Triframe was compared to a general-purpose scaffold on its target task type, Triframe achieved a 14.5% win rate. The specialized scaffold dramatically underperformed the general scaffold on the tasks it was designed for.

This is the Triframe paradox: investment in specialization produced negative returns. The most plausible explanation is that Triframe’s complexity introduced brittleness. The scaffold had so many moving parts that minor variations in task structure, model outputs, or tool behavior caused cascading failures. The general scaffold, being simpler, was more robust.

The practical lesson: complexity in harness design is not a proxy for quality. Simpler, more general scaffolds appear to be more robust across diverse task conditions.

Finding 3: Stanford Autonomous Harness Improvement

Stanford researchers ran an experiment in which an AI agent was given the task of improving its own evaluation harness — analyzing the harness’s behavior, identifying weaknesses, and rewriting components to perform better.

The autonomously improved harness significantly outperformed Claude Code on TerminalBench 2. The model being evaluated was the same in both cases. Only the harness changed.

This finding has multiple implications. First, it demonstrates that harness quality can dominate model quality — the same model produced dramatically different results under different harnesses. Second, it suggests that AI systems may be capable of improving their own scaffolding in ways that humans have not discovered through manual engineering. Third, it raises a methodological question: if an AI can improve a harness to beat a reference system, what prevents benchmark operators from doing this routinely?

What This Means for Practitioners

The three findings converge on a common conclusion: reported agent benchmark scores represent the combined performance of a model and a harness, and the two cannot currently be separated.

When a paper reports “model X achieves 78% on SWE-bench,” it is reporting the performance of model X under the specific harness used for evaluation. The same model with a different harness might achieve 62% or 91%. Without knowing the harness, the number is not interpretable for practitioners who will deploy a different harness.

NLAH addresses this directly. If harnesses are defined as natural-language contracts interpreted by a shared runtime, then two harnesses can be compared semantically. A practitioner can read both Contract sets, identify the differences, and reason about which differences are likely to affect performance on their specific task. This is impossible with opaque code-based harnesses.

The standardization NLAH provides is not just a convenience — it is the prerequisite for agent systems research to become a reproducible science.

CAR Decomposition: A Framework for Honest Reporting

The Three Components

The CAR decomposition framework, introduced in the NLAH paper, provides a structured vocabulary for reporting agent system results with full attribution.

C — Control (the Harness)

The Control component is everything the harness contributes: the rules, contracts, sensors, and guides that constrain and direct agent behavior. In NLAH terms, this includes the full Contract set, Role definitions, Stage structure, Adapters, Scripts, and Failure Taxonomy. In code-based harness terms, it includes retry logic, error handling, task decomposition strategy, state management, and any AGENTS.md or equivalent guide files.

A — Agency (the Model)

The Agency component is everything the model contributes: which LLM is used, at what version, with what temperature and sampling parameters, with what system prompt, and in what context window configuration. Two agents using the same model ID but different versions, temperatures, or system prompts have different Agency components — and their results should not be treated as equivalent.

R — Runtime (the Execution Environment)

The Runtime component is everything the execution environment contributes: what tools are available, what compute resources are allocated, what file system access the agent has, what network access is permitted, what memory architecture is in use (context only, file system, vector DB, etc.), and what the timeout and resource limits are.

Why All Three Must Be Documented

Consider the claim: “Model X achieves 78% on SWE-bench.” Under CAR decomposition, this claim is incomplete. The full claim should be:

  • C: Harness = Claude Code scaffold with default settings, no custom Contracts
  • A: Agency = Model X version 2.1, temperature 0.2, standard system prompt
  • R: Runtime = Docker container, 32GB RAM, 4 CPU, 4-hour timeout, full file system access, internet access disabled

Without all three components, the result cannot be reproduced, compared, or interpreted. A different team running the same model with a different harness will get a different result — and neither team will understand why.

The benchmark inflation risk is real. As the Stanford autonomous harness improvement finding demonstrated, it is possible to improve benchmark scores significantly by improving only the harness while holding the model constant. If harness details are not published, there is no way to detect whether a score improvement reflects model progress or harness optimization.

Applying CAR to Your Own System

When documenting your own agent system — in internal reports, external publications, or vendor comparisons — use CAR decomposition as a checklist:

| Dimension | What to Document | Why It Matters |
| --- | --- | --- |
| Control (Harness) | Contract set or equivalent rules; guide files (AGENTS.md); max iteration limits; retry policies; escalation triggers | Determines behavioral constraints regardless of model |
| Agency (Model) | Model ID and version; temperature; system prompt (or hash); context window size; sampling parameters | Determines capability ceiling for any given task |
| Runtime (Environment) | Available tools and their versions; file system access scope; memory architecture; compute limits; network access; timeout settings | Determines what the agent can physically do |

For multi-agent systems, document CAR for each agent role separately. A system with a Planner agent and an Executor agent has six components to document — not three.
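As a reporting checklist, a CAR record can be sketched as a small structure with a completeness check. The field contents below reuse the hypothetical SWE-bench example from earlier; the class itself is an illustration, not a published schema.

```python
# Sketch of a CAR record as a reporting checklist. The three dicts mirror
# the Control / Agency / Runtime dimensions in the table above; concrete
# values are the article's hypothetical example, not real results.

from dataclasses import dataclass

@dataclass
class CARRecord:
    control: dict   # harness: contracts, guide files, retry policy, ...
    agency: dict    # model: id, version, temperature, system prompt hash
    runtime: dict   # environment: tools, memory, compute, network, timeout

    def is_complete(self):
        """A result is reportable only if all three dimensions are filled in."""
        return all([self.control, self.agency, self.runtime])

result = CARRecord(
    control={"harness": "Claude Code scaffold, default settings, no custom Contracts"},
    agency={"model": "Model X 2.1", "temperature": 0.2},
    runtime={"container": "Docker, 32GB RAM, 4 CPU", "timeout_hours": 4},
)
```

For a multi-agent system, keep one `CARRecord` per Role (Planner, Executor, and so on), which makes the "six components, not three" bookkeeping explicit.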

Next Steps

The error radius and shift-left framework provides a complementary lens for thinking about harness design: how do you minimize the blast radius when an agent makes a mistake, and how do you detect mistakes as early as possible in the execution pipeline? See Agent Error Radius and Shift Left for the full treatment.

For the foundational research that NLAH builds on — the original ReAct paper that established the reason-act loop as a baseline — see ReAct: Reasoning and Acting Explained.

For a broader treatment of how agent benchmarks work and how to evaluate systems fairly — including the evaluation framework that CAR decomposition is designed to support — see Agent Evaluation and Benchmarks.

Frequently Asked Questions

Is NLAH production-ready in 2026?

The ICLR 2026 paper presents NLAH as a research contribution and experimental architecture, not a production-ready framework. The ablation results demonstrate that NL-based harnesses match code-based harnesses on performance — but the IHR’s semantic evaluation step adds latency, and the reliability of natural-language enforcement depends on the quality of the LLM interpreting the Contracts. For high-stakes production deployments, teams should treat NLAH as a design paradigm to learn from rather than a framework to adopt wholesale. The six-component architecture (Contracts, Roles, Stages, Adapters, Scripts, Failure Taxonomy) can inform the design of any harness — even one implemented in code — by making its structure explicit and auditable.

Does NLAH require a specific LLM to interpret the natural-language contracts?

No. The IHR is designed to work with any sufficiently capable LLM as its evaluation engine. In practice, the ICLR 2026 paper used a smaller, faster model for Contract enforcement — distinct from the model performing the primary task. The evaluation LLM does not need frontier-level capability; it needs reliable instruction-following and the ability to apply a rule to a proposed action. Smaller instruction-tuned models in the 7B–13B parameter range are viable IHR evaluation engines, which keeps enforcement latency manageable. The task agent and the IHR evaluation agent do not need to be the same model or even the same model family.

How does NLAH handle contract conflicts?

Contract conflicts — where two Contracts prescribe incompatible behaviors — are a real design risk in any declarative rule system. NLAH addresses this through explicit Contract priority ordering: each Contract is assigned a priority level, and when two Contracts conflict on a specific action, the higher-priority Contract takes precedence. The IHR logs the conflict and the resolution. Designing conflict-free Contract sets is part of the harness engineering discipline that NLAH introduces. The Failure Taxonomy can include a “Contract conflict” entry that triggers human review when a conflict is detected — ensuring that unresolved conflicts are surfaced rather than silently resolved by priority alone. In practice, well-designed Contract sets are organized around non-overlapping behavioral domains to minimize conflict risk.

Should I replace my code-based harness with NLAH today?

The evidence from the model-versus-harness debate suggests caution before investing heavily in any harness redesign. The 50.7% win rate for Claude Code’s sophisticated harness versus a basic ReAct loop, and the 14.5% win rate for METR’s specialized scaffold, both suggest that simpler is often better. Before considering NLAH adoption, audit your current harness: which components are actually contributing to task success? Which are adding complexity without measurable benefit? Apply CAR decomposition to document what you have. The most valuable insight from NLAH may not be the architecture itself, but the discipline it represents — making harness logic explicit, readable, and ablatable. You can apply that discipline to a code-based harness by adding documentation, structured configuration, and clear separation between control logic and execution logic, without adopting the full NLAH runtime.
