You built an agent. It works great on the first turn. Then it goes off-rails on turn three, contradicts itself on turn seven, and crashes on turn twelve.
This is not a model problem. It is a harness problem.
Most developers spend 80% of their time picking the right LLM and 20% on everything else. Harness engineering flips that ratio — and it is the reason some agents ship and most do not.
Why Models Alone Are Not Enough
A single LLM call is manageable. You send a prompt, you get a response. The non-determinism is contained: one output, one chance to go wrong.
An agent loop is a different beast entirely.
At 10 sequential steps, each step making its own LLM call, errors do not stay contained — they compound. If each step succeeds 90% of the time (already optimistic), your end-to-end success rate is:
0.9 ^ 10 ≈ 0.35
35% end-to-end success rate. With 90% per-step reliability.
Push to 20 steps — common for any non-trivial task — and you are looking at 12% success. The model has not changed. The math has changed.
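The compounding arithmetic is easy to check directly:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` sequential steps succeeds."""
    return per_step ** steps

# 90% per-step reliability collapses quickly as steps accumulate.
print(round(end_to_end_success(0.90, 10), 2))  # 0.35
print(round(end_to_end_success(0.90, 20), 2))  # 0.12
```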
This is why sequential agentic tasks fail in ways that single-shot prompting does not. Each mistake is input for the next step. A wrong file path at step 3 produces a wrong tool call at step 4, which produces a malformed result at step 5. By step 8 the agent is confidently executing a plan that became invalid four steps ago.
You need a control layer. That layer is the harness.
The error-compounding problem is not hypothetical. It shows up in every production agent deployment. Teams demo a 3-step agent that looks stunning. They extend it to 15 steps for a real use case. It collapses. They assume the model is not smart enough, switch to a more expensive model, and get a slightly prettier collapse. The problem was never the model. It was the absence of controls around the model.
The Formula: Agent = Model + Harness
Here is the formulation from harness engineering literature:
Agent = Model + Harness
Strip it down:
- Model — the LLM reasoning engine. GPT-4o, Claude, Gemini. The thing that reads context and produces tokens.
- Harness — the runtime environment that wraps the model and makes it behave reliably over many sequential steps.
Without a harness, you have an LLM. With a harness, you have an agent.
┌────────────────────────────────────┐
│              HARNESS               │
│                                    │
│   ┌──────────────────────────┐     │
│   │          MODEL           │     │
│   │  (LLM: GPT-4o, Claude)   │     │
│   └──────────────────────────┘     │
│                                    │
│  • Tool dispatching                │
│  • Context compression             │
│  • Session persistence             │
│  • Error recovery                  │
│  • Retry logic                     │
└────────────────────────────────────┘
The model handles reasoning. The harness handles everything the model cannot do itself: calling real APIs, managing memory, recovering from failures, and keeping the loop on track.
Think of it this way. A model is a brain in a jar. Brilliant, perhaps. But it cannot move, cannot act, cannot remember what happened five minutes ago unless you tell it. The harness is the body — the nervous system that connects the brain to the world and keeps it functioning over time.
Scaffold vs Harness: A Critical Distinction
These two terms get conflated constantly. They are not the same thing.
| Dimension | Scaffold | Harness |
|---|---|---|
| When active | Before first prompt | After agent starts running |
| Nature | Static | Dynamic |
| Purpose | Initial structure | Runtime orchestration |
| Examples | System prompt, AGENTS.md, role definition, initial tool list | Tool routing, context management, retry logic, checkpointing |
| Changes during run? | No | Yes, continuously |
Scaffold is what you build before the agent does anything. It defines the initial shape: who the agent is, what tools it has access to, what conventions it follows, what its task scope is. A well-written 100-line AGENTS.md file is a scaffold artifact — it gives the agent a table of contents for its own behavior before it makes a single decision.
Harness is what keeps the agent alive and on track after it starts. Every tool call the agent makes passes through the harness. Every result that comes back gets filtered by the harness. When the context window fills up, the harness decides what to keep. When a tool call fails, the harness decides whether to retry, surface the error, or abort.
The common mistake: teams spend weeks crafting a perfect scaffold and then wire the agent directly to the LLM API with no harness. The first run looks great. A real task with 15+ steps breaks everything.
A scaffold without a harness is a race car with no steering wheel. Impressive on a straight line. Dangerous anywhere else.
What a Harness Does: Three Core Functions
1. Tool Dispatching
The model does not actually call tools. It produces text that looks like a tool call. The harness intercepts that text, validates it, routes it to the right function, executes it, handles any errors, and returns structured results back into the model’s context.
Model output: { "tool": "search_web", "query": "LangChain vs LlamaIndex" }
↓
Harness: validate params → call search API → handle timeout → format results
↓
Model input: [search results injected into context]
Without this layer, a malformed tool call crashes the agent. With it, the harness can retry, reformat, or ask the model to correct its output — often without the agent ever knowing something went wrong.
Tool dispatching is also where parameter validation lives. If a model produces a tool call with a missing required field, or passes a string where an integer is expected, the harness catches it before execution. This is not optional error handling — it is the difference between an agent that degrades gracefully and one that hard-crashes mid-task with no recovery path.
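A minimal dispatcher can be sketched in a few lines. The `TOOLS` registry, tool names, and return shape here are illustrative assumptions, not the API of any specific framework:

```python
import json

# Hypothetical tool registry: name -> (function, required parameter names)
TOOLS = {
    "search_web": (lambda query: f"results for {query!r}", {"query"}),
}

def dispatch(raw_model_output: str) -> dict:
    """Validate a model-produced tool call and execute it without crashing the loop."""
    try:
        call = json.loads(raw_model_output)
        name = call["tool"]
        func, required = TOOLS[name]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Malformed call: return a structured error the model can correct from.
        return {"ok": False, "error": f"invalid tool call: {exc}"}

    args = {k: v for k, v in call.items() if k != "tool"}
    missing = required - args.keys()
    if missing:
        return {"ok": False, "error": f"missing parameters: {sorted(missing)}"}
    try:
        return {"ok": True, "result": func(**args)}
    except Exception as exc:  # tool raised at runtime: surface, don't crash
        return {"ok": False, "error": str(exc)}
```

The key property is that every branch returns a structured result. Bad JSON, an unknown tool, or a missing field becomes an error message the model can read and repair, instead of an unhandled exception that kills the run.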
2. Context Compression
Every LLM has a context window. 128K tokens sounds like a lot. A 20-step agent that reads files, calls APIs, and processes results can fill that window faster than you expect.
When the context overflows, three things happen — all bad:
- The oldest turns get truncated. The agent loses memory of decisions made early in the task.
- The agent contradicts earlier decisions because it cannot see them.
- Performance degrades as the model struggles to attend to the relevant parts of a bloated context.
The harness manages this actively. It summarizes completed steps and compresses them. It moves resolved sub-tasks to an external memory store. It keeps the active context window lean and focused on what the current step actually needs.
This is not a nice-to-have. Without context management, any agent that runs long enough will eventually fail; the only question is when.
One specific failure pattern worth naming: context poisoning. When a failed tool call returns a massive error trace — stack traces, raw HTML, binary garbage — and that result enters the context unfiltered, the model spends attention on noise instead of signal. A harness with context compression strips or summarizes noisy results before they hit the context window. This alone can make a fragile agent reliable.
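Both behaviors can be sketched simply. The character budgets below are illustrative; a real harness would count tokens and call the model to summarize, but the shape of the logic is the same:

```python
MAX_RESULT_CHARS = 500  # illustrative per-result budget

def compress_result(result: str) -> str:
    """Truncate oversized tool output so noise never floods the context window."""
    if len(result) <= MAX_RESULT_CHARS:
        return result
    dropped = len(result) - MAX_RESULT_CHARS
    return result[:MAX_RESULT_CHARS] + f"\n[... {dropped} chars truncated by harness ...]"

def trim_history(turns: list[str], budget_chars: int) -> list[str]:
    """Keep the most recent turns that fit the budget; mark the rest as compressed."""
    kept, used = [], 0
    for turn in reversed(turns):
        if used + len(turn) > budget_chars:
            kept.append(f"[summary: {len(turns) - len(kept)} earlier turns compressed]")
            break
        kept.append(turn)
        used += len(turn)
    return list(reversed(kept))
```

`compress_result` is the defense against context poisoning: a 40 KB stack trace enters the context as 500 characters plus a marker, not as 40 KB of noise.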
3. Session Persistence
Real-world tasks take time. A task that fetches data, processes it, generates a report, and posts it somewhere might run for 20 minutes. During that window:
- Rate limits can pause execution mid-task
- API timeouts can kill individual steps
- The process can crash
- The machine can restart
A harness with checkpointing saves state at meaningful intervals — completed steps, intermediate results, current position in the task plan. When a failure occurs, the agent resumes from the last checkpoint instead of starting over.
Without this, a 15-minute task that fails at minute 14 restarts from scratch. With it, the agent resumes from the minute-13 checkpoint and finishes the remaining work in a couple of minutes.
For long-horizon tasks — anything measured in hours rather than seconds — session persistence is not optional.
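A checkpointing layer can be small. This is a sketch, not a production store: the JSON-file format and the `run_plan` helper are assumptions for illustration, and the atomic-write trick is the one detail worth copying:

```python
import json
import os

class Checkpointer:
    """Persist completed-step results so a crashed run resumes instead of restarting."""

    def __init__(self, path: str):
        self.path = path

    def load(self) -> dict:
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"completed_steps": [], "results": {}}

    def save_step(self, state: dict, step_name: str, result) -> None:
        state["completed_steps"].append(step_name)
        state["results"][step_name] = result
        # Write to a temp file then rename, so a crash mid-write
        # cannot corrupt the checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

def run_plan(steps, ckpt: Checkpointer) -> dict:
    """Execute (name, func) steps in order, skipping any already checkpointed."""
    state = ckpt.load()
    for name, func in steps:
        if name in state["completed_steps"]:
            continue  # already done in a previous run
        ckpt.save_step(state, name, func())
    return state
```

On a second invocation after a crash, `run_plan` reloads the checkpoint and skips every completed step, which is exactly the resume-at-minute-13 behavior described above.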
Two Types of Control: Feedforward and Feedback
A well-designed harness uses two control loops simultaneously. This mirrors a principle from cybernetics: a controller must match the variety of the system it governs (Ashby’s Law). LLM-based agents produce high variety — they can do almost anything, which means they can go wrong in almost any way. The harness must match that variety with layered controls.
Feedforward controls (Guides) steer the agent before it acts. They shape behavior upstream of the decision:
- AGENTS.md with task conventions and constraints
- System prompts that frame the agent’s role
- Task decomposition that structures the problem before the loop starts
- Input validation that catches bad data before it enters the context
Feedback controls (Sensors) correct the agent after it acts. They observe outputs and feed results back:
- Linter errors returned after a code generation step
- Test results that tell the agent whether its code runs
- Type checker output that catches type errors in generated functions
- Structured result validation that flags malformed tool outputs
[Task Input]
↓
[Feedforward: Guides] ←── AGENTS.md, conventions, task framing
↓
[Model Decision]
↓
[Action Execution]
↓
[Feedback: Sensors] ←── linter, tests, type checks, validators
↓
[Next Model Input] ←── results + corrections injected into context
The best harnesses run both loops at every step. Feedforward reduces the probability of bad decisions. Feedback catches and corrects the bad decisions that slip through anyway.
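One iteration of the loop diagram above can be sketched as follows. The `model` and `execute` callables and the single lint-style sensor are stand-ins; a real harness would run linters, tests, and validators here:

```python
# Feedforward: static guides injected upstream of every decision.
GUIDES = "Follow AGENTS.md conventions. Prefer small, reversible changes."

def run_sensors(output: str) -> list[str]:
    """Feedback: observe the action's output and collect corrections."""
    findings = []
    if "TODO" in output:
        findings.append("lint: unresolved TODO left in output")
    return findings

def harness_step(model, execute, task: str, context: list[str]) -> list[str]:
    prompt = "\n".join([GUIDES, *context, task])  # guides steer before the decision
    action = model(prompt)                        # model decides
    output = execute(action)                      # harness executes the action
    corrections = run_sensors(output)             # sensors observe after the fact
    # Results plus corrections become input for the next step.
    return context + [output, *corrections]
```

Note that the sensor findings are not logged and forgotten; they are appended to the context, so the model sees its own mistakes at the next decision point.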
For a deep treatment of guides and sensors, see Harness Guides and Sensors: Controlling AI Agent Behavior.
Why Harness Quality Beats Model Choice
Here is a counterintuitive research finding: in controlled evaluations, a stronger model running in a basic ReAct loop (minimal harness) won only 50.7% of comparisons against a weaker model running in the Claude Code harness (a sophisticated harness). Statistically insignificant.
What does that mean? A better harness on a weaker model can match or outperform a stronger model running in a worse harness.
This has direct practical consequences:
| Scenario | Outcome |
|---|---|
| GPT-4o + no harness | Unreliable. Fails on multi-step tasks. |
| GPT-3.5 + strong harness | Consistent. Handles 20-step tasks. |
| Claude + strong harness | Reliable and capable. |
| GPT-4o + strong harness | Best results. |
The model is the ceiling. The harness is the floor. You can have a very high ceiling and a very low floor. Your agents will cluster around the floor.
OpenAI’s own production agents do not rely on elaborate 10,000-line orchestration systems. Their backbone is an AGENTS.md file of roughly 100 lines — a compact, well-structured scaffold that gives the agent clear behavioral guidelines. The harness does the runtime work. The scaffold gives it direction.
Martin Fowler’s assessment: “The harness is where most of the actual engineering happens in agentic systems.” Not the prompt. Not the model choice. The harness.
This has a practical implication for teams evaluating LLMs: benchmark results are almost always measured with the same harness across models. A model that scores 10 points higher on SWE-bench was evaluated with a controlled harness. In production, with a weak harness, that 10-point gap collapses. You are not buying model intelligence — you are buying model potential. The harness determines how much of that potential you can actually use.
The Car Analogy
The model is the engine. A V8, maybe. Powerful. Expensive to upgrade.
The harness is everything else that makes the car drivable:
Engine alone:
✓ Generates power
✗ No steering
✗ No brakes
✗ No dashboard
→ Dangerous
Engine + chassis + controls:
✓ Generates power
✓ Steerable
✓ Stoppable
✓ Observable
→ Drivable
You can put a V8 engine in a car with no brakes. It will be very fast. It will also be very dangerous. Nobody will use it for anything serious.
Good harness engineering is what makes an LLM drivable — controllable, observable, and safe to run on real tasks with real consequences.
The analogy also explains why model upgrades disappoint teams that skip harness work. Swapping a V6 for a V8 in a car with no brakes does not make the car safer. It makes it faster and more dangerous. The right fix is to install brakes, not upgrade the engine.
Frequently Asked Questions
Is the harness the same as the prompt?
No. The prompt (specifically the system prompt) is part of the scaffold — the static structure that initializes the agent before it runs. The harness is the dynamic runtime that operates after the agent starts. Prompts tell the model who it is and what it should do. The harness actually does the work of routing tool calls, managing context, handling errors, and persisting state across steps. You can have a perfect prompt and a broken harness — and the agent will fail on any non-trivial task.
Do frameworks like LangChain or CrewAI provide a harness automatically?
Partially. Frameworks provide harness infrastructure — tool call routing, chain execution, agent loops. But they do not make harness engineering decisions for you. You still need to decide how to handle context overflow, when to checkpoint state, how to recover from tool failures, and what feedback sensors to connect to the loop. A framework gives you the plumbing. Harness engineering is deciding how to run water through it.
How complex does a harness need to be?
Simpler than you think for most cases. A harness for a single-task agent might be 200 lines: a tool dispatcher, a context trimmer, and a retry wrapper. Complexity scales with task duration, the number of tools, and the stakes of failure. OpenAI's production coding agents pair their harness with roughly 100-line AGENTS.md files as the primary behavioral control. The principle is: build the minimum harness that makes your failure modes manageable. Add complexity only where a specific failure mode demands it.
When does harness complexity become a problem?
When the harness becomes harder to debug than the model’s behavior. This is a real failure mode. Overly complex harnesses hide whether a failure came from the model or the infrastructure. The symptoms: agents that work in testing and fail in production in ways you cannot reproduce; failures that appear intermittently with no clear cause; debugging sessions that take longer than writing the original agent. The antidote is to keep harness components small, testable, and independently observable. Each component — dispatcher, compressor, checkpoint manager — should be debuggable in isolation.
Next Steps
This article introduced harness engineering at a conceptual level. The next articles go deeper into each layer:
- Harness Guides and Sensors: Controlling AI Agent Behavior — a detailed treatment of feedforward and feedback control loops, with implementation patterns for each
- Prompt Engineering for AI Agents — how to write system prompts and task prompts that work with a harness rather than against it
- What Is a Multi-Agent System? — how harness engineering scales when you have multiple agents coordinating on shared tasks