Intermediate · Harness · 20 min read

Harness Guides and Sensors: Controlling AI Agent Behavior

#harness #guides #sensors #feedforward #feedback #linter #fitness-function #cybernetics

Every control system — from a thermostat to a self-driving car — operates on some version of the same architecture: something that steers behavior before an action occurs, and something that measures results after. Harness engineering for AI agents follows the same pattern. Understanding this two-loop structure is the foundation of building agents that don’t drift, hallucinate, or silently produce broken output.

This article breaks down the two core mechanisms of a harness — Guides and Sensors — and shows you how each one works, where it belongs in your system, and how they compose into a production-ready control loop.

The Two Control Loops

Before diving into specifics, it helps to see the architecture as a whole.

GOAL → [GUIDES] → Agent Action → [SENSORS] → Observed Result
          ↑                            |
          |_____Self-correction________|

Guides operate in the feedforward loop. They shape what the agent does before it takes any action. A well-written guide means the agent arrives at the right answer on the first attempt, without needing correction at all.

Sensors operate in the feedback loop. They observe what the agent actually produced, compare it against a standard, and — crucially — deliver structured information back to the agent so it can self-correct.

The analogy is physical:

  • Guides are like a steering wheel. You set the direction before you enter the turn, not while you are already going off the road.
  • Sensors are like ABS brakes. They activate after wheel lock is detected, reading conditions in real time and making corrections to restore control.

The best harnesses run both loops continuously and simultaneously. A strong guide reduces how often sensors need to fire. Strong sensors catch the cases that guides miss. Neither alone is sufficient at production scale.

Guides: Steering Before the Agent Acts

Guides are everything you put in front of the agent — before it generates output — that shapes what it will attempt and how. The goal of a guide is simple: maximize the probability that the agent produces correct output on the first try.

What Counts as a Guide

Guides take many forms. Not all of them are obvious.

System prompts and AGENTS.md files are the most explicit guides. They contain behavioral rules, conventions, and constraints that the agent reads before any task begins. If you want the agent to never import a deprecated library, that rule belongs in a guide.

Task framing is a guide. How you describe the work to be done shapes what the agent attempts. “Refactor this function to be more readable” and “Extract a helper function from lines 12–18 and add a docstring” describe the same goal but produce very different agent behavior.

Convention files are guides. Code style guides, API design principles, naming conventions — any document that defines the expected shape of good output acts as a feedforward control. If the agent can read your .eslintrc before it writes code, it is more likely to write code that passes linting without correction.

Bootstrapping instructions are guides. Instructions like “always check if a helper function already exists before creating a new one” or “search the codebase for existing patterns before introducing a new pattern” front-load reasoning that the agent would otherwise skip.

The AGENTS.md Pattern

OpenAI’s internal production harness uses a file called AGENTS.md as its primary guide. The implementation detail that matters most is not that the file exists, but how it is structured: approximately one hundred lines, formatted as a table of contents.

Each entry in the file addresses one topic and provides a first-line summary plus a pointer to deeper documentation. The agent reads the overview, determines which subtopics are relevant to the current task, and retrieves fuller context on demand. This is intentionally different from a ten-thousand-line system prompt.

The reason the shorter file works better is not intuitive until you see agents fail with long prompts. Language models have finite attention. Instructions buried three thousand tokens into a system prompt receive dramatically less weight than instructions at the top. An agent that receives a monolithic ten-thousand-line guide will effectively ignore most of it, selectively attending to the parts near the beginning and end.

The table-of-contents approach solves this by putting navigation at the top and deferring detail until the agent needs it. The agent’s first action with any new task is to consult the index, identify what is relevant, and then retrieve the relevant detail. This mirrors how an experienced human engineer uses documentation: they know where to look, not what every page says.

What to put in AGENTS.md:

  • Project structure overview (where key directories and files live)
  • Key conventions (“all database access goes through the repository layer”)
  • Ordering rules (“always run tests before committing”)
  • Known pitfalls (“this codebase uses a custom logger — do not use print statements”)
  • Links or paths to deeper specification documents

What not to put in AGENTS.md:

  • Exhaustive API documentation (the agent can retrieve that)
  • Long explanations of how the codebase history evolved
  • Anything that does not change how the agent behaves on the current task
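
To make the table-of-contents structure concrete, here is a sketch of what a short AGENTS.md might look like. All paths, file names, and conventions below are hypothetical placeholders, not a prescribed layout:

```markdown
# AGENTS.md — project index (illustrative; all paths are hypothetical)

## Project structure
- `src/api/` holds HTTP handlers; `src/db/` holds the repository layer. Details: docs/architecture.md.

## Key conventions
- All database access goes through the repository layer, never directly from handlers. Details: docs/db-conventions.md.
- Use the custom logger in `src/lib/log.py` — do not use print statements.

## Ordering rules
- Always run the test suite before committing.

## Known pitfalls
- `utils/legacy.py` is deprecated; do not import from it.
```

Each entry is one line of summary plus a pointer, so the agent can scan the whole index cheaply and fetch detail only when a topic is relevant to its current task.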

Guide Strength Ranking

Not all guides carry equal weight. The following table ranks guide types by their practical influence on agent behavior, from strongest to weakest.

| Guide Type | Strength | Why |
| --- | --- | --- |
| System prompt (early in context) | Highest | Agent reads first; anchors all subsequent decisions |
| AGENTS.md with task-specific rules | High | Persistent, structured, easily updatable across sessions |
| Task description framing | Medium | Shapes initial interpretation of the goal |
| Few-shot examples in prompt | Medium | Demonstrates desired behavior concretely |
| Implicit naming conventions | Low | Agent may not notice or prioritize these |

Position in context window matters enormously. Instructions that appear early — especially in the system prompt — are weighted most heavily. Instructions buried late in a long context lose influence. This is not a bug; it is a property of how attention works. Design your guides accordingly.

Common Guide Mistakes

Too long. The practical limit for effective instruction absorption is roughly two thousand tokens of dense prose. Beyond that, the agent’s adherence to specific rules degrades noticeably. If your guide has grown past this threshold, restructure it as a table of contents pointing to sub-documents.

Too vague. “Write clean, maintainable code” is not a guide — it is a preference without specification. A guide must define what “clean” means in your specific context: function length limits, naming patterns, error handling requirements, import restrictions. The more concrete the rule, the more reliably the agent follows it.

Contradictory. Two guides that conflict produce arbitrary behavior. “Always add comprehensive error handling” paired with “keep all functions under twenty lines” will create situations where the agent must choose one or the other. Resolve conflicts in the guide itself, or specify a priority order when conflicts arise.

Missing. The absence of a guide is not neutral — it is a decision to let the agent make all choices from its pretraining defaults. For most production use cases, pretraining defaults produce inconsistent output that does not match your codebase conventions. Some guide is almost always better than none.

Sensors: Correcting After the Agent Acts

Sensors observe what the agent produced and deliver structured feedback that enables self-correction. The key design goal for a sensor is not just that it detects a problem — it is that the feedback it generates is immediately actionable by an LLM.

What Counts as a Sensor

Linter output is a sensor. When a linter reports “Line 42: undefined variable ‘foo’”, that message contains a file path, a line number, and a description of the problem. A language model can parse that format directly and produce a targeted fix.

Type checker errors are sensors. TypeScript’s compiler output and mypy’s error messages both follow consistent formats that LLMs have been trained on extensively. A type error that specifies which variable has the wrong type, on which line, with what expected type, gives the agent exactly what it needs to self-correct.

Test runner results are sensors. “5 tests passed, 2 failed” followed by the failure messages and expected vs. actual values is structured feedback that directly maps to code changes the agent should make.

Build output is a sensor. Compilation errors with file references and line numbers are often the fastest path from broken to working — the agent can iterate on compilation errors extremely quickly.

Runtime logs and stack traces are sensors. The actual error output from running code tells the agent what happened at execution time, which is often different from what static analysis can catch.

UI snapshots are sensors. When Chrome DevTools Protocol captures a DOM snapshot or screenshot, the agent can compare the actual rendered state against the intended state. This is the only practical sensor for catching visual layout bugs autonomously.

Metric query results are sensors. Querying a monitoring system to verify that “service start time is under 800ms” and receiving a concrete measurement is a fitness function sensor — it tells the agent whether the system meets a performance target.

Why Linter Output Is the Strongest Sensor

Among all sensor types, linter output consistently delivers the highest return for effort invested. Three properties make it exceptional.

Format alignment. LLMs have processed enormous amounts of linter output during training. The format — file path, line number, rule name, description — is deeply familiar. The agent does not need to interpret a novel format; it pattern-matches immediately.

Determinism. The same code always produces the same linter output. This consistency means the agent can predict whether its correction will resolve the issue. Non-deterministic sensors make self-correction harder because the agent cannot verify its reasoning.

Zero marginal LLM cost. A linter runs on CPU in under a second. Feeding that output back to the agent costs only the tokens in the linter message, not an additional LLM call. Compare this to an inferential sensor that requires running a second LLM to evaluate the first LLM’s output.

If you are building your first harness and can only add one sensor, add a linter and pipe its output back to the agent.
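
The reason this works so well is visible in the format itself: a linter message decomposes cleanly into path, line, rule, and description. As a minimal sketch (assuming flake8/ruff-style `path:line:col: CODE message` output — not tied to any particular harness API), parsing and re-presenting that output to the agent takes a few lines:

```python
import re

# Assumed flake8/ruff-style format: "path:line:col: CODE message".
# Illustrative only; adapt the pattern to your linter's actual output.
LINT_LINE = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<col>\d+): (?P<rule>\S+) (?P<message>.+)$"
)

def parse_lint_output(raw: str) -> list[dict]:
    """Turn raw linter text into structured findings an agent can act on."""
    findings = []
    for line in raw.splitlines():
        match = LINT_LINE.match(line.strip())
        if match:
            findings.append(match.groupdict())
    return findings

def format_feedback(findings: list[dict]) -> str:
    """Render findings as the structured feedback appended to the next prompt."""
    if not findings:
        return "Linter passed with no findings."
    return "\n".join(
        f"{f['path']} line {f['line']}: [{f['rule']}] {f['message']}"
        for f in findings
    )
```

In practice you often do not even need the parsing step — piping the raw linter text back verbatim is frequently enough, precisely because the model already knows the format.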

Computational vs Inferential Sensors

Sensors fall into two categories with very different cost and capability profiles.

| Sensor Type | Examples | Deterministic | Cost | Speed | What It Catches |
| --- | --- | --- | --- | --- | --- |
| Computational | Linter, type checker, unit tests | Yes | Cheap | Fast (< 5 seconds) | Format errors, type violations, logic bugs covered by tests |
| Inferential | AI code review, LLM-as-judge | No | Expensive | Slow (10–60 seconds) | Semantic quality, design patterns, intent alignment, security nuance |

Computational sensors are the backbone of a harness. They are cheap enough to run on every change the agent makes. They catch the majority of errors — studies on developer workflows consistently show that linting and type checking alone catch sixty to eighty percent of bugs before testing. Gate agent output on computational sensors first.

Inferential sensors provide semantic judgment that computational tools cannot. A linter cannot tell you whether a function’s design is too complicated, whether an API is likely to be confusing to callers, or whether a security fix actually addresses the root cause rather than just the symptom. For these judgments, you need an LLM-as-judge: a second model call that evaluates the first model’s output against a rubric.

The practical rule: run computational sensors on every agent action. Run inferential sensors on significant changes — large refactors, new module introductions, security-sensitive code paths — not on every iteration.

Sensor Placement Strategy

Where you place sensors in the agent’s action loop determines how much correction cost you pay and how quickly errors are caught.

Front-load cheap sensors. A linter that catches a syntax error in under one second prevents the agent from executing a failing test suite for thirty seconds, then running a build for two minutes, then failing a deployment. Cheap sensors at the front of the pipeline save all the downstream cost.

Layer sensors by cost. Run linting first, type checking second, unit tests third, integration tests fourth, inferential review only on significant changes. Each layer only runs if the previous layers pass. This keeps the total cost of sensor evaluation low while maintaining comprehensive coverage.

Never run inferential sensors on every keystroke. An LLM-as-judge call costs real money and takes real time. If your harness triggers an AI code review on every line the agent writes, you will burn through budget quickly and introduce latency that makes the agent feel sluggish. Reserve inferential sensors for checkpoints in the workflow, not for continuous evaluation.
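
The layering logic above can be sketched as a small dispatcher. This is an illustrative skeleton, not a specific harness API: the sensor callables and their names are assumptions, and each is expected to return a pass/fail flag plus structured feedback text:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SensorResult:
    passed: bool
    feedback: str  # structured, actionable text returned to the agent

def run_layered_sensors(
    layers: list[tuple[str, Callable[[], SensorResult]]],
) -> SensorResult:
    """Run sensors in cost order; stop at the first failing layer.

    `layers` is ordered cheapest-first, e.g.
    [("lint", ...), ("typecheck", ...), ("unit tests", ...), ("review", ...)].
    """
    for name, sensor in layers:
        result = sensor()
        if not result.passed:
            # A cheap failure short-circuits the expensive downstream layers,
            # so a one-second lint error never triggers a two-minute build.
            return SensorResult(False, f"[{name}] {result.feedback}")
    return SensorResult(True, "All sensor layers passed.")
```

Because later layers only run when earlier ones pass, the expensive sensors at the end of the list fire only on output that has already cleared the cheap gates.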

Cybernetics and Fitness Functions

The guide-and-sensor framework has a theoretical foundation in cybernetics — the study of regulatory systems. Understanding this foundation helps you design harnesses that are robust rather than brittle.

Ashby’s Law of Requisite Variety

William Ross Ashby, one of the founding theorists of cybernetics, articulated what became known as the Law of Requisite Variety: a controller must possess at least as many distinct states as the system it controls in order to fully regulate it.

Applied to AI agents: a codebase has near-infinite possible states. Files can be added, removed, or modified in countless combinations. Dependencies can be introduced, removed, or versioned in any direction. Configuration values can be set to any value. An agent with unrestricted freedom can produce near-infinite variations in that codebase — some of them correct, many of them subtly broken.

A harness that can only detect binary pass/fail (the code either builds or it doesn’t) has far fewer states than the space of possible agent outputs. It will fail to regulate most of the ways an agent can drift from desired behavior.

Fitness functions are the harness’s answer to Ashby’s law. Each fitness function adds new states to the harness — new dimensions along which agent output can be evaluated. More fitness functions mean more harness variety, which means more ability to regulate agent behavior toward correct outcomes.

Fitness Functions as Automated Quality Gates

A fitness function is an automated check that evaluates whether the agent’s output meets a specific quality criterion. Unlike a simple test that verifies correctness of a specific function, a fitness function evaluates architectural, structural, or operational properties of the output as a whole.

Architecture fitness functions enforce structural constraints on the codebase. Examples: “no module should import from more than three layers up the dependency tree” or “all database queries must go through the repository layer, never from controller functions directly.” These functions catch violations that individual unit tests cannot see because they require reasoning about the whole codebase structure.

Performance fitness functions enforce measurable speed and resource constraints. Examples: “all API endpoints must respond in under two hundred milliseconds on the canonical test dataset” or “memory usage during startup must not exceed five hundred megabytes.” These functions catch regressions that only appear under load.

Logging and observability fitness functions enforce operational standards. Examples: “every service function must emit a structured log event at entry and exit” or “all database queries must include a correlation ID in their log context.” These functions ensure that when something goes wrong in production, the data needed to diagnose it already exists.

Security fitness functions enforce safety constraints. Examples: “no hardcoded credentials anywhere in the repository (regex scan)” or “all HTTP endpoints must have authentication middleware registered.” These functions catch the class of errors that are too consequential to leave to agent judgment.

The key property of a fitness function is that it generates structured output when it fails. “Function process_order in services/order.py does not log at entry (required by logging standard)” is actionable. “The code has logging problems” is not. Design your fitness functions to produce failure output in the same format as linter output: specific, located, and immediately actionable.
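
As one worked example of that property, here is a minimal sketch of a security fitness function for the hardcoded-credentials case. The regex and the failure-message format are assumptions, deliberately modeled on linter output so the result is specific and located:

```python
import re

# Illustrative credential scan — the keyword list and pattern are
# assumptions; a production scan would use a dedicated secrets scanner.
CREDENTIAL_PATTERN = re.compile(
    r"""(password|secret|api_key)\s*=\s*["'][^"']+["']""", re.IGNORECASE
)

def credential_fitness(files: dict[str, str]) -> list[str]:
    """Scan {path: source} pairs; return located, actionable failures."""
    failures = []
    for path, source in files.items():
        for lineno, line in enumerate(source.splitlines(), start=1):
            if CREDENTIAL_PATTERN.search(line):
                failures.append(
                    f"{path} line {lineno}: hardcoded credential "
                    "(required: load secrets from the environment)"
                )
    return failures
```

Each failure names a file, a line, and the rule that was violated — exactly the shape of feedback an agent can act on without further interpretation.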

The OpenAI Production Example

OpenAI’s published description of their internal agent harness illustrates how these principles combine in practice.

Per-worktree application instances mean the agent can boot the actual application from within each Git worktree and verify that it starts, runs, and behaves correctly in isolation. This is not a simulation of the runtime environment — it is the actual runtime environment, instantiated cheaply per task. The sensor is reality itself.

Chrome DevTools Protocol integration gives the agent a visual sensor. When the agent modifies frontend code, it can take a DOM snapshot or screenshot and observe what the changes actually produced in the browser. This is the only reliable way to catch layout and rendering bugs autonomously — no amount of static analysis can substitute for seeing what the browser renders.

LogQL and PromQL queries give the agent operational sensors. After a service change, the agent can query the monitoring system directly: “Did the service start in under eight hundred milliseconds?” If the answer is no, the fitness function fails, the structured failure is returned to the agent, and the agent self-corrects. The agent does not need a human to look at a dashboard — the harness can verify operational behavior autonomously.

Together, these components represent a harness that does not merely check whether code is syntactically correct. It checks whether the code actually works — boots, renders, and meets performance targets — all without human involvement.

Building Your First Harness

Understanding the theory is valuable. Getting something running is more valuable. Here is a concrete path from zero to a functional harness.

Minimal Viable Harness

A minimal viable harness requires three components. Each can be implemented in an afternoon.

Component 1: A system prompt with AGENTS.md conventions (Guide). Write a file that describes your project structure, your key conventions, and your known pitfalls. Keep it under one hundred lines. Reference it in your agent’s system prompt. This alone will eliminate the majority of the hallucinated conventions and arbitrary decisions that agents make without guidance.

Component 2: Linter output piped back to the agent (Sensor). After the agent writes code, run your linter against the output and return the results to the agent. If there are errors, include them in the next prompt. If there are no errors, confirm that the linter passed. This closes the first feedback loop and handles a large fraction of correctness issues automatically.

Component 3: Test runner output piped back to the agent (Sensor). After the linter passes, run your test suite and return the results. Failing tests, with their error messages and stack traces, give the agent the information it needs to identify what its code broke. Passing tests give the agent a concrete signal that the change is correct.

With these three components, you have a harness that guides the agent before it acts and corrects it after. Most production agents can operate effectively within this structure.
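
The three components compose into a single loop. The sketch below is a minimal, framework-agnostic illustration under stated assumptions: `call_agent` is a hypothetical function wrapping your LLM call, and each sensor is a callable returning a pass/fail flag plus feedback text (in practice these would shell out to your linter and test runner):

```python
def harness_loop(call_agent, sensors, task: str, guide: str = "",
                 max_iterations: int = 5) -> str:
    """Guide before the agent acts; sense after; feed failures back.

    `sensors` is an ordered list of (name, check) pairs, cheapest first;
    each check takes the agent's output and returns (passed, message).
    """
    # Component 1: the guide (e.g. AGENTS.md contents) frames the task.
    prompt = f"{guide}\n\nTask: {task}"
    for _ in range(max_iterations):
        output = call_agent(prompt)          # the agent acts on the guided prompt
        feedback = []
        for name, check in sensors:          # Components 2 and 3: the sensors
            passed, message = check(output)
            if not passed:
                feedback.append(f"[{name}] {message}")
                break                        # cheap failure; skip costlier sensors
        if not feedback:
            return output                    # all sensors passed
        # Close the feedback loop: structured failures become the next prompt.
        prompt = "Your previous output failed checks:\n" + "\n".join(feedback)
    raise RuntimeError("Agent did not converge within the iteration budget")
```

The iteration cap matters: without it, an agent that cannot satisfy a sensor will loop forever, and hitting the cap is itself a useful signal that the task or the guide needs human attention.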

Upgrade Path

Once your minimal viable harness is stable, extend it in order of value.

Add fitness functions as the next priority. Identify the two or three quality properties that matter most for your specific codebase — often architecture structure, performance targets, or security constraints — and write automated checks for them. Wire their output into the agent feedback loop.

Add an inferential sensor for significant changes. Configure a second LLM call that evaluates large refactors or new module introductions against a code quality rubric. Run it at commit time, not on every change.

Add UI verification if you have frontend components. Integrate a headless browser and snapshot comparison so the agent can verify visual output rather than relying purely on DOM correctness.

Add human-in-the-loop gates for risky operations. Some categories of change — database migrations, security-sensitive code, public API modifications — benefit from requiring human approval before the agent proceeds. This is not a failure of the harness; it is the harness correctly identifying its own limits.

Frequently Asked Questions

How many sensors are too many?

There is no fixed limit on the number of sensors, but there is a practical constraint: the total evaluation time must remain short enough that the agent can iterate within a reasonable time budget. If your sensor suite takes twenty minutes to run, the agent can make at most three changes per hour. If it takes thirty seconds, the agent can make one hundred and twenty changes per hour.

The right number of sensors is determined by what you can afford in evaluation time and token cost. Front-load cheap sensors (linter, type checker) that run in seconds. Add more expensive sensors (integration tests, inferential review) only at checkpoints where their benefit justifies the cost. Most production harnesses find a stable configuration with three to five computational sensors and one inferential sensor run periodically.

What should AGENTS.md contain for a Python project?

For a typical Python project, an effective AGENTS.md includes: the top-level directory layout and what each directory contains, the import conventions (where to put new modules), the error handling pattern your project uses (custom exception classes, logging format), any third-party libraries that are preferred over standard library alternatives, the testing conventions (test file location, fixture patterns), and known gotchas specific to your codebase (deprecated modules, unusual configuration).

What it should not include: complete API documentation for your libraries, detailed explanations of business logic, or historical context about why the codebase is structured the way it is. Keep the file as a navigation layer, not a reference manual.

Can I use the same harness for different LLM models?

The guide components of a harness (AGENTS.md, system prompts, task framing) are largely model-agnostic and work across different LLMs without modification. The sensor components are entirely model-agnostic — linter output is linter output regardless of which model receives it.

Where you may need to tune per-model is in the format and length of your guides. Some models respond better to bullet-point conventions; others to prose explanations. Some models can follow long guides reliably; others lose track past a certain length. Treat the guide structure as a parameter to experiment with rather than a fixed artifact. The sensors remain constant.

What’s the ROI of adding an inferential sensor?

The return on an inferential sensor depends on the error class you are trying to catch. For semantic quality — poorly designed abstractions, confusing API choices, subtle logic errors that tests don’t cover — an LLM-as-judge can catch issues that no computational sensor will detect. These are often the most expensive bugs to fix later.

The cost is real: an inferential sensor adds LLM API cost and latency to every evaluation cycle. The break-even point depends on how frequently the agent introduces semantic quality issues and how expensive those issues are to fix downstream. Most teams find that running inferential sensors on significant changes (large refactors, new modules) rather than every agent iteration gives a favorable ratio of caught issues to evaluation cost. Start with computational sensors, measure the residual error rate, and add an inferential sensor when you can identify a specific class of errors that computational tools are missing.

Next Steps

With the guide-and-sensor model in hand, the next topic to explore is how to scope what the agent is allowed to touch. Every harness should define an error radius — the boundary within which an agent's changes can have unintended effects. Narrowing the error radius through shift-left techniques is covered in Agent Error Radius and Shift-Left Testing.

For a deeper look at how guides are constructed at the prompt level, Prompt Engineering for AI Agents covers the mechanics of writing guides that reliably shape agent behavior across different task types.

If you are building agents within a specific framework context, OpenClaw Security and Sandbox shows how guide-and-sensor harness principles apply to sandboxed execution environments where the agent’s actions have real-world effects.
