Harness Engineering: The OS Layer Powering AI Agents

A harness is every piece of code, configuration, and execution logic that is not the model itself. A raw AI model is not an agent. It only becomes one when a harness gives it state, tool execution, feedback loops, and enforceable constraints. The model contains the intelligence. The harness makes that intelligence useful.

Think of a raw model as a powerful CPU with no RAM, no disk, and no way to talk to the outside world. The context window acts like RAM, fast but limited. External databases act like disk storage, large but slow. Tool integrations act like device drivers. The harness is the operating system that ties it all together.

This is not a wrapper around a prompt. It is the complete system that makes autonomous agent behavior possible.

Why We Need Harnesses

Models take in data such as text, images, or audio and produce text. That is it. Out of the box, they cannot maintain durable state across interactions, execute code, access real-time knowledge, or set up environments to complete work. These are all harness-level features.

For example, to get a product experience like chatting, we wrap the model in a loop to track previous messages and append new user messages. Everyone has already used this kind of harness. The core idea is to convert a desired agent behavior into an actual feature in the harness.
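
As a rough sketch of that chat loop, the example below keeps the message history in the harness and replays it on every turn. The names `ChatHarness` and `echo_model` are illustrative assumptions; `echo_model` is a stand-in for a real model API call so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]


def echo_model(messages: list[Message]) -> str:
    """Hypothetical stand-in for a model API call; echoes the last user message."""
    return f"You said: {messages[-1]['content']}"


class ChatHarness:
    """Keeps the conversation state that a stateless model cannot keep itself."""

    def __init__(self, call_model: Callable[[list[Message]], str]) -> None:
        self.call_model = call_model
        self.messages: list[Message] = []

    def send(self, user_text: str) -> str:
        # Append the new user message, then send the whole history to the model.
        self.messages.append({"role": "user", "content": user_text})
        reply = self.call_model(self.messages)
        # Store the reply so the next turn sees it as context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


chat = ChatHarness(echo_model)
chat.send("hello")
chat.send("what did I just say?")
print(len(chat.messages))  # 4: two user turns and two assistant replies
```

The model function never stores anything between calls. All continuity lives in `self.messages`, which is the harness.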

The OpenAI Experiment: Zero Human Code

Over five months, an OpenAI team ran an experiment. They built and shipped a software product where not a single line of code was written by a human. Every line, from application logic to tests to documentation, was written by Codex. A team of three engineers managed about 1,500 pull requests and reached roughly one million lines of code. They did this in about one-tenth of the time it would have taken to write by hand.

The lesson was clear. Human engineers did not write code. They designed the environment, clarified intent, and built feedback loops so the agent could work reliably. When progress stalled, the fix was never to try harder. It was to ask what capability the agent still lacked and how to make that capability clear and enforceable.

The Five Subsystems of Every Harness

Every effective harness has five parts that work together.

The first is instructions. These tell the agent what to do, in what order, and what to read first. The key files are AGENTS.md, CLAUDE.md, and the docs/ directory.

The second is state. This tracks what is done, what is in progress, and what comes next. It is saved to disk, so the next session picks up exactly where the last one left off. Key files include claude-progress.md, feature_list.json, and the git log.

The third is verification. Only passing checks count as evidence. The agent cannot declare victory without proof. This includes tests, linters, type-checkers, and smoke runs.

The fourth is scope. This constrains the agent to one feature at a time, preventing overreach and half-finished work.

The fifth is the session lifecycle. This handles initialization at the start, cleanup at the end, and a clean restart path for the next session.

Your Minimal Harness: Four Files to Drop In Today

You do not need a million-line system to start. A minimal harness only needs four files in your project root. Here is what that looks like:

YOUR PROJECT ROOT
├── AGENTS.md              # the agent's operating manual
├── init.sh                # runs install + verify + health check
├── feature_list.json      # what features exist, which are done
├── claude-progress.md     # what happened each session
└── src/                   # your actual code

The AGENTS.md file is the operating manual. It tells the agent what to do before starting any work, what rules to follow, and how to verify completion.

# Agent Instructions

## Before Starting Any Work
1. Run `./init.sh` to verify environment health
2. Read `claude-progress.md` for context from last session
3. Read `feature_list.json` to see what's done and what's next
4. Check `git log --oneline -10` for recent changes

## Rules
- Work on exactly ONE feature at a time
- Never declare "done" without passing tests
- Run the full test suite before committing
- Update `claude-progress.md` after every session
- Update `feature_list.json` when a feature status changes
- Commit only when the project is in a clean, resumable state

## Verification Checklist
- [ ] All tests pass
- [ ] Linter passes
- [ ] Type-check passes
- [ ] Feature works as specified

The init.sh script is the environment health check. It installs dependencies, runs tests, and checks types before any work begins.

#!/bin/bash
set -e
echo "=== Installing dependencies ==="
npm install
echo "=== Running tests ==="
npm test
echo "=== Type checking ==="
npx tsc --noEmit
echo "=== Environment healthy ==="

The feature_list.json file is a machine-readable scope. It lists every feature, its status, and where to find its tests.

{
  "features": [
    { "id": "F001", "name": "User login", "status": "done", "tests": "src/auth.test.ts" },
    { "id": "F002", "name": "Document import", "status": "in-progress", "tests": "src/import.test.ts" },
    { "id": "F003", "name": "Search", "status": "not-started", "tests": null }
  ]
}
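
Because the file is machine-readable, the harness can choose the next feature instead of leaving the choice to the model. The sketch below is one possible selection rule, not a prescribed one: the `next_feature` helper and its policy of resuming in-progress work before starting anything new are assumptions made for illustration.

```python
import json
from typing import Optional

# Same content as the feature_list.json example, inlined so the sketch runs as-is.
FEATURE_LIST = """
{
  "features": [
    {"id": "F001", "name": "User login", "status": "done", "tests": "src/auth.test.ts"},
    {"id": "F002", "name": "Document import", "status": "in-progress", "tests": "src/import.test.ts"},
    {"id": "F003", "name": "Search", "status": "not-started", "tests": null}
  ]
}
"""


def next_feature(features: list[dict]) -> Optional[dict]:
    """Return the single feature to work on, resuming in-progress work first."""
    for wanted in ("in-progress", "not-started"):
        for feature in features:
            if feature["status"] == wanted:
                return feature
    return None  # every feature is done


features = json.loads(FEATURE_LIST)["features"]
current = next_feature(features)
print(current["id"])  # F002
```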

The claude-progress.md file is session memory. It records what happened in each session so the next one can pick up where the last left off.

# Progress Log

## Session 3 — 2026-04-20
- Completed: F001 (user login) — all tests passing
- In progress: F002 (document import) — parser done, validation pending
- Blocked: none
- Next session should: finish F002 validation logic, then run full test suite

Building the Agent's Map

One of the biggest challenges is context management. Early on, the OpenAI team learned a key lesson: give the agent a map, not a 1,000-page manual. A huge instruction file wastes the context window, becomes outdated quickly, and is hard to verify. Instead, they treated AGENTS.md as a table of contents. The real knowledge lived in a structured docs/ directory inside the code repository.

This directory held design documents, execution plans, product specs, and reference materials. Plans were treated as first-class artifacts, version-controlled and centrally stored. A short AGENTS.md file pointed the agent to the right place. This is called progressive disclosure. The agent starts with a small, stable entry point and is guided to deeper information only when needed.

Here is what a minimal AGENTS.md looks like in practice:

# AGENTS.md

## Tech Stack
- Python 3.12 / FastAPI / PostgreSQL 16 / React 18

## Running Tests
- Backend: `pytest tests/ -x --tb=short`
- Frontend: `npm test -- --watchAll=false`

## Architecture
- See `/docs/architecture/` for system design
- See `/docs/api/` for endpoint contracts
- Dependency direction: data → service → api → ui (never reverse)

## Forbidden Patterns
- Do NOT use `requests` library; use `httpx` with async
- Do NOT import from `ui/` in any `services/` module
- Do NOT add dependencies without updating `/docs/deps.md`

## Common Mistakes
- Tests must run against the test DB, not production
- Always validate input shapes at API boundaries

The Agent Session Lifecycle

Every session follows the same lifecycle. The harness governs every transition. The model decides what code to write at each step.

START
  1. Agent reads AGENTS.md
  2. Agent runs init.sh (install, verify, health check)
  3. Agent reads claude-progress.md (what happened last time)
  4. Agent reads feature_list.json (what's done, what's next)
  5. Agent checks git log (recent changes)

SELECT
  6. Agent picks exactly ONE unfinished feature
  7. Agent works ONLY on that feature

EXECUTE
  8. Agent implements the feature
  9. Agent runs verification (tests, lint, type-check)
  10. If verification fails → fix and re-run
  11. If verification passes → record evidence

WRAP UP
  12. Agent updates claude-progress.md
  13. Agent updates feature_list.json
  14. Agent records what's still broken or unverified
  15. Agent commits (only when safe to resume)
  16. Agent leaves clean restart path for next session

Without the harness, step 9 becomes "agent says it looks fine." With the harness, step 9 is "tests pass, lint is clean, types check."
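
One way to make step 9 mechanical is a small gate that runs each check as a subprocess and records the result as evidence. This is a sketch under assumptions: `run_checks` and `verified` are hypothetical helper names, and the demo commands are placeholders that stand in for your real test, lint, and type-check commands.

```python
import subprocess
import sys


def run_checks(checks: dict[str, list[str]]) -> dict[str, dict]:
    """Run each check command and record its exit code and output tail as evidence."""
    evidence = {}
    for name, command in checks.items():
        result = subprocess.run(command, capture_output=True, text=True)
        evidence[name] = {
            "passed": result.returncode == 0,
            "exit_code": result.returncode,
            # Keep only the tail of the output so the progress log stays small.
            "output_tail": (result.stdout + result.stderr)[-500:],
        }
    return evidence


def verified(evidence: dict[str, dict]) -> bool:
    """A feature counts as done only if at least one check ran and all passed."""
    return bool(evidence) and all(item["passed"] for item in evidence.values())


# Placeholder commands so the sketch runs anywhere; swap in real ones.
demo_checks = {
    "tests": [sys.executable, "-c", "print('2 passed')"],
    "lint": [sys.executable, "-c", "import sys; sys.exit(1)"],
}
evidence = run_checks(demo_checks)
print(verified(evidence))  # False, because the lint placeholder exits with 1
```

The returned dictionary can be written into the progress file, so the next session sees what was actually verified and not just a claim.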

Making the App Readable to the Agent

As code output grew, the bottleneck became human quality assurance. Since human time is fixed, the team made the application itself readable to Codex. They connected Chrome DevTools so the agent could take DOM snapshots, capture screenshots, and navigate the UI. This let the agent reproduce bugs, verify fixes, and reason about interface behavior on its own.

They did the same with observability. Logs, metrics, and traces were exposed through a local stack that Codex could query using LogQL and PromQL. With this context, prompts like "make sure the service starts within 800 milliseconds" or "no span in these user journeys should exceed two seconds" became actionable tasks the agent could handle alone.

You can start small. Think about the three most common "is this actually working?" checks your team runs manually. Script them. Make the output machine-parseable. These do not need to be fancy.

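#!/bin/bash
# test-visual.sh: give agents eyes
# Uses the Playwright CLI's screenshot command, since the Puppeteer CLI has no
# screenshot subcommand. Assumes Playwright and its browsers are installed.
npx playwright screenshot http://localhost:3000 /tmp/ui-check.png
echo "Screenshot saved to /tmp/ui-check.png. Check for missing elements or layout breaks."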

#!/bin/bash
# check-health.sh: give agents a pulse check
for endpoint in /api/health /api/users /api/orders; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:8000$endpoint")
  echo "$endpoint → $STATUS"
done

Scripts like these take ten minutes to write and give agents a huge amount of signal they did not have before.
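
If you want the output to be machine-parseable, the same health check can emit JSON. The following is a sketch, assuming the same local server and endpoints as the shell script; `fetch_status` and `health_report` are hypothetical names, and the fake fetcher at the bottom exists only so the example runs without a server.

```python
import json
import urllib.error
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # assumption: same local server as the shell script
ENDPOINTS = ["/api/health", "/api/users", "/api/orders"]


def fetch_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0


def health_report(fetch: Callable[[str], int] = fetch_status) -> dict:
    """Check every endpoint and summarize the results in one JSON-friendly dict."""
    results = {path: fetch(BASE_URL + path) for path in ENDPOINTS}
    return {"healthy": all(code == 200 for code in results.values()),
            "endpoints": results}


# A fake fetcher makes the sketch runnable without a server.
report = health_report(fetch=lambda url: 500 if url.endswith("/orders") else 200)
print(json.dumps(report, indent=2))
```

A single `healthy` flag gives the agent a yes-or-no answer, and the per-endpoint codes tell it where to look next.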

Mechanical Enforcers: Rules the Agent Cannot Break

You cannot prompt an AI to "have good taste." Subjective standards must become objectively enforced rules. A bad harness uses a prompt saying "please follow clean architecture principles." A good harness uses a custom linter that instantly fails the build when services/ imports from ui/, and injects remediation instructions back into the agent's context.
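
A custom linter of this kind can be quite small. The sketch below uses Python's standard `ast` module to flag forbidden imports and attach a remediation message. It assumes top-level package names mirror the services/ and ui/ directories, and it ignores relative imports. The names `FORBIDDEN`, `REMEDIATION`, and `check_imports` are illustrative, not part of any existing tool.

```python
import ast

# Assumption: top-level package names mirror the services/ and ui/ directories.
FORBIDDEN = {"services": {"ui"}}

REMEDIATION = (
    "services/ must not import from ui/. Move the shared code into services/ "
    "or a lower layer, then import it from there."
)


def check_imports(source: str, package: str, filename: str = "<source>") -> list[str]:
    """Return one violation message per forbidden absolute import in source."""
    banned = FORBIDDEN.get(package, set())
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            modules = [node.module or ""]
        else:
            continue  # not an import, or a relative import this sketch skips
        for module in modules:
            if module.split(".")[0] in banned:
                violations.append(
                    f"{filename}:{node.lineno}: forbidden import '{module}'. {REMEDIATION}"
                )
    return violations


sample = "import os\nfrom ui.widgets import Button\n"
for message in check_imports(sample, package="services", filename="services/billing.py"):
    print(message)
```

In a build step, a non-empty result would fail the build, and the messages would be fed back into the agent's context as the remediation instructions.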

The enforcer does not just reject. It coaches. The agent reads the remediation, self-corrects, and tries again. The same principle applies to security and permission boundaries. A production harness needs to define what the agent is allowed to touch with mechanical rigidity. In practice, a simple default is: read-only access by default, write access only in sandbox or dev, and human approval for anything that touches production systems, secrets, or deploy pipelines.
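
That default policy can be written as a small function the harness calls before every tool invocation. This is a sketch only: the environment labels, the `Decision` enum, and the `authorize` function are assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_APPROVAL = "needs_approval"


# Assumption: every tool call is tagged with an action kind and a target environment.
SENSITIVE_TARGETS = {"production", "secrets", "deploy"}
WRITABLE_ENVIRONMENTS = {"sandbox", "dev"}


def authorize(action: str, environment: str) -> Decision:
    """Apply the default policy: reads allowed, writes only in sandbox or dev."""
    if environment in SENSITIVE_TARGETS:
        return Decision.NEEDS_APPROVAL  # a human must sign off, even for reads
    if action == "read":
        return Decision.ALLOW
    if action == "write" and environment in WRITABLE_ENVIRONMENTS:
        return Decision.ALLOW
    return Decision.DENY  # anything unrecognized is refused


print(authorize("write", "dev"))         # Decision.ALLOW
print(authorize("write", "staging"))     # Decision.DENY
print(authorize("read", "production"))   # Decision.NEEDS_APPROVAL
```

The final `return Decision.DENY` makes the policy fail closed: an action the rules do not recognize is refused, not silently allowed.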

Here is a hook script that runs formatting and type checks when the agent stops. On success, it is silent. On failure, only the errors are surfaced, and the exit code tells the harness to re-engage the agent so it fixes them before finishing.

#!/bin/bash
cd "$CLAUDE_PROJECT_DIR"

# prebuild generates types and builds internal SDK packages so typecheck has
# everything it needs. runs bun install afterward to pick up any new generated files.
# the braces make 2>&1 capture stderr from all three commands, not just bun install.
PREBUILD_OUTPUT=$( { bun run generate-cache-key && turbo run build --filter=@humanlayer/hld-sdk && bun install; } 2>&1 )
if [ $? -ne 0 ]; then
  echo "prebuild failed:" >&2
  echo "$PREBUILD_OUTPUT" >&2
  exit 2
fi

# biome and typecheck run in parallel to keep the feedback loop tight.
# one quirk: biome --write exits with code 1 if it made any changes, even if it
# successfully fixed everything. so we run it twice with ||: if the first pass
# makes changes and exits 1, the second pass will exit 0 since there's nothing
# left to fix. if there are unfixable errors, both passes fail and exit 2.
OUTPUT=$(bun run --parallel \
  "biome check . --write --unsafe || biome check . --write --unsafe" \
  "turbo run typecheck" 2>&1)

if [ $? -ne 0 ]; then
  echo "$OUTPUT" >&2
  exit 2
fi

The Core Components of a Harness

A production harness has several distinct parts that work together. The orchestration loop is the heartbeat. It runs a cycle of thought, action, and observation. The loop assembles the prompt, calls the model, parses the output, executes any tool calls, feeds results back, and repeats until the task is done.

Here is the minimal loop in code:

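import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# MODEL, SYSTEM, TOOLS, and TOOL_HANDLERS come from your application: a model
# name, a system prompt, a list of tool schemas, and a dict mapping each tool
# name to a Python callable.

def agent_loop(messages):
    while True:
        # Think: send the system prompt, history, and tool schemas to the model.
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": SYSTEM}] + messages,
            tools=TOOLS,
            max_completion_tokens=4096,
        )
        msg = response.choices[0].message
        assistant = {"role": "assistant", "content": msg.content}
        if msg.tool_calls:
            # Store tool calls as plain dicts so the history stays serializable.
            assistant["tool_calls"] = [tc.model_dump() for tc in msg.tool_calls]
        messages.append(assistant)

        # No tool calls means the model has finished; hand back its answer.
        if not msg.tool_calls:
            return msg.content

        # Act and observe: run each requested tool and feed the result back.
        for tool_call in msg.tool_calls:
            args = json.loads(tool_call.function.arguments)
            output = TOOL_HANDLERS[tool_call.function.name](**args)
            messages.append({"role": "tool",
                             "tool_call_id": tool_call.id,
                             "content": str(output)})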

Tools are the agent's hands. They are defined as schemas with names, descriptions, and parameter types, injected into the model's context so it knows what is available. The tool layer handles registration, validation, sandboxed execution, and formatting results back into readable observations.
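
A tool layer along these lines might look like the sketch below. It registers each tool's schema next to its handler, checks arguments against the handler's signature, and always returns text the model can read. The `register_tool` decorator and `execute_tool` function are hypothetical names, the schema shape follows the function-tool format used by the Chat Completions API, and sandboxing is deliberately left out.

```python
import inspect
import json
from typing import Any, Callable

TOOLS: list[dict] = []                              # schemas shown to the model
TOOL_HANDLERS: dict[str, Callable[..., Any]] = {}   # tool name -> Python function


def register_tool(description: str, parameters: dict) -> Callable:
    """Decorator that records a tool's schema and its handler under one name."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS.append({
            "type": "function",
            "function": {"name": func.__name__, "description": description,
                         "parameters": parameters},
        })
        TOOL_HANDLERS[func.__name__] = func
        return func
    return decorator


def execute_tool(name: str, raw_arguments: str) -> str:
    """Validate and run a tool call, always returning text the model can read."""
    if name not in TOOL_HANDLERS:
        return f"error: unknown tool '{name}'"
    try:
        args = json.loads(raw_arguments)
        # bind() raises TypeError for missing or unexpected argument names.
        inspect.signature(TOOL_HANDLERS[name]).bind(**args)
        return str(TOOL_HANDLERS[name](**args))
    except Exception as error:  # report the failure instead of crashing the loop
        return f"error: {type(error).__name__}: {error}"


@register_tool(
    description="Add two integers.",
    parameters={"type": "object",
                "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
                "required": ["a", "b"]},
)
def add(a: int, b: int) -> int:
    return a + b


print(execute_tool("add", '{"a": 2, "b": 3}'))  # 5
print(execute_tool("add", '{"a": 2}'))          # an error string, not an exception
```

Returning errors as strings matters: the model sees what went wrong and can retry with corrected arguments.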

Memory operates at multiple timescales. Short-term memory is the conversation history within a single session. Long-term memory persists across sessions through files like AGENTS.md or structured memory stores.

Context management is where many agents fail silently. Model performance degrades when the context window fills up with noise. Strategies include compaction, which summarizes history when approaching limits, and just-in-time retrieval, which loads data dynamically rather than dumping everything into the prompt at once.
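
Compaction can be sketched in a few lines. The version below is illustrative only: the four-characters-per-token estimate is a rough assumption, `naive_summary` stands in for a model-written summary, and the `budget` and `keep_recent` values are arbitrary.

```python
from typing import Callable

Message = dict[str, str]


def estimate_tokens(messages: list[Message]) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def compact(messages: list[Message], summarize: Callable[[list[Message]], str],
            budget: int, keep_recent: int = 4) -> list[Message]:
    """Replace older messages with a summary once the history exceeds the budget."""
    if estimate_tokens(messages) <= budget or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + summarize(older)}
    return [summary] + recent


def naive_summary(messages: list[Message]) -> str:
    """Hypothetical stand-in for a model-written summary."""
    return f"{len(messages)} earlier messages omitted."


history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
compacted = compact(history, naive_summary, budget=500)
print(len(history), "->", len(compacted))  # 10 -> 5
```

A real implementation would also need to keep each tool call together with its result when choosing where to cut, which this sketch does not attempt.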

State management tracks progress across sessions. Some systems use typed dictionaries flowing through graph nodes. Others use git commits as checkpoints and progress files as structured scratchpads.

Error handling matters because a 10-step process with 99 percent success per step still has only about 90 percent end-to-end success due to compounding failures. Good harnesses distinguish transient errors, recoverable errors, and errors that need human input.
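
The arithmetic behind that figure is simple multiplication. The snippet below assumes steps fail independently, which is a simplification, and shows how quickly reliability falls as the chain grows.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming steps fail independently."""
    return per_step ** steps


for steps in (10, 50, 100):
    print(steps, round(end_to_end_success(0.99, steps), 3))
# 10 0.904
# 50 0.605
# 100 0.366
```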

Guardrails and safety enforce boundaries. In a multi-agent setup, input guardrails run on the input to the first agent, output guardrails run on the final result, and tool guardrails run on every invocation. A tripwire mechanism can halt the agent immediately when triggered.

Verification loops separate toy demos from production agents. Options include rule-based feedback like tests and linters, visual feedback like screenshots for UI tasks, and using a separate model as a judge. Giving the model a way to verify its own work can improve quality by two to three times.

Subagent orchestration handles complex tasks by splitting them across multiple agents. Common patterns include forked copies of the agent, teammate agents that communicate through files, and isolated worktrees with their own branches.

How the Cycle Works

Here is what one full cycle looks like. First, the harness constructs the full input from the system prompt, tool schemas, memory files, conversation history, and the current user message. Important context is placed at the beginning and end of the prompt because models tend to attend more reliably to those positions than to the middle.

Second, the assembled prompt goes to the model API. The model generates output tokens, which could be text, tool call requests, or both.

Third, the harness classifies the output. If there are no tool calls, the loop ends. If tool calls are requested, the harness moves to execution. If a handoff was requested, it updates the current agent and restarts.

Fourth, for each tool call, the harness validates arguments, checks permissions, executes in a sandboxed environment, and captures results. Read-only operations can run at the same time. Operations that change state run one after another.
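
The scheduling rule in this step can be sketched with a thread pool. The example assumes each tool declares whether it is read-only; `ToolCall` and `execute_batch` are hypothetical names. It also simplifies by running all reads in a batch before any writes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, NamedTuple


class ToolCall(NamedTuple):
    name: str
    run: Callable[[], Any]
    read_only: bool  # assumption: each tool declares whether it changes state


def execute_batch(calls: list[ToolCall]) -> dict[str, Any]:
    """Run read-only calls concurrently, then state-changing calls in order."""
    results: dict[str, Any] = {}
    reads = [call for call in calls if call.read_only]
    writes = [call for call in calls if not call.read_only]

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {call.name: pool.submit(call.run) for call in reads}
        for name, future in futures.items():
            results[name] = future.result()

    for call in writes:  # one after another, in the order they were requested
        results[call.name] = call.run()
    return results


log: list[str] = []
batch = [
    ToolCall("read_a", lambda: "contents of a", read_only=True),
    ToolCall("write_1", lambda: log.append("first"), read_only=False),
    ToolCall("read_b", lambda: "contents of b", read_only=True),
    ToolCall("write_2", lambda: log.append("second"), read_only=False),
]
results = execute_batch(batch)
print(log)  # ['first', 'second']
```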

Fifth, tool results are formatted as messages that the model can read. Errors are caught and returned so the model can self-correct.

Sixth, results are appended to the conversation history. If the context window is nearly full, the harness triggers compaction.

Seventh, the loop returns to step one and repeats until termination.

Termination conditions are layered. The loop stops when the model produces a response with no tool calls, the maximum turn limit is hit, the token budget runs out, a guardrail fires, the user interrupts, or a safety refusal is returned. A simple question might take one to two turns. A complex task can chain dozens of tool calls across many turns.
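
One way to express layered termination is a single function the loop calls after every turn. In this sketch the `LoopState` fields, the default limits, and the reason strings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopState:
    turns: int = 0
    tokens_used: int = 0
    last_response_had_tool_calls: bool = True
    guardrail_tripped: bool = False
    user_interrupted: bool = False
    model_refused: bool = False


def stop_reason(state: LoopState, max_turns: int = 50,
                token_budget: int = 200_000) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    # Safety-related stops are checked first so they win over normal completion.
    if state.guardrail_tripped:
        return "guardrail"
    if state.user_interrupted:
        return "user_interrupt"
    if state.model_refused:
        return "safety_refusal"
    if state.turns >= max_turns:
        return "max_turns"
    if state.tokens_used >= token_budget:
        return "token_budget"
    if not state.last_response_had_tool_calls:
        return "completed"
    return None


print(stop_reason(LoopState(turns=3)))                                   # None
print(stop_reason(LoopState(turns=3, last_response_had_tool_calls=False)))  # completed
```

Recording the reason, not just the fact that the loop stopped, lets the harness tell a finished task apart from one that ran out of budget.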

The Real Lesson

Two products using the exact same model can perform very differently based solely on harness design. LangChain demonstrated this when they changed only the infrastructure around their model and jumped from outside the top 30 to rank 5 on a coding benchmark. A separate research project hit a 76 percent pass rate by having a model optimize the infrastructure itself, beating hand-designed systems.

The harness is not a solved problem. It is where the hard engineering lives: managing context as a scarce resource, designing verification loops that catch failures before they compound, building memory systems that provide continuity without hallucination, and making bets about how much scaffolding to build versus how much to leave to the model.

The next time an agent fails, do not blame the model. Look at the harness.

Resources

OpenAI — Harness Engineering: Leveraging Codex in an Agent-First World
LangChain — The Anatomy of an Agent Harness

The Invisible Layer: How Harness Engineering Is Becoming the Operating System for AI Agents