Harness Engineering - Lessons learned from Coding Agents



May 20 • From Toan Vo

Introduction

For years, AI engineering was framed around models: which model to use, how to prompt it, how to fine-tune it, how to evaluate it. But as coding agents become more capable, the bottleneck is shifting. The hard part is no longer only “Can the model generate code?” The harder question is:

Can we design an environment where agents can reliably understand, act, validate, recover, and improve software systems?

1. What Is Harness Engineering?

A harness is the control system around an AI agent.

It gives the agent enough structure to do useful work and enough feedback to correct itself.

Harness engineering is the practice of designing the environment, constraints, feedback loops, tools, documentation, validation systems, and operating rules that allow AI agents to do reliable work.

The important sentence is:

Humans steer. Agents execute.

That means the engineer’s role moves upward: from implementing code to defining goals, constraints, validation loops, system structure, and feedback mechanisms.

2. The Bottleneck Is No Longer Code Generation

As code throughput increased, our bottleneck became human QA capacity.

Traditional engineering:

Problem → Human writes code → Human tests → Human reviews

Agent-first engineering:

Problem → Human defines intent + constraints → Agent executes → System validates → Agent iterates

So the high-leverage work becomes:

tools + docs + architecture + tests + observability + review loops
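The agent-first loop above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `run_with_validation`, `ValidationResult`, `toy_agent`, and `toy_validate` are hypothetical names, and the "agent" is a stub that corrects itself once it receives feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ValidationResult:
    ok: bool
    feedback: str = ""


def run_with_validation(
    task: str,
    agent: Callable[[str, List[str]], str],
    validate: Callable[[str], ValidationResult],
    max_iterations: int = 3,
) -> Optional[str]:
    """Ask the agent for a change, validate it, and feed failures back."""
    feedback_log: List[str] = []
    for _ in range(max_iterations):
        candidate = agent(task, feedback_log)
        result = validate(candidate)
        if result.ok:
            return candidate
        # The system, not a human, tells the agent what went wrong.
        feedback_log.append(result.feedback)
    return None  # Budget exhausted: escalate to a human.


def toy_agent(task: str, feedback: List[str]) -> str:
    # Stub agent (assumption): produces a buggy draft first, then a fix.
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_validate(candidate: str) -> ValidationResult:
    # Stand-in for a real test suite: run the code and check one case.
    namespace: dict = {}
    exec(candidate, namespace)
    if namespace["add"](2, 3) == 5:
        return ValidationResult(True)
    return ValidationResult(False, "add(2, 3) should return 5")
```

Calling `run_with_validation("implement add", toy_agent, toy_validate)` returns the corrected function on the second iteration. With `max_iterations=1` it returns `None`, which is the point where a human would step in.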

3. Repository Knowledge Becomes the System of Record

One of the most important ideas in harness engineering is that the repository becomes the agent’s main source of truth.

Too much guidance becomes non-guidance.

AGENTS.md should be a map or table of contents, not an encyclopedia or a 1,000-page instruction manual.

AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
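One way to keep AGENTS.md a map rather than a manual is to check it mechanically. The sketch below assumes Markdown-style links and an arbitrary line budget. `check_agents_map` and `MAX_LINES` are hypothetical names chosen for illustration, and the 150-line limit is an assumption, not a recommendation from any source.

```python
import re
from pathlib import Path
from typing import List

MAX_LINES = 150  # Hypothetical budget; tune it for your repository.

# Captures the target of a Markdown link, without any "#anchor" suffix.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)#\s]+)[^)]*\)")


def check_agents_map(repo_root: Path, max_lines: int = MAX_LINES) -> List[str]:
    """Return a list of problems found in AGENTS.md (empty means healthy)."""
    agents_file = repo_root / "AGENTS.md"
    if not agents_file.is_file():
        return ["AGENTS.md is missing from the repository root"]

    problems: List[str] = []
    text = agents_file.read_text(encoding="utf-8")

    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(
            f"AGENTS.md has {line_count} lines (limit {max_lines}); "
            "move details into docs/ and link to them"
        )

    for target in LINK_PATTERN.findall(text):
        if "://" in target:
            continue  # External URLs are out of scope for this check.
        if not (repo_root / target).exists():
            problems.append(f"AGENTS.md links to {target}, which does not exist")

    return problems
```

Run in CI, a check like this keeps the entry point short and guarantees that every pointer in the map leads to a file the agent can actually open.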

4. Agent Legibility Matters

From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist. Knowledge that lives in Google Docs, chat threads, or people’s heads is not accessible to the system.

5. Architecture Must Be Mechanically Enforced

OpenAI's article argues that documentation alone is not enough. The team encoded architecture rules into custom linters, structural tests, naming conventions, file-size limits, logging rules, schema rules, and reliability requirements.

This is one of the most important production lessons.

For agentic software development, “please follow this architecture” is weak.

Better:

The architecture is enforced by CI.
Invalid dependencies fail.
Bad structure fails.
Missing schema validation fails.
Bad logging fails.

This means the agent does not need to “remember” every rule. The system catches violations and gives remediation instructions.

That is a powerful idea:

Turn taste into tooling.
Turn architecture into tests.
Turn review comments into reusable constraints.
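To make "turn architecture into tests" concrete, here is a small structural check built on Python's standard `ast` module. The layer names and the `ALLOWED_IMPORTS` table are invented for this example, and `find_layer_violations` is a hypothetical helper, not a tool described in the articles. Each violation carries a remediation hint so the agent can fix it without remembering the rule.

```python
import ast
from typing import Dict, List, Set

# Hypothetical layering: each layer lists the layers it may import from.
ALLOWED_IMPORTS: Dict[str, Set[str]] = {
    "ui": {"service", "domain"},
    "service": {"domain"},
    "domain": set(),
}


def find_layer_violations(
    layer: str, source: str, filename: str = "<memory>"
) -> List[str]:
    """Report imports that cross layer boundaries in the wrong direction."""
    tree = ast.parse(source, filename=filename)
    allowed = ", ".join(sorted(ALLOWED_IMPORTS[layer])) or "none"
    violations: List[str] = []

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue

        for module in modules:
            target = module.split(".")[0]
            # Only imports of other known layers are subject to the rule.
            if target not in ALLOWED_IMPORTS or target == layer:
                continue
            if target not in ALLOWED_IMPORTS[layer]:
                violations.append(
                    f"{filename}:{node.lineno}: layer '{layer}' must not "
                    f"import '{module}'. Allowed layers: {allowed}. "
                    "Remediation: move the shared code into an allowed layer."
                )
    return violations
```

A CI job could call this for every file and fail the build when the returned list is non-empty. The error text then goes straight back to the agent as feedback.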

6. High-Throughput Agents Change the Merge Philosophy

Because agents can produce and fix PRs quickly, OpenAI says some conventional engineering norms become counterproductive. They use minimal blocking merge gates, short-lived PRs, and cheap follow-up fixes.

This does not mean “merge bad code.”

It means the economics change:

Low-throughput engineering:
Mistakes are expensive → block more before merge

High-throughput agentic engineering:
Corrections are cheap → merge smaller changes, fix continuously

This reminds me of a streaming system: instead of huge batch reviews, they move toward continuous correction.

7. Harnesses Must Evolve as Models Improve

Anthropic makes an important point: harnesses encode assumptions about what the model cannot do, but those assumptions can become stale as models improve. They give the example of context-reset logic that was useful for one Claude version but became unnecessary for a stronger model.

This is a big practical lesson:

Do not over-engineer permanent scaffolding around temporary model weaknesses.

A good harness should be modular. You should be able to remove parts when the model becomes better.
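One possible shape for such a modular harness is a pipeline in which every step declares the model weakness it compensates for. Everything in this sketch is an assumption made for illustration: the `HarnessStep` type, the `compensates_for` tag, and the `context_drift` label are not taken from Anthropic's or OpenAI's harnesses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Context = Dict[str, object]


@dataclass
class HarnessStep:
    name: str
    run: Callable[[Context], Context]
    # The model weakness this step exists for; "" means always required.
    compensates_for: str = ""


def build_pipeline(
    steps: List[HarnessStep], model_weaknesses: Set[str]
) -> List[HarnessStep]:
    """Keep permanent steps plus those the current model still needs."""
    return [
        step
        for step in steps
        if not step.compensates_for or step.compensates_for in model_weaknesses
    ]


def run_pipeline(pipeline: List[HarnessStep], context: Context) -> Context:
    for step in pipeline:
        context = step.run(context)
    return context


def reset_context(context: Context) -> Context:
    # Scaffolding for a weaker model: drop accumulated history.
    return {**context, "history": []}


def run_tests(context: Context) -> Context:
    # Validation stays regardless of how capable the model is.
    return {**context, "validated": True}


STEPS = [
    HarnessStep("reset_context", reset_context, compensates_for="context_drift"),
    HarnessStep("run_tests", run_tests),
]
```

With `build_pipeline(STEPS, {"context_drift"})` both steps run. With `build_pipeline(STEPS, set())`, the profile of a stronger model, only `run_tests` remains. Retiring scaffolding becomes a one-line configuration change instead of a rewrite.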

8. Better Models Simplify Some Harness Layers, but Do Not Eliminate Harness Engineering

Claude Opus 4.7 is presented as better at long-running work, instruction-following, file-system memory, coding workflows, and tool-heavy execution. Anthropic also notes that prompts and harnesses may need retuning because stronger instruction following can change behavior.

So the trend is not:

Better model → no harness needed

It is more like:

Better model → different harness needed

Older harnesses may contain too much babysitting. Newer harnesses should focus more on evaluation, permissions, tool interfaces, observability, and scalable human supervision.

9. AI Engineers Become System Designers, Not Just Model Users

The article’s strongest implication is about the role of the AI engineer.

In 2024–2025, many AI engineers focused on:

prompting + RAG + model selection + API integration

In 2026-style agent systems, the work shifts toward:

agent runtime design
tool contracts
workflow orchestration
evaluation loops
automated QA
state/memory management
human review design
cost/latency controls
failure recovery

10. Can We Automate Harness Engineering?

Insights from Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
- The paper’s main idea is: coding-agent performance is not only about the base model. It strongly depends on the harness around the model — tools, middleware, memory, prompts, execution control, and feedback loops. The authors propose Agentic Harness Engineering (AHE), a loop where another agent automatically improves that harness using observed failures and measured outcomes.
- The key claim is that automatic harness improvement is bottlenecked more by observability than by model intelligence. If the evolution agent can clearly see what components exist, what happened in trajectories, and what each edit was supposed to fix, then it can improve the harness in a stable way.
Insights from Meta-Harness: End-to-End Optimization of Model Harnesses
- Core idea: instead of manually designing the “harness” around an LLM, Meta-Harness uses another coding agent to automatically search over harness code. Here, “harness” means the external program that controls what the model stores, retrieves, sees in context, and does across steps.
- The paper argues that LLM system performance is not only a model-weight problem. The surrounding harness can strongly affect performance: prompt construction, memory, retrieval, tool calls, state updates, and execution loop design all matter. Existing prompt/text optimizers are too lossy for this because they usually optimize from scalar scores, short feedback, or summaries. Meta-Harness instead gives the proposer agent access to the full history of previous candidates: source code, scores, execution traces, prompts, model outputs, tool calls, and state updates.
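The outer loop that both papers describe can be sketched as a search in which the proposer receives the full history of candidates rather than a single score. This is a toy under heavy assumptions, not the algorithm from either paper: the "harness" is reduced to a `retries=N` string, and `search_harness`, `toy_evaluate`, and `toy_propose` are hypothetical stand-ins for a real benchmark and a real proposer agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    source: str       # Harness code (here just a config string).
    score: float      # Measured outcome on the evaluation tasks.
    trace: List[str]  # Execution trace the proposer can inspect.


def search_harness(
    seed: str,
    evaluate: Callable[[str], Tuple[float, List[str]]],
    propose: Callable[[List[Candidate]], str],
    rounds: int,
) -> Candidate:
    """Evaluate the seed, then repeatedly propose and evaluate new harnesses."""
    score, trace = evaluate(seed)
    history = [Candidate(seed, score, trace)]
    for _ in range(rounds):
        # The proposer sees every prior candidate: source, score, and trace.
        source = propose(history)
        score, trace = evaluate(source)
        history.append(Candidate(source, score, trace))
    return max(history, key=lambda candidate: candidate.score)


def toy_evaluate(source: str) -> Tuple[float, List[str]]:
    # Toy benchmark (assumption): tasks pass once the harness allows 3 retries.
    retries = int(source.split("=")[1])
    if retries >= 3:
        return 1.0, ["all tasks passed"]
    return retries / 3, [f"task failed after {retries} retries"]


def toy_propose(history: List[Candidate]) -> str:
    # Stand-in for a proposer agent: read the best trace and patch the cause.
    best = max(history, key=lambda candidate: candidate.score)
    retries = int(best.source.split("=")[1])
    if any("failed" in line for line in best.trace):
        return f"retries={retries + 1}"
    return best.source
```

`search_harness("retries=0", toy_evaluate, toy_propose, rounds=4)` converges on `retries=3`. The toy proposer only reads the best candidate's trace. The observability argument above is that a real proposer improves when it can inspect all of the history, which is why `propose` is handed the whole list.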

11. Final Thoughts

The article is not just about harness engineering. It is about a new software development architecture:

Agent = worker
Repo = memory
Docs = context map
CI = law
Observability = senses
Linters = taste enforcement
Execution plans = long-term memory
Human = product/architecture/control-plane

The strongest takeaway is:

The quality of agentic software development depends less on one perfect prompt and more on the quality of the environment the agent operates inside.

In the agent-first world:

Humans steer.
Agents execute.
Harnesses make execution reliable.

The best engineers will not only know how to write code. They will know how to build environments where agents can understand the system, act inside it, validate their work, and continuously improve the codebase without destroying its architecture.
