The Agent Is Not the System

I was happily having my agents write cold outreach email drafts for me.

Then: Oh great... another em dash (—).

ARRGHSKFJHF...

That was my reaction after my system kept producing batches of outreach emails with the same little tell in every draft.

In the first batch, I edited them out by hand.

After the second batch, I added a rule to the prompt: no em dashes. Ever.

That helped for a while. Then the same pattern came back in longer sessions, larger batches, and fresh agent runs. The rule was in the prompt, but the system did not reliably remember what we had already learned.

At that point, the question is not only: how do I prompt the agent better?

It is: where does the system remember what we already learned, how can we keep reminding it during longer-running sessions, and what should be enforced instead of merely remembered?

Better prompts help with the session in front of you. Better models help, too. But once agents draft emails, call tools, update systems, or carry work across sessions, reliability comes from the layer around the model.

Over the past year, in my work at Astrolabe and VirWave, I've run agents across software development, product work, data workflows, and internal automation. Some are simple coding assistants I control directly through the chat interface (Claude Code, Codex, etc.). Others might be part of a deployed agentic pipeline (think multiple agents working together) or generally closer to operational systems: they might have narrow instructions to repeatedly perform a specific task, call external tools, analyze datasets, draft emails, inspect UI state, modify files, summarize state, and carry work across multiple sessions.

Across all of it, the same constraint kept showing up:

The agent is not the whole system. The harness is.

By an agentic harness, I mean the durable operating layer around the agent: the context it receives at session start, the tools it can access, the validators that check its output, the hooks that run at specific trigger points, the gates in front of actions that cannot be undone, and the rules that capture known failure modes.

A prompt tells the agent what to do in this session.

A harness tells the system what must remain true every time, and it can be used in both local interactive agents that one controls via a chat interface and deployed agentic systems.

Some parts of a harness are advisory: project instructions, memory files, standing orders, context summaries. They make the agent more likely to do the right thing from the start.

Other parts are enforced: validators, permission boundaries, tool wrappers, approval gates, and blocked commands. They make certain failures impossible, or at least much harder.

Both matter in their own right. If either is missing, things might get brittle.

I saw this clearly while building a pipeline for cold or light-warm b2b outreach: an agentic workflow that scouts companies, researches contacts, drafts outbound emails for review, and then sends them once I give my approval, updating the CRM and sequence tracker accordingly. That is where most of my frustration around email writing stems from.

I run this workflow in two different modes. 1.) As a deployed system of multiple agents that runs automatically in a regular interval, or 2.) especially during development, via an interactive chat session in Claude Code, Codex, or GitHub Copilot, where the agent I am controlling acts as an orchestrator of subagents who then take over the tasks of the deployed agents in the other mode.

Three pieces of that system became harnesses.

The approval gate

The pipeline cannot send emails without explicit human approval.

In the deployed agentic systems, this is structural. Drafting and sending run as separate scripts that invoke separate agents with no connection between them. The script that produces the draft batch exits when it's done. The script that is sent is a separate program. It doesn't run unless the system runtime invokes it. The drafting script has no access to the send function. The architecture prevents the system from sending without my authorization.

When I run the same workflow via an interactive chat session, the gate is different. It lives in the project instructions (a CLAUDE.md file) that the orchestrating agent reads at session start. The rule is clear: no sends without explicit approval. But that rule is advisory. The agent is told not to send without permission; it isn't structurally blocked from doing so.

That difference matters. An advisory gate relies on the instruction holding. Instructions can fade under a long context, get lost in a session handoff, or drift as the system evolves. The structural gate of the deployed system doesn't have that problem; the architecture doesn't forget, regardless of session state.

This is the difference between asking an agent to be careful and designing the system so that care is not the only thing standing between a draft and a customer's inbox.

The wave validator

Before any email enters the send queue, a validator checks it against a set of writing rules.

Some are simple:

No em dashes, because they were showing up often enough to make the emails feel patterned
No filler openers ("Hope this finds you well," "I came across your profile,")
No generic "I wanted to reach out" language
No bridge sentences that over-explain why the offer is relevant
No three-part lists that make every email feel structurally identical

You can instruct the model before it writes by adding a rule to the system prompt or loading it into a hook that fires before the drafting runs. The validator runs after and, in this case, consists of a deterministic piece of code that checks the emails for patterns. Instructions can drift under a long context or vanish in a fresh session. The validator doesn't care what the model was told or what it could remember.

Each rule came from my manual review of the drafts, and the list has grown over time.

At 50 emails a day, five days a week, any pattern that slips through compounds. The validator exists, so I do not have to catch the same pattern manually every time. When it flags a violation, the finding goes back to the drafting agent, who fixes the email and reruns the validator. The loop continues until the batch is clean or until it needs escalation and is given to me for human review.

Validators are not just for email style. The same pattern shows up in other systems I created, agentic research and data-heavy extraction workflows, where the failure might not be an em dash but an unsupported inference. I noticed that agents like to extend real facts into conclusions that the source does not actually support, especially when it lacks evidence for the specific thing it's supposed to extract. In other words, it makes things up if it can't find what it's looking for. In those cases, the harness has to force the system to return without the false conclusion, lower confidence, or point back to evidence instead of letting a plausible but made-up claim pass through. In more complex cases, validators might take on different shapes, not just pieces of code, for example, as an LLM-as-judge, where I call an independent LLM in 'adversarial' review mode to judge.

The tool wrapper

The system also calls external tools. That creates a different class of failure.

At one point, a CLI call from a Python subprocess an agent called was failing because the process did not inherit the environment variable it needed. To the calling agent, the result appeared to be an API issue. The error sent the session down the wrong debugging path.

The fix was small: set the required environment variable every time that tool is called.

I could have told the agent, "Remember to set this variable." Instead, I wrapped the tool call so the environment is set by default. Now the next session does not have to rediscover the failure, and the system carries the lesson in memory.

That is a harness. The pattern is called a tool wrapper. A thin layer that handles the configuration because the caller agent can't be trusted to remember. The right environment is always set, regardless of how or when the agent invokes it.

Some harnesses are project-specific. Others are global across every agent session I run. The harnesses I used here are just examples. They come in many shapes and sizes.

In a coding project, a hook injects the current branch, recent commits, and working tree status. That gives the agent temporal awareness without having to spend the first few minutes finding its bearings.

A context-compression hook enforces a specific summary format: task status, open decisions, files changed, and next steps. Without that, compressed summaries can be technically accurate but operationally useless.

A security hook blocks selected dangerous command patterns, including destructive deletes and pipe-to-shell commands.

A model can be careful. A harness can make some mistakes impossible.

Better foundation models help. But a better model doesn't see the pattern across a wave or across sessions. It's the brain, but it has no memory. The harness does.

This matters beyond personal productivity.

A demo can run on a clever prompt. A production system cannot.

Once an agent can send an email, update a database, open a pull request, trigger a workflow, inspect customer data, or call an external tool, the model is only one part of the system.

The harness is where reliability starts to show up:

What actions require approval?
What outputs are checked before use?
What tools are exposed?
What context is loaded at session start?
What state survives compression?
What failures are blocked by design?
What lessons are written down so the next session does not rediscover them?

That is the layer between model capability and repeatable work.

My rule now is simple: every repeated agent failure falls into one of four categories.

A validator rule to catch the pattern before the agent acts on it.
A tool wrapper that constrains what the tool can reach or assume about its environment.
An approval gate that puts a human between the agent and an irreversible step.
And when none of those fit cleanly, it becomes a line in the project instructions and hence becomes a part of the prompt.

The CLAUDE.md file in that outreach pipeline is now around 300 lines; it grew organically. Some entries are stale, and some should be part of the other harness pieces rather than a note. Refactoring is a real maintenance task that needs to be done periodically. But even with those flaws, it is better than starting every session from scratch.

If you want to figure out where what fits, maybe just ask your agents. They have good opinions, even without a proper harness.

These four buckets are a starting framework. Every system has its own surface area: how narrowly you scope each agent's assignment, how context survives compression, how handoffs are structured. Model-level parameters such as temperature, stop sequences, and output prefilling are another lever on top of that.

The goal is not a perfect configuration. It is a configuration that holds most of the time for the task the agent is performing, fails clearly when it doesn't, and can be patched without rebuilding everything.

You find a configuration that holds. Most of the time, several will. Tighten it where it leaks.

Better prompts help with the session in front of you. The harder work is building the layer that makes every session after it more reliable.

The agent is new every session. The harness stays.

Pass this on if it was useful.