POV · May 25, 2026 · 8 min read

Harness engineering: build the test rig before the model

Most AI builds fail because the model ships before the harness that measures it. How we engineer evals, guardrails, and a repeatable delivery rig.

by Ankor

A team comes to us with a model that works in the demo and falls apart in production. Nine times out of ten, the model isn’t the problem. The problem is that there was never a harness around it — nothing to measure whether a change made things better or worse, nothing to stop a bad output before a user saw it, nothing to tell you why yesterday’s prompt beat today’s.

We build the harness first. The model is the easy part.

TL;DR: A harness is the scaffolding that keeps an AI system honest, made up of evals, guardrails, and observability. Build it before you tune the model, because without it every change is a guess. The same discipline applies one level up: a consultancy that ships reliably across many clients runs on a delivery harness, not heroics. Build the rig, and the model becomes interchangeable.

The first thing we build is never the model

The instinct on a new AI project is to reach for the model — pick the LLM, write the prompt, wire up retrieval, ship a demo. Demos are cheap. The expensive question arrives a week later: is the new version better? If you can’t answer that in minutes with a number, you’re not engineering, you’re gambling.

So before we tune anything, we build the thing that answers it. A graded set of real inputs and expected behaviours. A way to run any candidate — a new prompt, a new model, a new retrieval strategy — against that set and get a score. The moment that rig exists, the project changes character. Decisions stop being arguments about taste and start being experiments with results. You can move fast precisely because you can tell when you’ve moved backwards.
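As a rough illustration of that rig, here is a minimal sketch in Python. Everything in it is an assumption made for the example: `EvalCase`, `score_candidate`, and the two toy candidates are hypothetical names, and exact string match stands in for whatever grading a real project needs. The point is only the shape: one graded set, any candidate, one number back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One graded example: a real input and the behaviour we expect."""
    prompt: str
    expected: str


def score_candidate(candidate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the candidate answers exactly right."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if candidate(case.prompt).strip() == case.expected)
    return passed / len(cases)


# Hypothetical stand-ins for two versions of a system (a new prompt,
# a new model, a new retrieval strategy). Real candidates would call a model.
def candidate_v1(prompt: str) -> str:
    return prompt.upper()


def candidate_v2(prompt: str) -> str:
    # A "faster" version that quietly breaks on longer inputs.
    return prompt.upper() if len(prompt) < 4 else prompt


CASES = [EvalCase("abc", "ABC"), EvalCase("hello", "HELLO")]

baseline = score_candidate(candidate_v1, CASES)
challenger = score_candidate(candidate_v2, CASES)
regressed = challenger < baseline  # True means the change moved us backwards
```

With a score for the baseline and a score for the challenger, "is the new version better?" becomes a comparison rather than a debate.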

This is the inversion most teams miss: the harness isn’t overhead you add once the model works. It’s the instrument that tells you the model works. You build it first because everything after it depends on its readings.

What a harness actually contains

A production harness has three layers, and skipping any one of them shows up later as an outage or a lost client.

Evals. A versioned set of inputs with graded expected outputs, plus the scoring logic to run candidates against them. Some grading is exact-match, some is rubric-based, some uses a model to judge a model — each with its own failure modes. The point isn’t a single accuracy number; it’s a regression suite you can run on every change. Public tooling like OpenAI’s evals framework is a fine starting shape, but the eval set that matters is the one built from your failure cases, not a generic benchmark.
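To make the "regression suite, not a single number" idea concrete, here is a small sketch that mixes two of the grading styles mentioned above and reports which cases a change broke. It is an assumption-laden illustration, not a reference to any framework's API: `exact`, `rubric`, `run_suite`, `regressions`, the suite contents, and both toy systems are hypothetical. Model-as-judge grading is left out because it depends on the provider.

```python
from typing import Callable, Sequence

Grader = Callable[[str], bool]


def exact(expected: str) -> Grader:
    """Exact-match grading: the output must equal the expected string."""
    return lambda output: output.strip() == expected


def rubric(required: Sequence[str], forbidden: Sequence[str] = ()) -> Grader:
    """Rubric grading (simplified): required phrases present, forbidden ones absent."""
    def grade(output: str) -> bool:
        text = output.lower()
        has_required = all(phrase.lower() in text for phrase in required)
        has_forbidden = any(phrase.lower() in text for phrase in forbidden)
        return has_required and not has_forbidden
    return grade


# Hypothetical eval set: case id -> (input, grader). In practice this is
# versioned alongside the code and built from real failure cases.
SUITE: dict[str, tuple[str, Grader]] = {
    "refund-window": ("How long do I have to return an item?",
                      rubric(["30 days"], forbidden=["guarantee"])),
    "order-status-format": ("Status of order 17?", exact("SHIPPED")),
}


def run_suite(candidate: Callable[[str], str],
              suite: dict[str, tuple[str, Grader]]) -> dict[str, bool]:
    """Run one candidate over every case and record pass/fail per case."""
    return {case_id: grader(candidate(prompt)) for case_id, (prompt, grader) in suite.items()}


def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Case ids that passed before the change and fail after it."""
    return sorted(cid for cid, ok in before.items() if ok and not after.get(cid, False))


# Toy systems standing in for the current and proposed versions.
def old_system(prompt: str) -> str:
    return "You can return items within 30 days." if "return" in prompt else "SHIPPED"


def new_system(prompt: str) -> str:
    return "We guarantee returns within 30 days." if "return" in prompt else "SHIPPED"


before = run_suite(old_system, SUITE)
after = run_suite(new_system, SUITE)
broken = regressions(before, after)
```

Reporting per-case results rather than one aggregate is the design choice that matters here: a change can hold the average steady while breaking a case someone cares about.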

Guardrails. The runtime checks that sit between the model and the user — input validation, output filtering, schema enforcement, refusal handling, and the fallbacks for when a call times out or returns garbage. Guardrails are where “works in the demo” becomes “safe in production.” Designing them well is most of what AI agent development actually is: an agent without guardrails is a liability with a nice interface.
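A guardrail layer can start very small. The sketch below wraps a model call with input validation, a timeout fallback, and schema enforcement on the output. It assumes, purely for illustration, that the model is a callable returning a JSON string with a string `answer` field; `guarded_call`, `FALLBACK`, `MAX_INPUT_CHARS`, and the reason codes are hypothetical names, not part of any library.

```python
import json
from typing import Callable

# Hypothetical safe response returned whenever a check fails.
FALLBACK = {"answer": None, "status": "fallback"}
MAX_INPUT_CHARS = 2000  # assumed limit for the example


def guarded_call(model: Callable[[str], str], user_input: str) -> dict:
    """Call the model, but never let an invalid input or output reach the user."""
    # Input validation: reject empty or oversized input before spending a call.
    if not user_input.strip() or len(user_input) > MAX_INPUT_CHARS:
        return {**FALLBACK, "reason": "invalid_input"}

    # Fallback for a call that times out.
    try:
        raw = model(user_input)
    except TimeoutError:
        return {**FALLBACK, "reason": "timeout"}

    # Fallback for output that is not valid JSON.
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {**FALLBACK, "reason": "unparseable_output"}

    # Schema enforcement: the output must be an object with a string "answer".
    if not isinstance(parsed, dict) or not isinstance(parsed.get("answer"), str):
        return {**FALLBACK, "reason": "schema_violation"}

    return {"answer": parsed["answer"], "status": "ok"}


# Toy models standing in for real calls.
def good_model(_: str) -> str:
    return '{"answer": "42"}'


def garbage_model(_: str) -> str:
    return "sure, here you go!"


def slow_model(_: str) -> str:
    raise TimeoutError("upstream call timed out")


ok_result = guarded_call(good_model, "What is six times seven?")
garbage_result = guarded_call(garbage_model, "What is six times seven?")
timeout_result = guarded_call(slow_model, "What is six times seven?")
```

Every failure path returns the same shape with a reason code, so the caller has one contract to handle and the reasons can be counted later.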

Observability. Structured traces of every run — inputs, intermediate steps, tool calls, outputs, latency, cost. When something goes wrong at 2am, the difference between a five-minute fix and a five-hour one is whether you can replay the exact run that failed. Observability also feeds back into evals: real production failures become tomorrow’s test cases.
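Here is one way the trace-and-replay loop might look, as a sketch under stated assumptions. `traced_run`, `replay`, and `to_eval_case` are hypothetical helpers, the sink is an in-memory list standing in for a real log store, and only input, output, and latency are recorded. Tool calls and cost would be extra fields whose shape depends on the provider, so they are left out rather than guessed.

```python
import json
import time
import uuid
from typing import Callable


def traced_run(model: Callable[[str], str], user_input: str, sink: list[str]) -> str:
    """Run the model and append one structured JSON trace line to the sink."""
    start = time.perf_counter()
    output = model(user_input)
    record = {
        "run_id": str(uuid.uuid4()),
        "input": user_input,
        "output": output,
        "latency_ms": round((time.perf_counter() - start) * 1000, 3),
    }
    sink.append(json.dumps(record))
    return output


def replay(trace_line: str, model: Callable[[str], str]) -> bool:
    """Re-run the exact recorded input and report whether the output still matches."""
    record = json.loads(trace_line)
    return model(record["input"]) == record["output"]


def to_eval_case(trace_line: str) -> dict:
    """Turn a recorded production run into a draft eval case for a human to grade."""
    record = json.loads(trace_line)
    return {"prompt": record["input"], "observed": record["output"], "expected": None}


# Toy model and in-memory sink for the example.
traces: list[str] = []
result = traced_run(str.upper, "reset my password", traces)
still_matches = replay(traces[0], str.upper)
draft_case = to_eval_case(traces[0])
```

The `expected` field is deliberately left empty: a trace records what the system did, and a person still has to decide what it should have done before the case joins the eval set.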

Together these three turn an AI system from a black box into something you can reason about, regression-test, and trust. None of them is the model. All of them outlast it: when a better model lands next quarter, you swap it in and let the harness tell you whether it earned its place.

The harness is also how a consultancy scales

Here’s the part that’s less obvious. Everything above is true one level up, too.

A consultancy that ships AI for a living faces the same problem as a single model: how do you stay reliable across dozens of engagements with different teams, stacks, and risk tolerances? The answer is the same. You build a delivery harness: a repeatable rig that keeps the work honest, not just the code.

For us that rig is the seven-stage framework we run on every engagement: Discover, Define, Design, Data, Develop, Deploy, Drive. Each stage has exit criteria — the consulting equivalent of an eval threshold. We don’t move from Define to Design on vibes; we move when the success metric and the guardrails are written down and agreed. The AI readiness assessment is the front of that harness: a structured diagnostic that catches the projects that shouldn’t be built before anyone writes code, the same way a good eval suite catches a bad model before it ships.

This is what lets a team that has shipped software for a decade keep its standards constant across clients. The harness is the institutional memory. New engagement, same rig.

Where the harness pays off: trust, not just accuracy

The reason to invest in all this isn’t a higher accuracy number. It’s trust — and trust is usually what the project is actually about.

Take real-time proctoring, where we built XAM’s assessment platform. The hard problem wasn’t detecting whether a candidate might be cheating; a model can flag gaze drift and off-screen audio readily enough. The hard problem was earning the trust of the human reviewers who act on those flags. A model that cries wolf ten times an exam trains people to ignore it — worse than no model at all.

So we built the system as an evidence pipeline, not a verdict engine. Every signal is scored, timestamped, and bundled into a reviewable timeline; an LLM is scoped to summarising sessions, never to deciding them. That’s a harness decision, not a model decision. The guardrail — keep the model out of the verdict — is what kept the audit trail human-reviewable and the reviewers on side. The result wasn’t a cleverer model. It was a system people were willing to trust with consequential decisions.
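The "evidence pipeline, not verdict engine" split can be expressed in the types themselves. The sketch below is an illustration of the pattern only, not XAM's implementation: `Signal`, `SessionReview`, `build_review`, `record_verdict`, and the signal names are all hypothetical. The summariser is any callable (an LLM in practice, a toy function here), and the only code path that sets a verdict requires a human reviewer id.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Signal:
    """One scored, timestamped piece of evidence from a session."""
    timestamp_s: float
    kind: str      # hypothetical label, e.g. "gaze_drift"
    score: float   # detector confidence in [0, 1]


@dataclass
class SessionReview:
    """What a human reviewer sees: a timeline plus a summary, no verdict yet."""
    timeline: list[Signal]
    summary: str
    verdict: Optional[str] = None
    decided_by: Optional[str] = None


def build_review(signals: list[Signal],
                 summarise: Callable[[list[Signal]], str]) -> SessionReview:
    """Bundle every signal into a time-ordered timeline and attach a summary.

    The summariser only produces text; it has no way to set the verdict.
    """
    timeline = sorted(signals, key=lambda s: s.timestamp_s)
    return SessionReview(timeline=timeline, summary=summarise(timeline))


def record_verdict(review: SessionReview, verdict: str, reviewer_id: str) -> SessionReview:
    """The only place a verdict is set, and it requires a named human reviewer."""
    if not reviewer_id.strip():
        raise ValueError("a verdict must be attributed to a human reviewer")
    review.verdict = verdict
    review.decided_by = reviewer_id
    return review


# Toy summariser standing in for a scoped LLM call.
def toy_summary(timeline: list[Signal]) -> str:
    return f"{len(timeline)} signals, first at {timeline[0].timestamp_s:.0f}s" if timeline else "no signals"


review = build_review(
    [Signal(95.0, "off_screen_audio", 0.7), Signal(12.0, "gaze_drift", 0.4)],
    toy_summary,
)
```

Calling `record_verdict(review, "no_action", reviewer_id="reviewer-17")` then records both the decision and who made it, which is what keeps the audit trail attributable to a person.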

That’s the through-line. Accuracy gets you a demo. A harness gets you something a business will put its name behind.

What to do next

If you’re starting an AI build, or rescuing one that stalled, the move is the same: stop tuning and build the instrument first.

Write ten real failure cases before you touch the prompt. They become your first eval set and they’ll surprise you.
Decide what the model is not allowed to do — the verdict it can’t make, the field it can’t hallucinate, the action it can’t take without a human. That’s your first guardrail.
Instrument every run from day one. Cost and latency are features; you can’t manage what you don’t trace.
Give the engagement itself exit criteria. If you can’t say what “done” means for this phase, you’re not ready to start it.

Do this and the model stops being the scary part of the project. It becomes the easy, swappable part — exactly where it belongs.

Ankor builds the evals, guardrails, and observability that make AI safe to ship — see how that shows up in AI agent development, or in the way we shipped trustworthy proctoring in the XAM case study. If you’ve got a system that demos well and breaks in production, book a 30-minute call or email us at team@ankor.us.

#evaluation #agents #guardrails #consulting
