A familiar conversation. The team has a model-powered feature that works in demos. They've shown it to a customer. The customer wants it. The PM wants to ship in two weeks. Somebody asks about evals. The room agrees that evals are important. The room also agrees that the launch deadline is fixed. The team ships the feature without evals. The eval work goes on the next-sprint list, and from there to the never list.

Six weeks after launch the feature starts misbehaving in ways the team can't reproduce. The model has been updated by the provider. Or the prompt has been edited three times by three different people. Or the upstream data has shifted. Nobody knows when the regression started because nobody was measuring. The team spends two sprints fighting fires. Trust in the feature, and in the team, takes a hit.

This is preventable, and the prevention is simple, and almost nobody does it. The discipline is: write the eval first.

What we're describing isn't a new discipline. It's TDD, applied to AI. The instinct that makes a careful engineer write the unit test before the function shows up here unchanged: write down the bar, then build to it. The mechanics are different — the test is a small set of input/output pairs and a grader, not an assert_equal — but the principle is identical, and most of the lessons from twenty years of TDD travel without modification. If your team already does TDD on the classical parts of the codebase, evals are not a step change; they're the same instinct showing up in a new place.

What "the eval" actually is

An eval is not a benchmark. It is a small, frozen set of real inputs paired with the outputs you'd expect the feature to produce, plus a way to grade the model's output against those expectations.

A useful eval has three properties:

  1. The inputs come from real production traffic (or a close approximation). Synthetic test inputs miss the weird edge cases that real users produce.
  2. The expected outputs are agreed by a human who knows the domain. A subject-matter expert wrote them or signed off on them — not a prompt engineer guessing what "good" looks like.
  3. The grading rubric is mechanical enough to run automatically. Either the grader is itself a model with a clear rubric, or the comparison is structural (does the output contain the expected fields? is the classification right?). If grading the eval requires a human in the loop every time, the eval will not get run.

Three examples are enough to start. The first eval most teams ship has five. Twenty is mature. The aim is not statistical rigour; the aim is a tripwire that fires when the system regresses. Pick the inputs you'd most regret getting wrong, and write down what right looks like for each. The rest can be added as the team discovers cases that matter.
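
To make the shape concrete, here is one way the frozen set and its grader might look. This is a minimal sketch, assuming a ticket-triage feature: the evals.json layout, the field names, and the call_model wrapper are illustrative rather than taken from any particular stack, and the grading here is purely structural (right classification, required field present).

```python
import json

# evals.json -- a handful of real inputs, each paired with the output the
# team has agreed is right. The file name, field names, and triage task are
# illustrative.
#
# [
#   {"input": "My March invoice was charged twice, please refund one.",
#    "expected": {"category": "billing", "needs_human": true}},
#   {"input": "how do i reset my password",
#    "expected": {"category": "account", "needs_human": false}}
# ]

def grade(expected: dict, actual: dict) -> bool:
    """Structural grading: the classification matches and the required
    field is present with the right type."""
    return (
        actual.get("category") == expected["category"]
        and isinstance(actual.get("needs_human"), bool)
    )

def run_evals(call_model, path: str = "evals.json") -> float:
    """Run every frozen case through the model and return the pass rate.
    call_model is whatever function wraps the provider call and returns a
    parsed dict; it is assumed here, not part of any particular SDK."""
    with open(path) as f:
        cases = json.load(f)
    passed = sum(grade(case["expected"], call_model(case["input"])) for case in cases)
    return passed / len(cases)
```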

When the formal eval is overkill

Not every model call needs an eval. Models in 2026 are markedly more reliable than they were two years ago, and for a wide class of ordinary tasks — short structured outputs, common classification, simple extraction from clean text — the frontier models are consistent enough that a couple of manual spot-checks during development will catch anything worth catching.

Reserve the discipline for the cases that earn it: features that are customer-visible, features where a wrong answer would embarrass the team, features that depend on the model handling an edge case you've been bitten by before. The rule of thumb is straightforward — if you'd lose sleep over the feature drifting in production, write the eval. If the worst case is "we'd notice and fix it within a day", spot-checks may be enough.

This is a deliberate softening of the orthodoxy. A few years ago, when models were less consistent, the honest answer was "always eval". Today, the honest answer is "eval the things that matter, and don't drown the team in eval scaffolding for the rest". The discipline still exists; the bar for when it applies has risen.

Why writing it first matters

Three reasons.

It forces you to define "good". The act of writing twenty expected outputs makes the team confront what they actually want the feature to do, in cases they hadn't considered. It is common for the eval-writing exercise to surface that the team's product spec was incomplete. Better to discover that on the way in than on the way out.

It gives you a target. The build phase becomes: get the eval green. That is a clean, measurable goal. Without it, the build phase becomes: get the demo to work, which is a much shakier target.

It lets you change models without fear. The provider releases a new version. You run the eval. The eval passes. You upgrade. The whole process takes an hour. Without the eval, the same upgrade is a multi-week regression-testing exercise that the team will avoid until they're forced into it.
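
Concretely, and assuming the run_evals helper sketched earlier plus a call_model wrapper that accepts a model identifier, the upgrade check is a few lines:

```python
# Run the same frozen eval set against the current and candidate model
# versions. The model names are placeholders, not real provider identifiers.
current = run_evals(lambda text: call_model(text, model="provider-model-v1"))
candidate = run_evals(lambda text: call_model(text, model="provider-model-v2"))

print(f"current: {current:.0%}   candidate: {candidate:.0%}")
if candidate >= current:
    print("Candidate matches or beats the current model on the eval; upgrade.")
```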

The objection, and the answer

The standard objection is that writing the eval slows the team down. At the realistic bar of three to five examples, the cost is mostly mythical — the eval-writing exercise takes a couple of hours, not a week, and the team agrees on what "good" looks like in a single conversation. Some of the expected outputs will be contested, and that's productive, because the contest happens before the feature ships rather than after.

Where the objection has more force is in the temptation to treat the first eval as the final one. Don't. The first eval is a starting point that earns the right to grow. As the team finds new edge cases in production — and they will — the eval grows by one example at a time. The teams that get this right don't try to enumerate every case up front; they treat the eval as a living artefact that captures whatever the team has learned so far.

The more mature the team, the smaller the up-front cost. The first eval a team writes is the awkward one. The fifth is routine. The fifteenth happens in an afternoon, because the team has a template, a grading helper, and an internal habit of asking "what's the eval" before they ask "what's the prompt".

What ships when

The shipping order in an AI-first team is:

  1. Write the eval. A handful of inputs paired with what "right" looks like. Run it past whoever knows the domain — a five-minute conversation, not a formal review.
  2. Build the feature against the eval.
  3. Pass the eval at the agreed bar. (Not 100%. The bar is whatever the team has decided is acceptable for this feature. For an internal triage tool, 85% might be fine. For a customer-facing summariser, the bar is higher.)
  4. Ship.
  5. Wire the eval into the build pipeline so it runs on every change.

Step five is the one that makes the difference six months later. The eval is not a one-off check; it is a permanent observation point. When a prompt changes, the eval runs. When the model changes, the eval runs. When the upstream data shifts, the eval — which uses real inputs — starts failing, and the team finds out before the customer does.
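
One light way to get steps three and five is to express the eval as an ordinary test, so whatever already runs the test suite runs the eval on every change. A sketch using pytest, with hypothetical module paths, the run_evals helper from the earlier sketch, and the 85% bar from the internal-triage example:

```python
# test_evals.py -- collected by pytest alongside every other test, so any
# prompt edit, model change, or data shift that breaks the eval fails the
# build. The module paths and call_model wrapper are illustrative.
from my_feature.model import call_model   # hypothetical provider wrapper
from my_feature.evals import run_evals    # the helper sketched earlier

PASS_BAR = 0.85  # whatever bar the team has agreed for this feature

def test_feature_meets_eval_bar():
    pass_rate = run_evals(call_model, path="my_feature/evals.json")
    assert pass_rate >= PASS_BAR, (
        f"eval pass rate {pass_rate:.0%} is below the agreed bar of {PASS_BAR:.0%}"
    )
```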

The minimal practice

For a team that has never shipped an eval, the minimal version of this discipline is:

  1. Collect a handful of real inputs and write down the expected output for each, in an evals.json that lives next to the feature's code.
  2. Write a small harness that runs each input through the model and grades the result automatically.
  3. Wire the harness into the build pipeline so it runs on every change.

That is half a day of engineering, total, for the first feature. Every subsequent feature reuses the harness and adds another evals.json. After three or four features, the team has a permanent, automatic, regression-catching system that costs almost nothing to maintain — and the bar inside each evals.json rises naturally as the team learns what real production has thrown at it.
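
In practice the reuse often looks like a single runner that discovers every feature's evals.json and holds each to its own bar. Another sketch, assuming the directory layout and the helpers from the sketches above:

```python
# Run every feature's frozen eval set against its agreed bar. The features/
# layout, the per-feature bars, and call_model are illustrative.
from pathlib import Path

BARS = {"triage": 0.85, "summariser": 0.95}  # per-feature bars, as agreed

def test_every_feature_meets_its_bar():
    for evals_file in sorted(Path("features").glob("*/evals.json")):
        feature = evals_file.parent.name
        pass_rate = run_evals(call_model, path=str(evals_file))
        assert pass_rate >= BARS.get(feature, 0.9), (
            f"{feature} eval regressed to {pass_rate:.0%}"
        )
```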

The teams that adopt this discipline ship more reliably and sleep better. The teams that don't find themselves, six months in, doing the eval work anyway — usually in the worst possible week, in the wake of an incident.


We help teams build evals before the first incident, not after. Talk to us →