Tests Are Boolean. Evals Are a Gradient
Why a green suite is necessary but never sufficient
When I ran a dev shop, I advocated for testing: pairing on red-green-refactor, setting coverage gates in CI, and coaching teams until writing the test first felt as natural as writing the function it covered. The suites went green and stayed green.
Sometimes our code got better, but I think we benefited most from protecting our application from getting worse. The tests weren’t failing me; they were doing exactly what they were built to do, which is hold the line against regressions, and that work is indispensable. But it’s not the whole picture.
James Coplien presented a great visual for what’s going on: picture the test suite as a vector. Every test is a dimension, and a passing test contributes a zero, meaning the guardrail I built here is still standing. A thousand green tests is a vector of a thousand zeros, and that vector certifies that none of the failures you took the trouble to anticipate have crept back in. You should not ship without it. What it cannot do is point. An assertion encodes what you already believed correct looked like before you wrote it, so it tells you that you haven’t slipped below a line you drew yourself, which is exactly the job, but it cannot tell you which way is up. You can’t hill-climb on a boolean, because a yes/no has no slope to follow.
Quality was supposed to be the thing underneath all that green, but the green bar became the target instead, and once a measure becomes the target it stops measuring the thing you cared about.
Teams now evaluate AI systems, where a green dashboard is quietly becoming the stand-in for quality that a green suite used to be. The cleanest name I’ve found for the trap is a meme making the rounds among practitioners, “evals are not tests,” and we’re going to unpack that idea.
The land of evals
Think of an eval as a glorified assertion with an attached score rather than just pass or fail. In that respect evals are closer to benchmarks or metrics than to tests, because they tell us where we are on a gradient, and we try to climb that hill. This is especially useful when there isn’t an obvious correct answer that never changes, but there is something we want to make better, or something that must not fall below some level. A test might tell you that this string is shorter than 1000 characters, but an eval might tell you that this joke isn’t funny, or this poem doesn’t rhyme, or this landing page won’t convert customers.
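To make the contrast concrete, here is a minimal Python sketch. The names `length_check` and `rhyme_score` are mine, and the rhyme heuristic is a deliberately crude assumption for illustration (a real eval would more likely hand the poem to a human or model judge). What matters is the return type: one function gives you a bit, the other gives you a position on a scale.

```python
def length_check(text: str) -> bool:
    """A test: one bit, measured against a line drawn in advance."""
    return len(text) < 1000


def rhyme_score(poem: str) -> float:
    """An eval: a score in [0, 1] instead of a verdict.

    Toy heuristic (an assumption, not a real rhyme detector): the fraction
    of consecutive line pairs whose last words share their final two letters.
    """
    lines = [ln.strip() for ln in poem.splitlines() if ln.strip()]
    endings = [ln.split()[-1].lower().strip(".,;:!?")[-2:] for ln in lines]
    # Pair line 1 with line 2, line 3 with line 4, and so on.
    pairs = list(zip(endings[::2], endings[1::2]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

A poem that passes `length_check` can still land anywhere between 0.0 and 1.0 on `rhyme_score`, and only the second number tells you whether a revision moved it.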
The simplest statement of the difference I’ve read is a short note called Evals Are Not Tests, and the line that stuck with me is this: “A test you can always pass is finished. An eval you can always pass is broken.” Read that twice if you came up through TDD, because it inverts the instinct the green bar trained into you. For a test, all-green means done. For an eval, all-green means the eval has become too easy and stopped telling you anything, so the right response is to add harder cases until the baseline starts to fail again. The moments when an eval has saturated, or a system has overfit a benchmark, are often the most valuable ones, because they tell you that your benchmark has to change or your rubric has to become richer.
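One way to act on that inversion is to treat a perfect baseline as an alarm rather than a finish line. The sketch below is an assumption about how you might wire that up, not something taken from the note: `is_saturated` and its `ceiling` parameter are hypothetical names, and the threshold is yours to pick.

```python
def is_saturated(baseline_results: list[bool], ceiling: float = 1.0) -> bool:
    """Return True when the baseline passes at or above the ceiling rate.

    For a test suite this would be good news. For an eval it means the
    cases have stopped telling you anything and harder ones are needed.
    """
    if not baseline_results:
        raise ValueError("no results to judge")
    pass_rate = sum(baseline_results) / len(baseline_results)
    return pass_rate >= ceiling
```

In a CI job, a `True` here would fail the build with a message along the lines of “add harder cases,” which is the opposite of what a green bar usually means.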
The rest of the differences fall out of that one. A test scores a single program against a known answer. An eval scores the difference a change made, across many cases and several runs, because the model underneath is stochastic and one run is a sample rather than a verdict. And in place of a lone assertion it grades against a rubric of several criteria at once. A rubric of graded, varying signals has a direction. It can tell you not just that you fell short but which dimension is weakest, and therefore which way is up.
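Here is a rough Python sketch of that shape: several runs per variant, a rubric of named criteria, and a per-criterion difference between baseline and candidate. Everything in it is an assumption for illustration, including the function names, the criterion names used in the example, and the choice of a plain mean to summarize runs.

```python
from statistics import mean

Scores = dict[str, float]  # criterion name -> score in [0, 1]


def criterion_means(runs: list[Scores]) -> Scores:
    """Average each rubric criterion over runs, since one run is only a sample."""
    if not runs:
        raise ValueError("need at least one run")
    return {c: mean(r[c] for r in runs) for c in runs[0]}


def compare(baseline: list[Scores], candidate: list[Scores]) -> Scores:
    """Per-criterion difference a change made (positive means the candidate is better)."""
    b, c = criterion_means(baseline), criterion_means(candidate)
    return {k: c[k] - b[k] for k in b}


def weakest(scores: Scores) -> str:
    """The lowest-scoring criterion, which is the direction to work on next."""
    return min(scores, key=scores.get)
```

If the baseline runs score `{"accuracy": 0.5, "tone": 0.5}` and `{"accuracy": 0.5, "tone": 1.0}`, and the candidate runs score `{"accuracy": 1.0, "tone": 0.5}` and `{"accuracy": 0.5, "tone": 0.5}`, then `compare` reports accuracy up by 0.25 and tone down by 0.25. A single pass/fail bit would have hidden both halves of that trade.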
In addition to rubrics, the land of evals gives us powerful new concepts for building non-deterministic agents and systems. We talk about hill-climbing, floor-raising, and valley-dodging, and we can also use an LLM or a human as the judge when we’re trying to capture something too nuanced for a simple assertion. Sometimes it’s not just the judges that are intelligent. The system under test may also be trying to cheat (by reward hacking, or by inferring the answer from the way the question is posed) in order to score higher on the eval. That’s when we bring in techniques like adversarial evals or online evals, so that the evals are hard enough to tell us what the system under test is actually capable of doing.
That’s really what evals are about. They’re for situations where you want to understand and improve a system that is complex enough (in the Cynefin sense) that you have to poke it with something almost as complex just to keep track of it over time.
Evals lie too, just differently
I don’t want to sell evals as the free lunch that tests turned out not to be, because they have their own way of going quiet on you. The argument in an essay called Your Evals Will Break is that an eval is most dangerous exactly when it looks healthy: the system crosses into some new regime, grows a capability or a failure mode nobody wrote a criterion for, and the eval keeps reporting fine scores against last year’s frontier.
I’ve written before about how a frozen gold dataset quietly betrays you the moment your system touches real users. Unfortunately, it gets worse: the world and your users keep changing, but your evals still reflect the old world.
So the discipline isn’t “write evals instead of tests” and then walk away. It’s to keep the eval alive: feed it the failures production actually surfaces, watch whether its scores still discriminate or have flattened into a fresh wall of green, and treat that flattening as the warning it is rather than the victory it resembles. The version that works is an eval built to notice its own obsolescence.
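A minimal sketch of those two habits, in Python. The names and the spread threshold are assumptions of mine. Standard deviation across cases is only one crude proxy for “still discriminates,” and you may prefer something else.

```python
from statistics import pstdev


def add_production_failures(cases: list[str], failures: list[str]) -> list[str]:
    """Return the case set plus any production failures not already in it."""
    seen = set(cases)
    merged = list(cases)
    for failure in failures:
        if failure not in seen:
            seen.add(failure)
            merged.append(failure)
    return merged


def has_flattened(case_scores: list[float], min_spread: float = 0.05) -> bool:
    """True when scores barely vary across cases, so the eval no longer discriminates.

    min_spread is a hypothetical threshold; tune it to your own score scale.
    """
    if len(case_scores) < 2:
        raise ValueError("need at least two case scores")
    return pstdev(case_scores) < min_spread
```

Run on a schedule, `has_flattened` turns a fresh wall of green into an alert instead of a celebration.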
Where we put this to work
This got concrete for me in a small example I built on Flue, the agent harness. Flue runs the agent, and I wanted to know whether my change helped or hurt. So I ran the same support-triage agent two ways, once with a warm prompt and once with a terse one, over a handful of tickets, and let Jetty grade every draft against a rubric that a different model scores. The warm version passed every case; the terse one was cheaper and failed four of five. A green checkmark would have hidden that trade from me. The whole thing is a few files you can run.
The part doing the real work there is the rubric. Because it scores each draft across several criteria instead of flipping one bit, a low score points at the weakest criterion, and that’s a direction I can climb. Hand the same agent a pass/fail check and it can tell me it tripped a wire, never whether it is getting better. The grader itself is a Jetty runbook, and the section that matters most is the evaluation loop: score, find the weakest dimension, fix it, score again.
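The shape of that loop fits in a few lines of Python. To be clear about what this is: a generic sketch, not the Jetty runbook and not Flue’s or Jetty’s API. The `score` and `fix` callables, the `target` threshold, and the `max_rounds` cap are all hypothetical stand-ins for whatever grader and revision step you actually use.

```python
from typing import Callable

Scores = dict[str, float]  # criterion name -> score in [0, 1]


def evaluation_loop(
    draft: str,
    score: Callable[[str], Scores],   # hypothetical grader: draft -> rubric scores
    fix: Callable[[str, str], str],   # hypothetical reviser: (draft, criterion) -> new draft
    target: float = 0.8,
    max_rounds: int = 5,
) -> tuple[str, Scores]:
    """Score, find the weakest dimension, fix it, score again."""
    scores = score(draft)
    for _ in range(max_rounds):
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= target:
            break  # even the weakest criterion clears the bar
        draft = fix(draft, weakest)
        scores = score(draft)
    return draft, scores
```

The cap on rounds is a design choice worth keeping: with a stochastic grader, a loop that only exits on success can spin on noise.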
I keep circling one question. If you already trust your suite to tell you that nothing broke, what would you have to build to know, with that same confidence, that it’s getting better?


