The Jetty Blog: Ground Truth

Looping into a Game About AI Evals

Jonathan Lebensold — Tue, 21 Jul 2026 08:32:22 GMT

Loop engineering is core to how Jetty’s runbooks work to improve AI workflows. It works really well: the agent has existing inputs, logs and telemetry and the outputs from a previous run. With all of this context, it becomes possible to improve the workflow, sometimes automatically. We decided to take this principle and build a game about AI evaluation. But would it be fun?

It started as a back-and-forth with Claude. By turning a pile of ideas into a product requirements document, we were able to make something you could actually play. Part of what I’m after is letting people feel what it’s like to watch an AI workflow get better with a human in the loop, instead of reading about it.

We started with GEPA: a reflective prompt optimizer by people at Berkeley and Databricks. This technique reads back its own trajectories in plain language, proposes fixes, and tries them. Sometimes, it can even beat reinforcement learning at a fraction of the compute cost.

We used these techniques to make Pelly’s Clinic, a browser game that teaches agent evaluation without ever using the word “eval.” To ground it in an example we’ve already worked on, we decided to frame the game around a doctor-patient visit.

Doctors increasingly hand their patient conversations to an AI “scribe” that listens to the visit and writes it up as a SOAP note: Subjective (what the patient reports), Objective (what the exam shows), Assessment (the diagnosis), Plan (what to do next). It’s the format every clinician’s chart already follows. The danger is that a scribe can hallucinate: write down something the patient never said, or promote a vague complaint into a confident diagnosis. In a medical record, an invented symptom can steer a real treatment decision.

Jetty’s agent loop: a prompt, a runbook with evals, an application, then optimize.

In the game you coach Pelly, a pelican working at a harbor wildlife clinic. A doctor sees patients (a gull, a seal, a very difficult octopus), and Pelly’s job is to interview each patient and convert what they say into a structured output. But since Pelly is still learning, he can sometimes embellish a patient’s account into a “fish story”. For instance, a seagull swears three eagles jumped him, and then Pelly charts the eagles as fact.

The first version was functional and completely un-fun: lots of screens, textboxes and back-and-forth dialogue. So we decided to use AI evals to make a game about AI evals better: we let agents play it.

We started with a runbook (a plain-markdown set of instructions) and pointed Jetty at it. The runbook tells an agent to open a real Chromium browser in a sandbox and play the game while journaling every moment of confusion with timestamps and screenshots.

The output from running the agent’s playtest.

Then it scores its own report against a five-part rubric. The runbook loop judges the insights from the run and decides whether the proposed fixes make sense. From the outermost loop we can ask Claude Code with the Jetty plugin to review the latest playtest, feed the five insights into a round of changes, redeploy, and set the agent loose again.

The best part was watching the loop argue with itself. Round three’s top complaint: “your title screen says the same thing twice. Replace the tagline with a visual of the actual mechanic.” Round four’s number-one finding, on the very next build: “I spent the first 90 seconds clicking the cards on the title screen because they look exactly like the live game. I thought I was already playing.” Every fix repositioned the next complaint.

After 5 rounds, we’re on our way to balancing a fun interaction while touching on the role of an LLM judge in AI evals. How should we guide the next loop of development?

Tests are Boolean. Evals are a Gradient

Jonathan Lebensold — Mon, 22 Jun 2026 16:47:30 GMT

When I ran a dev shop, I advocated for testing: pairing on red-green-refactor, setting coverage gates in CI, and coaching teams until writing the test first felt as natural as writing the function it covered. The suites went green and stayed green.

Sometimes our code got better, but I think we benefited most from protecting our application from getting worse. The tests weren’t failing me; they were doing exactly what they were built to do, which is hold the line against regressions, and that work is indispensable. But it’s not the whole picture.

James Coplien presented a great visual for what’s going on: picture the test suite as a vector. Every test is a dimension, and a passing test contributes a zero, meaning the guardrail I built here is still standing. A thousand green tests is a vector of a thousand zeros, and that vector certifies that none of the failures you took the trouble to anticipate have crept back in. You should not ship without it. What it cannot do is point. An assertion encodes what you already believed correct looked like before you wrote it, so it tells you that you haven’t slipped below a line you drew yourself, which is exactly the job, but it cannot tell you which way is up. You can’t hill-climb on a boolean, because a yes/no has no slope to follow.

A suite of unit tests doesn’t give teams a gradient for agent improvement.

Quality was supposed to be the thing underneath all that green but instead the green bar became the target, and once a measure becomes the target it stops measuring the thing you cared about.

Teams now evaluate AI systems, where a green dashboard is quietly becoming the stand-in for quality that a green suite used to be. The cleanest name I’ve found for the trap is a meme making the rounds among practitioners: “evals are not tests.” We’re going to unpack that idea.

The land of evals

Think of an eval as a glorified assertion with an attached score rather than just pass or fail. In that respect they’re closer to benchmarks or metrics than tests because they tell us where we are on a gradient and we try to climb up that hill. This is especially useful in situations where there isn’t an obvious correct answer that never changes but we have something we want to make better or which can’t get worse than some level. A test might tell you that this string is less than 1000 characters but an eval might tell you that this joke isn’t funny or this poem doesn’t rhyme or this landing page won’t convert customers.

The simplest statement of the difference I’ve read is a short note called Evals Are Not Tests, and the line that stuck with me is this: “A test you can always pass is finished. An eval you can always pass is broken.” Read that twice if you came up through TDD, because it inverts the instinct the green bar trained into you. For a test, all-green means done. For an eval, all-green means the eval has gone too easy and stopped telling you anything, so the right response is to add harder cases until the baseline starts to fail again. These situations where an eval has saturated or overfit a benchmark are often the most valuable because they tell you that your benchmark has to change or your rubric has to become richer.
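One way to make “an eval you can always pass is broken” operational is to flag the cases your baseline never fails. What follows is a minimal sketch, assuming each case’s history is kept as a list of pass/fail results; the function names, the case ids, and the 0.95 threshold are illustrative choices of mine, not part of any tool mentioned here.

```python
def saturated_cases(history: dict[str, list[bool]], threshold: float = 0.95) -> list[str]:
    """Return the ids of cases whose pass rate is at or above the threshold."""
    flagged = []
    for case_id, results in history.items():
        if not results:
            continue  # no runs recorded, so nothing to conclude
        if sum(results) / len(results) >= threshold:
            flagged.append(case_id)
    return sorted(flagged)


def is_saturated(history: dict[str, list[bool]], threshold: float = 0.95) -> bool:
    """True when every case that has runs is passing almost all of the time."""
    with_runs = [case_id for case_id, results in history.items() if results]
    return bool(with_runs) and len(saturated_cases(history, threshold)) == len(with_runs)


# Hypothetical pass/fail history for three cases (invented data).
history = {
    "refund-request": [True, True, True, True],
    "angry-customer": [True, False, True, True],
    "ambiguous-ticket": [True, True, True, True],
}

print(saturated_cases(history))  # ['ambiguous-ticket', 'refund-request']
print(is_saturated(history))     # False: 'angry-customer' still fails sometimes
```

The flagged cases are not wrong, they have just stopped carrying information. They are the candidates to replace with harder variants.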

The rest of the differences fall out of that one. A test scores a single program against a known answer. An eval scores the difference a change made, across many cases and several runs, because the model underneath is stochastic and one run is a sample rather than a verdict. And in place of a lone assertion it grades against a rubric of several criteria at once. A rubric of graded, varying signals has a direction. It can tell you not just that you fell short but which dimension is weakest, and therefore which way is up.
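To make that concrete, here is a small sketch of scoring the difference a change made, across cases and repeated runs, against a rubric. It assumes every run has already been graded into per-criterion scores between 0 and 1; the criterion names, case ids, and numbers are invented for illustration.

```python
from statistics import mean

# One graded run maps each rubric criterion to a score in [0, 1].
Run = dict[str, float]


def criterion_means(runs_by_case: dict[str, list[Run]]) -> dict[str, float]:
    """Average each rubric criterion over every case and every run."""
    collected: dict[str, list[float]] = {}
    for runs in runs_by_case.values():
        for run in runs:
            for criterion, score in run.items():
                collected.setdefault(criterion, []).append(score)
    return {criterion: mean(scores) for criterion, scores in collected.items()}


def compare(baseline: dict[str, list[Run]], candidate: dict[str, list[Run]]) -> dict[str, float]:
    """Per-criterion change, candidate minus baseline."""
    base, cand = criterion_means(baseline), criterion_means(candidate)
    return {c: cand[c] - base[c] for c in base if c in cand}


def weakest(runs_by_case: dict[str, list[Run]]) -> str:
    """The criterion with the lowest average: the direction to climb next."""
    means = criterion_means(runs_by_case)
    return min(means, key=means.get)


# Invented scores: two cases, with repeated runs on the first one.
baseline = {
    "t1": [{"accuracy": 1.0, "tone": 0.5}, {"accuracy": 0.5, "tone": 0.5}],
    "t2": [{"accuracy": 1.0, "tone": 0.25}],
}
candidate = {
    "t1": [{"accuracy": 1.0, "tone": 0.75}, {"accuracy": 0.5, "tone": 1.0}],
    "t2": [{"accuracy": 1.0, "tone": 0.75}],
}

print(weakest(baseline))             # tone
print(compare(baseline, candidate))  # tone improves, accuracy is unchanged
```

A single pass/fail bit would have reported the same thing for both variants. The per-criterion delta says what moved and what did not.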

In addition to rubrics the land of evals gives us lots of powerful new concepts when we’re building non-deterministic agents and systems. We talk about hill-climbing, floor-raising and valley-dodging but we also have the ability to use an LLM or a human as the judge for situations where we’re trying to capture something too nuanced to be represented with a simple assertion. Sometimes it’s not just the judges that are intelligent. The system under test may also be trying to cheat (by reward hacking or inferring the answer from the way the question is posed) in order to score higher on the eval. That’s when we start bringing in techniques like adversarial evals or online evals so that the tests are hard enough to tell us what the system under test is actually capable of doing.

That’s really what evals are about. They’re for situations where you want to understand and improve a system that is complex enough (in the Cynefin sense) that you have to poke it with something almost as complex just to keep track of it over time.

Evals lie too, just differently

I don’t want to sell evals as the free lunch that tests turned out not to be, because they have their own way of going quiet on you. The argument in an essay called Your Evals Will Break is that an eval is most dangerous exactly when it looks healthy: the system crosses into some new regime, grows a capability or a failure mode nobody wrote a criterion for, and the eval keeps reporting fine scores against last year’s frontier.

I’ve written before about how a frozen gold dataset quietly betrays you the moment your system touches real users. Unfortunately it gets worse since the world and your users keep changing but your evals will still reflect the old world.

So the discipline isn’t “write evals instead of tests” and then walk away. It’s to keep the eval alive: feed it the failures production actually surfaces, watch whether its scores still discriminate or have flattened into a fresh wall of green, and treat that flattening as the warning it is rather than the victory it resembles. The version that works is an eval built to notice its own obsolescence.

Where we put this to work

This got concrete for me in a small example I built on Flue, the agent harness. Flue runs the agent, and I wanted to know whether my change helped or hurt. So I ran the same support-triage agent two ways, once with a warm prompt and once with a terse one, over a handful of tickets, and let Jetty grade every draft against a rubric that a different model scores. The warm version passed every case; the terse one was cheaper and failed four of five. A green checkmark would have hidden that trade from me. The whole thing is a few files you can run.

The part doing the real work there is the rubric. That’s because it scores each draft across several criteria instead of flipping one bit, a low score points at the weakest criterion, and that’s a direction I can climb. Hand the same agent a pass/fail check and it can tell me it tripped a wire, never whether it is getting better. The grader itself is a Jetty runbook and the section that matters most is the evaluation loop: score, find the weakest dimension, fix it, score again.
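The loop itself is short enough to sketch. This is not the Jetty runbook, just the shape of “score, find the weakest dimension, fix it, score again” under some assumptions: the scorer and the reviser are passed in as plain functions, and the toy versions below stand in for an LLM judge and an agent. Every name and threshold here is hypothetical.

```python
from typing import Callable

Scores = dict[str, float]


def improve(draft: str,
            score: Callable[[str], Scores],
            revise: Callable[[str, str], str],
            target: float = 0.8,
            max_rounds: int = 5) -> tuple[str, list[Scores]]:
    """Score, find the weakest criterion, revise for it, and score again."""
    history: list[Scores] = []
    for _ in range(max_rounds):
        scores = score(draft)
        history.append(scores)
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= target:
            break  # every criterion clears the bar
        draft = revise(draft, weakest)
    return draft, history


def toy_score(draft: str) -> Scores:
    """Stand-in for a judge: keyword checks instead of a model."""
    text = draft.lower()
    return {
        "empathy": 1.0 if "sorry" in text else 0.2,
        "next_step": 1.0 if "we will" in text else 0.3,
    }


def toy_revise(draft: str, criterion: str) -> str:
    """Stand-in for an agent: append a sentence aimed at one criterion."""
    additions = {
        "empathy": " Sorry for the trouble.",
        "next_step": " We will follow up tomorrow.",
    }
    return draft + additions[criterion]


final, history = improve("Ticket closed.", toy_score, toy_revise)
print(final)
print(len(history))  # 3 scoring rounds: two fixes, then a clean pass
```

The design choice that matters is that each revision is aimed at one named criterion. A pass/fail check could only tell the loop to try again.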

I keep circling one question. If you already trust your suite to tell you that nothing broke, what would you have to build to know, with that same confidence, that it’s getting better?

Verifiable Agents Engender Trust in Systems

Jonathan Lebensold — Sat, 13 Jun 2026 12:32:29 GMT

A customer asked me a sharp question on a call last week. Why generate a batch of possible solutions and prune down to the good one? Can’t you spend that same compute getting the answer right the first time? Generating ten and throwing away nine looks wasteful. One careful pass looks like the disciplined engineering choice, the way you’d want a senior person to work.

There’s some math about increasing the variance of outcomes that we could get into, but fundamentally this post is about trust.

We trust colleagues who tell us when they are wrong and then fix their mistakes. Think about who you actually rely on at work. It isn’t the person whose code never has bugs, because that person doesn’t exist; it’s the person who says “I broke the build, I’m on it” before you have to go find them. That honesty is only possible because there is a shared standard both of you can check against, whether the test suite went red or the customer complained or the number moved. Take away the standard and you get opaque responses like “looks great” every time, and you end up with a new job: quietly verifying everything yourself.

An agent with a shelf full of skills and no evaluation is that colleague. I wrote a while back about watching an agent run a six-step pipeline, skip two of the steps, and declare the whole thing complete with the confidence of someone who had finished. Because the agent didn’t have criteria for when it should terminate, it had no way of determining whether it had completed the workflow correctly.
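A termination criterion can be as plain as a checklist the agent cannot mark complete on its own say-so. Here is a minimal sketch, assuming each step’s output is verified by something outside the agent; the step names and the `RunRecord` class are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Tracks which required steps ran and had their output verified."""
    required_steps: list[str]
    verified: dict[str, bool] = field(default_factory=dict)

    def mark(self, step: str, output_verified: bool) -> None:
        self.verified[step] = output_verified

    def missing(self) -> list[str]:
        """Steps that never ran, or ran without a verified output."""
        return [s for s in self.required_steps if not self.verified.get(s, False)]

    def is_done(self) -> bool:
        return not self.missing()


# A hypothetical six-step pipeline where two steps were skipped.
record = RunRecord(["fetch", "parse", "validate", "transform", "load", "summarize"])
for step in ["fetch", "parse", "load", "summarize"]:
    record.mark(step, output_verified=True)

print(record.is_done())  # False
print(record.missing())  # ['validate', 'transform']
```

With a record like this, “complete” is a claim the system can check rather than a tone of voice.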

Having some grounding in an agentic workflow is the precondition to building trust, and the evaluations are a mirror. Without a reflective step, the agent cannot be honestly wrong, which means it is something you have to babysit forever.

The idea of countervailing objectives keeps coming up. Whether it’s a primal/dual Lagrange formulation, generators and discriminators in GANs, exploration and exploitation in bandits, or an actor/critic in RL, the need for a source of variance and a means to control that variance keeps recurring in AI.

My belief is that having these two signals for learning ends up breaking up the problem in a way that reduces the overall predictive power, and ultimately computing resources, required to get to a good solution.

For most useful work, checking whether an answer is good is far easier than producing a good answer from nothing, the way verifying a filled-in Sudoku takes a glance while solving the blank one takes concerted effort. Generation got cheap and verification didn’t keep pace, and that gap is the whole reason generate-and-prune beats the single careful shot. When AlphaCode competed in programming contests, it didn’t reason its way to one elegant program. It generated something close to a million candidates per problem and then filtered them down, because the generator on its own was mediocre and the generator plus a verifier was competitive with humans.
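The Sudoku comparison is easy to see in code: the checker below is a dozen lines, while a solver needs search. This is a minimal sketch of the verification side only, and the example grid comes from a standard shifting pattern rather than a real puzzle.

```python
def is_valid_solution(grid: list[list[int]]) -> bool:
    """Check a filled 9x9 Sudoku: each row, column, and 3x3 box holds 1..9 once."""
    digits = set(range(1, 10))
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    return all(group == digits for group in rows + cols + boxes)


# A valid grid built from a shifting pattern (not a published puzzle).
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Swap two cells in the first row: the row still looks fine, two columns do not.
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]

print(is_valid_solution(solved))  # True
print(is_valid_solution(broken))  # False
```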

Explore then Exploit

After a few years of training models before I built developer tools, the thing that finally made this click for me was a framing that comes from reinforcement learning. You cannot exploit a solution you have never generated. A single careful pass is pure exploitation of whatever the model already thinks is most likely, so when that most-likely answer is wrong, no amount of extra compute poured down the same path finds the better answer sitting at lower probability. You have to explore before you can exploit, which is what generating many candidates is for. The verifier is how you cash in on what the exploration turned up.
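A toy version makes the point. Assume a generator whose most likely answer is wrong and whose correct answer sits at 30% probability, plus a verifier that can check any candidate. The distribution is invented, and the samples are assumed independent, which real model samples only approximate.

```python
import random


def greedy(dist: dict[str, float]) -> str:
    """Exploit only: return the single most likely candidate."""
    return max(dist, key=dist.get)


def best_of_n(dist: dict[str, float], verify, n: int, rng: random.Random):
    """Explore by sampling n candidates, then keep the first one that verifies."""
    candidates = rng.choices(list(dist), weights=list(dist.values()), k=n)
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None  # nothing in this batch passed the verifier


def chance_of_hit(p_correct: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1 - (1 - p_correct) ** n


# Invented answer distribution for "what is 17 * 24?"; the correct answer is 408.
dist = {"398": 0.6, "408": 0.3, "418": 0.1}


def verify(answer: str) -> bool:
    return int(answer) == 17 * 24


print(greedy(dist))                                   # 398, confidently wrong
print(best_of_n(dist, verify, 64, random.Random(0)))  # 408
print(round(chance_of_hit(0.3, 10), 4))               # 0.9718
```

Running the greedy path again never changes its answer. Ten samples plus a verifier find the 30% answer about 97% of the time.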

Connecting Verifiers to Skills

Which brings us to skills, because the industry has fallen in love with the wrong half.

A skill is a generator. A CLAUDE.md rule or a tuned markdown guide shifts the distribution you sample from toward better output, and that is genuinely useful work. But a skill is not a verifier, and stacking more of them never turns it into one. The agent gets no signal from them, and neither do you: if all you have is skills, you cannot tell whether last month’s changes left the system better or worse, which is the whack-a-mole loop where every fix feels like progress and nothing compounds. An evaluation closes that gap because it is the one artifact that can say “this run failed, here is the check it failed”.

There is a quieter cost to the skills-only approach, and it surfaces the day you try to leave. Every rule you add is written against one harness, so it lives in that tool’s config and assumes that tool’s defaults, and when you swap your agent for a different one next year the rules don’t come with you. An evaluation has no such problem, because “here is what good output looks like and here is how to check it” is a statement about your problem rather than about your tooling.

The Harness is Disposable

The generator is shaped like the harness and is disposable; the verifier is shaped like the work and outlives every harness you run it through. If you are going to invest somewhere durable, invest there. None of this means generation doesn’t matter. A better generator paired with a good verifier beats a worse one every time, and skills are exactly how you get the better generator. The point is narrower, and easy to miss while everyone is shipping skill libraries: the thing you are actually buying when you build generate-and-verify isn’t a marginally better answer, it is a system that can tell you when it failed. That property is the precondition for trust, and trust is the only thing that lets you stop checking the work by hand.

Valley-Dodging: The Half of Agent Reliability You're Not Optimizing

Jonathan Lebensold — Mon, 08 Jun 2026 13:35:05 GMT

You can ship an agent that passes every eval you wrote and still get paged at 2am, because a runbook scoring 88% with a green test suite tells you nothing about the one run in twenty where it does something you’d never have signed off on: it wipes a directory it had no business touching, or drops a customer record into a log where anyone can read it. Almost everything we do to make an agent better is aimed at lifting those scores and improving our workflow. We chase the number up: we hill climb. But chasing the number is only half the story.

The people who train models for a living have learned the other half the hard way, and I spent a few years as one of them before I built developer tools. The signal that kept causing trouble was never the reward for good behavior; it was the penalty for bad behavior. What you do with the runs that went wrong, it turns out, matters more and behaves worse than what you do with the runs that went right.

When the downside can’t be undone

Why should one run in twenty outweigh the other nineteen? Because some failures don’t average out. Nassim Taleb calls the kind that matters an absorbing barrier: a state that, in his words, “prevents people with skin in the game from emerging from it,” a loss that swallows every gain that came before it and forecloses every gain that might have come after. Bankruptcy is one. So is a deleted production database, or a customer record that has already been scraped out of a log. Once an agent can reach a state like that, the average stops being a number you can trust, because Taleb’s rule for ruin is blunt: when the downside is unrecoverable, cost-benefit analysis no longer applies. Sequence is what does the damage, since ninety clean runs followed by the one that wipes the database never settle into a comfortable expected value: after the wipe, there is no ninety-first run.
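The arithmetic behind “one run in twenty” is worth writing down. This sketch assumes every run hits the barrier independently with the same probability, which is a simplification; the function names are mine.

```python
def survival_probability(p_catastrophe: float, runs: int) -> float:
    """Chance of getting through `runs` independent runs without hitting the barrier."""
    return (1 - p_catastrophe) ** runs


def expected_clean_runs(p_catastrophe: float) -> float:
    """Mean number of clean runs before the first catastrophe (geometric distribution)."""
    return (1 - p_catastrophe) / p_catastrophe


p = 1 / 20  # one catastrophic run in twenty

print(round(survival_probability(p, 20), 3))   # 0.358
print(round(survival_probability(p, 100), 4))  # 0.0059
print(round(expected_clean_runs(p), 1))        # 19.0
```

A 95% per-run success rate looks fine as an average. Compounded over a hundred runs, it leaves well under a 1% chance of never having hit the barrier.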

We still need to put fences around optimization death valley.

Why the negative signal is worth the danger

None of this means you should simply suppress the bad runs, because the negative signal is also where a surprising amount of the improvement comes from: penalizing failure, done well, moves a model toward good behavior faster than reward alone can. The trouble is that it is powerful in both directions. When you reward good trajectories at full strength the worst case is dull, because nothing in that direction runs away and the model just learns slowly; when you penalize bad trajectories the same careless way the math comes apart, because the penalty term has no floor and will drive the model into garbage if you let it. My own brush with this was a paper, Tapered Off-Policy REINFORCE, with Marc Bellemare and Nicolas Le Roux, whose whole result was a way to keep the penalty’s value while bounding its blast radius. But TOPR is one data point in a broad body of work that keeps reaching the same conclusion: the positive and negative signals need different machinery, and the negative one is the half that demands care.
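Here is a toy illustration of “no floor”. It is not the algorithm from the paper. It assumes a two-action softmax policy with one bad action and compares a full-strength penalty against one scaled by a weight clipped to [0, 1]. The clipping rule, learning rate, and behavior probability are illustrative assumptions of mine.

```python
import math


def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))


def penalize(steps: int, lr: float = 0.1, taper: bool = False,
             behavior_prob: float = 0.5) -> float:
    """Gradient ascent on R * log pi(bad) with R = -1.

    The policy has two actions; the good action's logit is fixed at 0, so
    pi(bad) = sigmoid(theta). Returns the bad action's final logit.
    """
    theta = 0.0
    for _ in range(steps):
        pi_bad = sigmoid(theta)
        grad_log_pi = 1 - pi_bad  # derivative of log sigmoid(theta)
        # Tapered: shrink the update as the policy stops choosing the bad action.
        weight = min(pi_bad / behavior_prob, 1.0) if taper else 1.0
        theta += lr * weight * (-1.0) * grad_log_pi
    return theta


naive = penalize(1000)
tapered = penalize(1000, taper=True)

print(round(naive, 1))    # keeps falling by about lr per step, without limit
print(round(tapered, 1))  # flattens once the bad action is already rare
```

In the untapered case the logit keeps marching down long after the bad action has effectively vanished, and in a real model that pressure lands on everything else the parameters encode. The tapered update carries the same signal but stops pushing once the lesson has landed.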

Valley-dodging is the other half

So the number is not enough, because chasing a better average is hill-climbing, and reliability needs the opposite motion, which barely has a name. Call it valley-dodging: the work of keeping the agent away from the runs you can’t take back. You are already doing it by hand. Open any runbook that has survived real traffic and you’ll find it full of lines like “do not skip this step,” “never write outside the results directory,” “do not report success if a check failed.” Nobody wrote those to raise a score; each one is a fence around a specific cliff something fell off of once, in a way that hurt. (I keep a “never run git push” line in one runbook because an agent, told to commit its work, once helpfully pushed a half-finished branch to main.) Given that the downside is an absorbing barrier, treating a catastrophic failure as far heavier than an ordinary miss isn’t a bias to correct for; it’s just correct accounting.

The tooling, though, still only knows how to climb. When `/optimize-runbook` reads your trajectories, it finds what’s dragging the pass rate down and proposes edits that pull it back up, which works right up until it has to tell a slightly worse answer apart from an unrecoverable one, because both reach it as the same lower number. And the answer is not to pile on prohibitions, because if you stack enough hard “never do this” rules the agent goes brittle: it starts refusing reasonable asks, or it honors the letter of every rule while walking past the point of the task. That is the same failure as the unbounded penalty in training, so the craft is the same: keep the negative signal, but bound it.

What I keep turning over is whether a runbook should declare its failure modes the way it declares its steps, as a short typed list with severities, so the optimize loop could weigh one catastrophe against ten ordinary misses instead of folding everything into a single pass/fail number.
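To make the question concrete, here is what such a declaration might look like. This is a thought experiment, not an existing runbook feature: the `FailureMode` type, the severity weights, and the failure names are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Illustrative weights; a catastrophe should dwarf any pile of ordinary misses."""
    MINOR = 1
    MAJOR = 10
    CATASTROPHIC = 1000


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: Severity


# Hypothetical failure modes a runbook could declare next to its steps.
FAILURE_MODES = {
    mode.name: mode
    for mode in [
        FailureMode("skipped_optional_step", Severity.MINOR),
        FailureMode("reported_success_after_failed_check", Severity.MAJOR),
        FailureMode("wrote_outside_results_dir", Severity.CATASTROPHIC),
    ]
}


def weighted_cost(observed: dict[str, int]) -> int:
    """Sum of occurrences weighted by declared severity; unknown names raise KeyError."""
    return sum(FAILURE_MODES[name].severity * count for name, count in observed.items())


def has_catastrophe(observed: dict[str, int]) -> bool:
    return any(
        FAILURE_MODES[name].severity is Severity.CATASTROPHIC and count > 0
        for name, count in observed.items()
    )


ten_misses = {"skipped_optional_step": 10}
one_wipe = {"wrote_outside_results_dir": 1}

print(weighted_cost(ten_misses), has_catastrophe(ten_misses))  # 10 False
print(weighted_cost(one_wipe), has_catastrophe(one_wipe))      # 1000 True
```

Under a single pass rate, the first batch looks ten times worse than the second. With declared severities the ordering flips, and that ordering is the one you want the optimizer to see.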

A Pelican Learns to Ride

Jonathan Lebensold — Fri, 22 May 2026 16:47:00 GMT

People ask, occasionally, why our logo is a pelican. A company called Jetty, a bird on the dock — there must be a story. The honest answer is anticlimactic, and I’ll get to it.

The reason the question comes up at all is Simon Willison. For the last year or so, he’s been asking every new model to generate an SVG of a pelican riding a bicycle and posting the results. It’s become the unofficial visual reasoning test for frontier models. So when people see the Jetty mascot, the assumption is that we picked the pelican as a wink at the benchmark.

We didn’t. Our designer worked through a stack of jetty-adjacent motifs — boats, ropes, lighthouses, a few different seabirds — and the pelican was the one the team kept coming back to. That decision predates the benchmark mattering to us. The overlap is coincidence. A good one, but coincidence.

It is, however, the kind of coincidence you don’t ignore.

So we ran the benchmark

Simon’s test is a good one. It exercises layout, anatomy, two distinct subjects, motion, and the model’s ability to keep coherent state across hundreds of XML elements. It’s visual enough that you can tell at a glance whether the model got it.

We ran the same task eighteen times — eleven times with one model while iterating the prompt, seven times with one prompt while iterating the model — and captured every trajectory. The result is at pelicans.jetty.bot. Seven agent/model combinations. Seventy-one runbooks across seven lineages. Every iteration replayable.

Hermes Agent + Gemini 3.5: the chonkiest pelican on the jetty.

The point isn’t the pelicans. The point is what you can do once each attempt is a structured artifact instead of a screenshot in a tweet.

Three views

The Climb ranks all seven agent/model combos on the same v1 runbook. Scatter chart, best-of-ten table. You can see the shape of how each agent’s score moved across its ten rounds — not as a final number, but as a curve.

OpenCode hill-climbing

Head-to-Head is a filmstrip viewer. Arrow keys step through rounds; up/down switches agents. You’re watching the same task unfold ten times in parallel for each lineage, and the convergence is its own kind of evidence. Some lineages crawl. Some snap into shape on round three and then drift.

Runbook Diffs is the view I keep coming back to. The runbook carries a baseline SVG embedded as a seed — so iterating the runbook means editing that seed plus the targeted prompt asks. Pick any two versions across the seven lineages and diff them line by line, side-by-side or unified. It looks exactly like reviewing a pull request. Because that’s what it is.

Jetty’s agent is hill climbing to a better pelican runbook.

The judge gap

A gemini-cli run scored 40 out of 40 on the rubric. Pelican, bicycle, composition, polish — four perfect tens.

Then we re-judged the same SVG with Claude Sonnet. It came back at 37.

Same image. Same four-axis rubric. Different judge. Gemini Flash, used as the judge, awards full marks freely. Sonnet, scoring the same SVG against the same axes, caps out around 37 and rarely gives a perfect score. The temptation is to call one of them right. Neither of them is. They’re calibrated differently. Flash’s 40 isn’t the same number as Sonnet’s 40, and pretending otherwise is how leaderboards lie.

This is the practitioner problem with LLM-as-judge that nobody puts on the box. At some point you’ll want to compare scores across two evaluations — same rubric, different runs — and reach a conclusion. If the judge changed underneath you, even silently, the numbers are no longer on the same axis. You either lock the judge for the lifetime of the comparison, or you design the rubric so swapping doesn’t matter. Most rubrics don’t survive that test.
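Locking the judge can be enforced mechanically by refusing to subtract scores that came from different judges or rubric versions. This is a minimal sketch with hypothetical record fields and placeholder judge labels, not how pelicans.jetty.bot stores its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    run_id: str
    judge: str   # identifier of the model used as the judge
    rubric: str  # rubric version
    total: int


class IncomparableScores(ValueError):
    """Raised when two scores were not produced on the same axis."""


def score_delta(before: Evaluation, after: Evaluation) -> int:
    """Difference in total score, allowed only when judge and rubric match."""
    if before.judge != after.judge or before.rubric != after.rubric:
        raise IncomparableScores(
            f"{before.run_id} ({before.judge}, {before.rubric}) vs "
            f"{after.run_id} ({after.judge}, {after.rubric})"
        )
    return after.total - before.total


# The same SVG scored by two judges, using the totals from this post.
flash = Evaluation("pelican-svg", judge="gemini-flash", rubric="four-axis", total=40)
sonnet = Evaluation("pelican-svg", judge="claude-sonnet", rubric="four-axis", total=37)

try:
    score_delta(flash, sonnet)
except IncomparableScores as err:
    print("refused:", err)
```

Storing the judge alongside every score costs one field. Without it, a silent judge swap looks like a three-point regression.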

We didn’t fix it. We just made the gap visible.

What this kind of artifact is for

There’s a broader thing here that has nothing to do with pelicans.

Most benchmark posts give you a headline number and a screenshot. Sometimes a GitHub repo with the prompts. What you can’t do is replay the attempt, see what the model produced on round three before converging, or diff version four of the runbook against version seven and ask which edit moved the score. The interesting questions live in the trajectories, and the trajectories don’t exist.

When every iteration is captured, the benchmark becomes inspectable in the way code is inspectable. The runbook is the source. The trajectory is the build artifact. The diff between two runbook versions is a PR you can review, comment on, and revert.
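If runbooks are plain text, the diff view needs nothing exotic. This sketch uses Python’s standard `difflib`; the two runbook bodies and the file names are invented for illustration.

```python
import difflib


def runbook_diff(old: str, new: str,
                 old_name: str = "runbook-v4.md",
                 new_name: str = "runbook-v7.md") -> str:
    """Unified diff between two runbook versions, as a single string."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(lines)


# Invented runbook text: version seven tightens one instruction and adds another.
v4 = "Draw a pelican.\nPut it on a bicycle.\nReturn SVG only.\n"
v7 = "Draw a pelican with a visible pouch.\nPut it on a bicycle.\nKeep both wheels on the ground.\nReturn SVG only.\n"

print(runbook_diff(v4, v7))
```

Pair each diff with the score change between the two versions and you can start asking which edit moved the number.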

The pelican project is small enough to fit on one page and weird enough to share. The structure underneath is the part worth stealing.

Lights-Out Manufacturing Had a Brake Pedal

Jonathan Lebensold — Sat, 16 May 2026 14:26:49 GMT

An agent I was watching ran a six-step extraction pipeline. It made it to step three, hit a schema mismatch, and “recovered” by skipping to step five. Then it wrote a polished summary citing results from a step that never ran.

$54 in API spend. Three days for someone to figure out why the downstream dataset was corrupted. The trace was sitting in Langfuse the whole time.

I spent this week at Web Summit in Vancouver talking to engineering leaders driving AI adoption inside their companies. Every one of them had a version of this story — and the common thread was that the agent habits forming inside their dev environments are landing in production unchanged. There’s no staging environment for agent behavior. A skill that silently skips a step on a developer’s laptop silently skips it in production — at customer scale, with customer data.

This is the failure mode missing from the conversation about autonomous AI. Not “the model can’t do the task.” The model can do the task. The system around it doesn’t know when the task wasn’t done — and the same gap that costs $54 in development costs orders of magnitude more in production.

The lights are going off anyway

There’s a name for what the industry is building toward. The dark factory.

At the foot of Mount Fuji, FANUC’s Oshino plant produces fifty industrial robots per twenty-four-hour shift and runs unmanned for up to thirty days at a stretch. Gary Zywiol, the company’s vice president, has a quote about it that’s now part of the canon: “Not only is it lights-out, we turn off the air conditioning and heat too.” Siemens runs lights-out plants. Xiaomi opened an 81,000-square-meter unmanned smartphone facility in 2024 that produces ten million units a year with no humans on shift. The International Federation of Robotics put China at over two million factory robots in 2024 — fifty-four percent of global demand. A peer-reviewed survey in *Engineering Manufacture* this year called lights-out factories “the pinnacle of manufacturing advancement.”

The software industry is racing toward the same idea from the opposite direction. BCG Platinion has been writing about *The Dark Software Factory* — AI agents handling planning, coding, testing, deployment, with no human on the PR. Stripe is already operating at 1,300-plus AI-authored pull requests per week. The destination isn’t theoretical. The lights are going off.

Here’s what I keep coming back to about Oshino, Xiaomi, and every other plant where the lights are already off: they work because the operational half of the stack existed first.

What made lights-out actually work

Specified tolerances and statistical process control. QA loops you instrument before you turn off the lights. Shutdown criteria — what halts the line, what alarms, what gets quarantined. Calibration schedules nobody can override.

None of these are sexy. All of them existed before the humans left the floor. Olivier and Craig’s 2017 IEEE AFRICON paper Lights-out process control — analysis and framework puts it bluntly: every unmanned production line is wrapped in a deterministic decision layer that knows when to halt, alarm, retry, or quarantine. The robots aren’t the achievement. The discipline around the robots is.
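For an agent pipeline, a deterministic decision layer could be as small as one pure function. The policy below is a sketch of my own, not the framework from the paper: the inputs, their precedence, and the retry limit are all assumptions.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    RETRY = "retry"
    QUARANTINE = "quarantine"
    ALARM = "alarm"
    HALT = "halt"


def decide(step_ok: bool, output_valid: bool, attempts: int,
           max_attempts: int = 3, irreversible: bool = False) -> Action:
    """Map a step's outcome to exactly one action, with no model in the loop."""
    if step_ok and output_valid:
        return Action.CONTINUE
    if irreversible:
        return Action.HALT        # never retry or route around an irreversible step
    if step_ok and not output_valid:
        return Action.QUARANTINE  # keep the bad output away from downstream steps
    if attempts < max_attempts:
        return Action.RETRY
    return Action.ALARM           # retries exhausted, so a human needs to look


print(decide(step_ok=True, output_valid=True, attempts=1))    # Action.CONTINUE
print(decide(step_ok=False, output_valid=False, attempts=1))  # Action.RETRY
print(decide(step_ok=False, output_valid=False, attempts=3))  # Action.ALARM
print(decide(step_ok=True, output_valid=False, attempts=1))   # Action.QUARANTINE
```

The table has no row for “skip ahead and summarize anyway”. That omission is the point of making the layer deterministic.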

The manufacturing industry spent forty years building that layer. There’s another IEEE paper from 2017, Industrial robotics in factory automation: from the early stage to the Internet of Things, that traces the arc — from the first numerically-controlled machines through the SCADA era and on into the modern smart factory. The throughline is the same in every chapter. The operational layer always shows up before the lights go off. Not after.

AI is doing it backwards. The agents are loose in production. The operational primitives — observability that surfaces patterns instead of just collecting data, durable orchestration that survives a crash on step seventeen, end-to-end evals that catch regressions before customers do, output verification that knows what “done” looks like — are still being hand-built by every team independently.

Dark factories without observability and evaluation are motion without progress.

What’s actually missing

In May, Tobi Coker at Felicis published a survey of twenty-three AI-native engineering leaders that put numbers on this. *The AI Stack Is Half-Built*. It’s worth reading in full.

The headline that’s stayed with me: 69.6% of teams doubled their inference spend in six months. 34.8% now report inference running at five times their training cost. And 56.5% of those teams are managing the fastest-growing line item in modern software with home-built spreadsheets and custom dashboards. The infrastructure category that should exist doesn’t — yet.

47.8% of the surveyed teams are running autonomous agents in production. One respondent named the missing primitive plainly: “retry this 20-step agent run from step 10.” Nobody ships that. So teams duct-tape frameworks together. “Stitching it,” another engineer wrote, “still takes too much custom glue.”
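The primitive that respondent is asking for is easy to state in code, even though production versions are much harder. This sketch assumes each step is a function from state to state and that state is JSON-serializable; `run_from`, the checkpoint file naming, and the simulated failure are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Step = Callable[[dict], dict]


def run_from(steps: list[Step], checkpoint_dir: Path, start: int = 0) -> dict:
    """Run steps[start:], saving state after each step so a later call can resume."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    state: dict = {}
    if start > 0:
        # Resume from the state saved by the step just before `start`.
        state = json.loads((checkpoint_dir / f"step-{start - 1}.json").read_text())
    for index in range(start, len(steps)):
        state = steps[index](state)
        (checkpoint_dir / f"step-{index}.json").write_text(json.dumps(state))
    return state


failures = {"count": 0}


def make_step(i: int) -> Step:
    def step(state: dict) -> dict:
        if i == 10 and failures["count"] == 0:
            failures["count"] += 1
            raise RuntimeError("transient failure at step 10")
        return {"trace": state.get("trace", []) + [i]}
    return step


steps = [make_step(i) for i in range(20)]
workdir = Path(tempfile.mkdtemp())

try:
    final = run_from(steps, workdir)
except RuntimeError:
    final = run_from(steps, workdir, start=10)  # steps 0 to 9 are not re-run

print(final["trace"] == list(range(20)))  # True
```

Real agent runs add the hard parts this leaves out: non-serializable tool state, side effects that already happened, and deciding whether the checkpoint itself can be trusted.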

45% named evaluation as their single biggest unsolved problem. The hard part isn’t testing individual LLM calls — it’s running evals across the entire system, end-to-end, with state and tool calls and branching. 57.9% would rather build their own eval framework than buy an existing one. Not because they want to. Because nothing they’ve found does what they need.

What’s striking isn’t any individual number. It’s that the three gaps describe the same hole from three angles. Observability that doesn’t surface patterns. Orchestration that can’t recover from a partial failure. Evaluation that can’t measure the system end-to-end. These aren’t three problems. They’re one problem — the missing operational layer — measured by three groups of people who don’t yet have a shared name for what they’re missing.

I’ve spent the last year auditing production Langfuse projects. We’ve crossed 100,000 traces across voice AI, agentic tooling, healthcare pipelines, extraction services. The patterns the Felicis survey describes aren’t visible only from the top down. They show up in every audit. Redundant API calls because nobody had a cache. System prompts resent on every turn because nobody had configured prompt caching. Five GPT-4o versions running simultaneously because pinning was easier than maintaining. Agentic workflows accumulating $50 in a single trace because the operational layer that should have summarized intermediate context wasn’t there.

The leaders in the survey and the production systems I see in audits are looking at the same thing. They just don’t have the vocabulary to coordinate.

We’ve done this before

The agile movement spent fifteen years building automated testing, feature flags, blue-green deploys, canaries, and continuous delivery before deploy-on-merge became safe enough to be the default. I lived through the front half of that arc. The pattern was always the same: a new way of shipping arrived, the old operational discipline didn’t fit, and the industry spent a decade reverse-engineering what “done” and “broken” should mean in the new model.

Big Bang Oracle releases. Friday-night dread. The rollback Saturday. The post-mortem Monday. The connection between that older pattern and the modern foundation-model-version-swap dread is direct, and I’ve written about it elsewhere. The shape of the failure is the same. So is the shape of the fix.

The fix was never a new framework. It was operational discipline written down where everyone could read it. Test suites. Runbooks. Smoke tests. Verification scripts. The lift wasn’t in inventing new tools. It was in agreeing on what done meant and refusing to call something done until it passed.

The contrarian bet

Here’s what I think is the easy bet to miss right now.

Coker’s three gaps will eventually become VC-funded categories. Inference observability will get its Datadog. Agent orchestration will get its Spinnaker. End-to-end evaluation will get something — maybe two or three things — that finally does for AI what unit tests did for code. The market will sort it out. Some bets will pay off. Most won’t. The category-defining product for each of these gaps doesn’t exist yet, which is exactly why VCs are paying attention.

But the teams that pull ahead aren’t waiting for the category-defining product. They’re building the operational layer with what they already have.

And the operational primitive we keep returning to — across hundreds of customer trajectories and a year of audits — is a markdown file with a rubric and a bash verification script at the bottom.

That’s not a slogan. It’s what survives load. The sophistication isn’t in the format. It’s in what the document encodes: what “done” means, how to evaluate against it, what to do when it falls short, when to stop trying. We call this a runbook because the older shape it most resembles is the operations runbook on the wall behind a NOC. But it’s also a test suite. And a spec. And a quality gate. And a refusal to call something done until it passes.

The contrast is funny. The industry is racing to build increasingly elaborate agent frameworks — graph planners, multi-agent orchestrators, memory layers, MCP everything — while the most reliable primitive in our toolbox is plain text with a checker. Markdown works with every agent. It’s version-controlled. It’s diffable. It’s reviewable in a pull request. It doesn’t need a runtime. It’s the lightest possible structure that still encodes the operational discipline a dark factory actually requires.

The dark factory is coming. The lights are going off in software the way they went off in manufacturing. But Oshino doesn’t run because the robots are smart. It runs because the deterministic decision layer around them never lets a bad part through.

If you’re building the AI version of that — and if you’re shipping agents into production, you are, whether you’re using that vocabulary or not — the operational half of the stack is your problem. Not Coker’s category-defining vendor. Not next quarter’s better orchestration framework. Yours. This week.

You need four pieces. The dashboard, which most teams already have. The runbook, which does the operational lift underneath. The PR, which is the artifact the team reviews and merges. And — bluntly — the eval. That’s your brake pedal.

Build those four. The rest of the stack can be half-built indefinitely.

What I keep wondering is where the ceiling is. If a markdown file with a checker is good enough for everything we’ve thrown at it so far, where does it break — and what does an agent task look like when plain text and a bash script stop being enough?

Research Closes the Loop. Production Keeps Us In It.

Jonathan Lebensold — Fri, 08 May 2026 10:16:29 GMT

An ICLR Oral just showed that reflective prompt evolution beats reinforcement learning.

The paper, GEPA, describes a search procedure where a language model reads its own rollouts and proposes prompt edits in plain English. The search uses those edits as its mutation operator. The result beats GRPO — the RL baseline from DeepSeekMath — across three benchmarks: HotpotQA, AIME, IFBench. Fewer rollouts. No reward model.

The paper is right. I want to say that first, because most of what follows could read like a disagreement, and it isn’t. The inner generate-judge-refine loop GEPA describes is the same loop a Jetty runbook runs on every execution. The mechanism is real and it generalizes.

What’s interesting isn’t whether GEPA validates the runbook approach. It does. What’s interesting is the part of the loop GEPA closes that we deliberately keep open.

The inner loop is generator-verifier

GEPA’s contribution is the reflection step. A language model reads its rollouts and identifies the failure pattern in natural language. That diagnosis becomes the mutation operator the search uses to propose the next prompt.

This isn’t a one-off result. It’s the third optimizer the DSPy team has shipped, after bootstrap-fewshot and MIPROv2. The closest methodological cousin is TextGrad — backpropagating textual feedback through compound LLM systems. The evolutionary ancestor is Promptbreeder. Three years of lineage, not a single ICLR moment.

The thing that makes any of it work is generator-verifier asymmetry. I’ve written about this before — producing a good answer is hard; recognizing one is cheap; the gap between the two is the only reason iterative loops ever converge. The verifier — a rubric, a benchmark, a unit test — has to stay independent of the thing generating candidates. Without that independence, you’re not climbing a gradient. You’re staring into a mirror.

Our runbooks run this exact loop on every execution. The rubric step (“what does done look like, how do we check”) is the verifier. The bounded retry budget is the search. The agent reads its rubric failure and tries again. It stops when the rubric clears or the budget runs out. That’s GEPA’s inner loop, applied to the artifact instead of the prompt.
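The shape of that inner loop can be sketched in a few lines. This is an illustration under stated assumptions, not Jetty’s implementation: `run_until_rubric_clears`, `Verdict`, and the toy generator and verifier are hypothetical names, and the generator here is a plain function standing in for an agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str  # plain-language description of what failed


def run_until_rubric_clears(
    generate: Callable[[str], str],
    verify: Callable[[str], Verdict],
    budget: int,
) -> tuple[str, bool, int]:
    """Generate, check against the rubric, and retry with the feedback.

    Stops when the verifier passes or the retry budget runs out.
    Returns (last artifact, whether it cleared, attempts used).
    """
    artifact, feedback = "", ""
    for attempt in range(1, budget + 1):
        artifact = generate(feedback)
        verdict = verify(artifact)  # the verifier never sees the generator's reasoning
        if verdict.passed:
            return artifact, True, attempt
        feedback = verdict.feedback
    return artifact, False, budget


def toy_generate(feedback: str) -> str:
    # Stand-in for the agent: it only adds the license once told it is missing.
    draft = "name: demo-dataset"
    if "license" in feedback:
        draft += "\nlicense: CC-BY-4.0"
    return draft


def toy_verify(artifact: str) -> Verdict:
    # Stand-in for the rubric: one required field.
    if "license:" in artifact:
        return Verdict(True, "")
    return Verdict(False, "missing field: license")


artifact, cleared, attempts = run_until_rubric_clears(toy_generate, toy_verify, budget=3)
print(cleared, attempts)
```

The verifier is passed in as its own function on purpose. The generator can read the feedback, but nothing in the loop lets it edit the check.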

Two lenses on the outer loop

The split between GEPA and runbooks is in what happens across runs, not within one. Research optimizes for “did the score go up.” Production optimizes for “did the right people see the change before it shipped.”

In research you have a benchmark. The benchmark is ground truth. Closing the loop by auto-evolving the prompt against rollout data is unambiguously good. The eval signal is trustworthy and the cost of a regression is “the number went down on a graph.”

In production, however, you have a runbook running brand-compliance review on every marketing draft. The rubric is your team’s judgment compressed into prose. The eval signal is a proxy. It encodes what you decided “good” means three months ago.

Auto-evolving against that proxy means the runbook can drift in a direction nobody on the team noticed. Worse: if the system can rewrite the rubric and optimize against it, the verifier stops being independent of the generator. The asymmetry that made the inner loop work breaks down. You haven’t built a self-improving system. You’ve built a self-justifying one.

We could automate this. We chose not to.

This is the load-bearing point.

Jetty has all the parts. `/optimize-runbook` already reads trajectories and proposes runbook edits. Routines fire on a cron. Wiring up nightly auto-evolution against last week’s trajectories is a few config lines. The trajectory storage GEPA’s approach needs as a substrate is already first-class.

We didn’t ship that as the default. What ships is a runbook in git. Someone runs /optimize-runbook when they suspect drift. The diff lands in a PR like any other change. Closed loop available; open loop default.

This isn’t theoretical. Decagon ran GEPA on real customer-service prompts and wrote up what happened. Naive runs produced ~5,000-character prompts that overfit to small reflection sets. Smaller reflection models broke outright. The optimal sample range turned out to be 20–100, not “more is better.”

Their fix amounted to code review for prompts. They added a holdout set so the optimizer couldn’t lie about its own progress. Length regularization stopped the prompts from sprawling. They started treating the whole optimization like the test-driven engineering loops everyone already trusts for shipping software.
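To make those two guards concrete, here is a rough sketch of an acceptance gate that scores a candidate prompt on a holdout set and subtracts a penalty for length. It is not Decagon’s code. The function names, the 2,000-character limit, the penalty weight, and the keyword scorer are all assumptions chosen for illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (prompt, example) -> score in [0, 1]


def penalized_score(
    prompt: str,
    holdout: list[str],
    score: Scorer,
    max_chars: int = 2000,            # assumed soft length limit
    penalty_per_char: float = 0.0001,  # assumed penalty weight
) -> float:
    """Mean holdout score minus a penalty for characters beyond max_chars."""
    if not holdout:
        raise ValueError("holdout set must not be empty")
    mean = sum(score(prompt, example) for example in holdout) / len(holdout)
    overflow = max(0, len(prompt) - max_chars)
    return mean - penalty_per_char * overflow


def accept_candidate(current: str, candidate: str, holdout: list[str], score: Scorer) -> bool:
    """Accept only if the candidate beats the current prompt on examples
    the optimizer never saw during reflection."""
    return penalized_score(candidate, holdout, score) > penalized_score(current, holdout, score)


def keyword_score(prompt: str, example: str) -> float:
    # Toy scorer: did the prompt mention the topic of this holdout example?
    return 1.0 if example in prompt else 0.0


holdout = ["refund", "shipping"]
current = "Handle refund requests politely."
better = "Handle refund and shipping requests politely."
# A sprawling candidate: no better on the holdout set, about 5,000 characters long.
bloated = current + " Remember case 1234." * 250

print(accept_candidate(current, better, holdout, keyword_score))
print(accept_candidate(current, bloated, holdout, keyword_score))
```

The short candidate that actually covers more of the holdout set is accepted. The sprawling one is rejected because it gains nothing on unseen examples and pays for its length.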

That’s the production reality the academic version doesn’t have to think about.

The merge step is the feature

Production runbooks aren’t AIME problems. They’re operational artifacts that govern real outputs to real people, and the reasons to keep a human in the merge step are the same reasons engineering teams keep humans in the merge step for code.

Someone has to know the rubric tightened. Otherwise the team is debugging “why did our outputs change last Tuesday” with no change to point at. A runbook that silently rewrites itself can’t be pinned to a sprint or to an experiment, much less an audit window — you lose the ability to say “this output came from runbook v1.3.” When the runbook ships a bad output, “who changed this and why” needs an answer. A diff in git with a reviewer attached has that answer. A self-evolving prompt does not.

This is the PR-is-the-product thesis applied to the runbook itself. The runbook diff is the artifact your team reviews. The fact that an agent proposed the diff doesn’t change who owns the merge.

Engineers already trust this interface. Auto-formatters write code. Linters fix style. Your test runner can fail a build with no human signoff at all. All closed loops. Auto-merge to main? Not yet. Branch protection, CODEOWNERS, required reviewers — that’s the merge-policy interface that makes every other closed loop safe to ship. Without it, every commit is an unsupervised optimization against an untrusted verifier.

What’s actually open

GEPA’s claim isn’t wrong for its domain. If you’re tuning a prompt against a benchmark with trustworthy ground truth, closing the loop is the right move. The literature is also more contested than the ICLR headline suggests: Benjamin Anderson’s structural critique argues that agents don’t have the locality property that modular optimization assumes.

The thread to pull on is narrower. Where in the runbook lifecycle does the loop close, and where does the human stay? The honest answer depends on the cost of a silent regression. A runbook that drafts internal Slack messages can probably auto-merge proposed changes. A runbook that touches customer comms cannot. The interesting design question isn’t “should we automate.” It’s: what’s the merge-policy interface for runbooks, and what does it look like to declare it explicitly?
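One way to picture an explicitly declared policy is a small table keyed by who sees the runbook’s output. This is a thought experiment, not a Jetty feature. `Audience`, `MergePolicy`, `POLICIES`, and `may_merge` are hypothetical names, and the two tiers simply mirror the Slack and customer-comms examples above.

```python
from dataclasses import dataclass
from enum import Enum


class Audience(Enum):
    INTERNAL = "internal"                # e.g. drafts internal Slack messages
    CUSTOMER_FACING = "customer_facing"  # e.g. touches customer comms


@dataclass(frozen=True)
class MergePolicy:
    auto_merge: bool
    required_reviewers: int


# Assumed policy table: the cost of a silent regression decides where the loop closes.
POLICIES = {
    Audience.INTERNAL: MergePolicy(auto_merge=True, required_reviewers=0),
    Audience.CUSTOMER_FACING: MergePolicy(auto_merge=False, required_reviewers=1),
}


def may_merge(audience: Audience, approvals: int) -> bool:
    """Decide whether an agent-proposed runbook diff may merge right now."""
    policy = POLICIES[audience]
    return policy.auto_merge or approvals >= policy.required_reviewers


print(may_merge(Audience.INTERNAL, approvals=0))
print(may_merge(Audience.CUSTOMER_FACING, approvals=0))
print(may_merge(Audience.CUSTOMER_FACING, approvals=1))
```

The point of writing it down is that the policy itself becomes a diff someone reviews, in the same way branch protection and CODEOWNERS are.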

Patterns Were the Map in the Search for Beauty

Jonathan Lebensold — Tue, 28 Apr 2026 17:18:34 GMT

In October 1996, Christopher Alexander gave the keynote at OOPSLA — the conference where object-oriented programmers gathered every year to argue about software design. He had been invited as the patron saint of the patterns movement. His 1977 book A Pattern Language was the conceptual foundation the Gang of Four had drawn on for Design Patterns a few years earlier. The whole field was, in a real sense, his.

He used the keynote to tell them they had missed the point.

The form of the pattern — name, context, problem, solution — was something software had borrowed cleanly. What had been left behind was the purpose. Patterns weren’t taxonomies. They were instruments for generating a quality Alexander was by then comfortable calling beauty: the thing that makes a courtyard feel inevitable, or a doorway feel right at five in the morning.

He told the room he hoped they would eventually get there. They mostly didn’t. The community kept refining the form — anti-patterns, pattern catalogs, books and books of templates — and Alexander spent the next decade writing The Nature of Order, four volumes on the structure of beauty in built systems. The two camps never really reconnected.

The thing patterns can’t capture

Engineers spent two decades cataloguing every shape they could find: Singleton, Observer, Strategy, Chain of Responsibility. Enterprise software followed the same logic at the next level up. ERP. CRM. Document management. Workflow engine. Identity provider. Each one a discrete category with its own pattern language and its own vendor.

Then ask any senior engineer to describe a real production system. You get something stranger.

A claims-processing system at a regional health insurer is, on inspection, maybe 30% ERP — it tracks line items, holds money, runs against an accounting period. 40% CRM — it tracks individuals, decisions about them, communications history. 30% document management — every claim has scanned attachments, every appeal has a paper trail. The categories vendors use to sell software don’t describe the categories work operates in.

Every senior engineer I know has lived this. The vendor sells you the ERP. You implement it. Six months in, the shape of the work isn’t ERP-shaped — it’s something else the vendor’s data model can’t quite hold — and the spreadsheets and Slack channels start filling in the rest.

A partition is not a whole

It’s tempting to stop there and say: real systems are blends of categories, and patterns don’t catch blends. That framing concedes too much. A blend implies a partition — that the system can be cleanly carved up into 30% of this and 40% of that, summing to 100%. It still gives the categories the dignity of being the right axes.

That’s not what’s actually happening. The ERP-ness of the claims system is not separate from the CRM-ness. The line items only make sense alongside the individual whose decision created them. The communications history hangs on the document trail. And the document trail is paper stapled to nothing without the line items it justifies. The categories don’t slice a pie. They are codependent centers, and the codependence is the system.

Alexander had a name for this. In The Nature of Order he describes fifteen fundamental properties that he claims contribute to the wholeness of a built artifact — strong centers, deep interlock and ambiguity, not-separateness, among others. Wholeness, in his framework, is what happens when the parts of a system strengthen each other rather than tile. A beautiful courtyard is not a sum of wall and bench and tree. It’s a configuration where the wall makes the bench feel like a place to sit, the tree makes the wall feel like shelter — and somehow, the bench makes the tree feel placed rather than planted.

A claims-processing system is the same kind of thing, when it works. When it doesn’t, you bought three vendors and integrated them, and you got something that diagrams cleanly and operates badly. The seams show. The work happens in spreadsheets.

This is why patterns were a dead end. A pattern is, by construction, a separable unit. It can be named, lifted out of context, and reused. That’s the entire point. But the property Alexander cared about — wholeness, beauty, deep interlock — is exactly the property that does not survive separation.

Map and terrain

There’s a phrase from general semantics — the map is not the territory — that captures the same point operationally. A pattern is a map. The named, separable thing the catalog points at. The terrain is the system that has to actually run, with all its interpenetrating centers and codependencies and exceptions.

Workflow software is where this gets expensive. A workflow encodes a map: this step, then that step, then this branch, then that approval. It assumes the work decomposes into known stages in a known order. Where the work fits the map, that’s industrial-strength leverage. Where it doesn’t — where the operation is interpenetrating rather than sequential — the workflow forces the terrain into the shape of the map. That’s where you get the joke at every enterprise: there’s the system everyone officially uses, and the spreadsheet where the work actually gets done.

Workflow software didn’t create that gap. It industrialized it. It encoded the map into the system itself, which made a bad map cheaper to enforce and harder to bend.

Jason Stanley writes about this in governance terms. He sorts work governance into four modes — procedural (verify the steps), credentialing (verify the actor), gatekeeping (verify the output), reputational (trust accumulated over time). Alexander’s own practice was reputational: clients on the Eishin School trusted him because he’d built things they recognized as alive. That mode doesn’t transfer to a non-human builder. When an agent walks onto the site, governance has to fall back to gatekeeping — verifying the output instead of trusting the actor.

Planting flags

Alexander’s practice on his bigger projects — the Eishin School outside Tokyo, the Mexicali housing — was not to draw a blueprint and hand it to builders. He walked the site. He planted flags where buildings would go. The plan came out of the topology, not the other way around. He was, in his own vocabulary, finding the centers — the places the site wanted a building — and letting the buildings emerge from there.

This is the move I think agents finally make available on the software side.

The way you hand an agent a task whose right answer depends on the specific terrain of your data, your customers, your existing systems is not by writing a workflow. It’s by giving it a runbook — a quality bar in markdown — and a rubric that measures the gap between output and bar, then letting it figure out the procedure. The procedural details — step order, retries, routing — stop being something a human has to specify in advance. You plant flags. The agent fills in the building. This is the bet I’ve been making with Jetty.

A journalist isn’t measured on her first draft. A designer isn’t graded on her sketches. They’re judged on what comes out the other side, against a standard their editor or art director can articulate. We’ve governed knowledge work this way for a hundred years. We’ve never been able to govern software systems this way, because the systems couldn’t read the standard. Now they can.

What’s a beautiful system, then?

I don’t have a clean answer. I keep running into it without recognizing it. A workflow that handles the easy 80% and falls back to a runbook for the rest. An evaluation pipeline whose rubric makes the codependence of the centers visible. An agent that produces a pull request that fits the codebase as if it had always been there. None specified by category. All specified by outcome.

Markdown rubrics work for a single task. The honest open question is what the outcome specification looks like at the level of a whole system. A whole organization.

I suspect Alexander would say it’s some version of beauty, and that we’d recognize it when we saw it. The patterns weren’t going to get us there. We’ve had thirty years to find out.

My Backend is 442 Lines of Markdown

Jonathan Lebensold — Tue, 21 Apr 2026 03:37:36 GMT

A few weeks ago we shipped mlcroissant.jetty.bot — paste in a URL to an academic dataset paper, and get back an MLCommons Croissant metadata file. It extracts the dataset’s provenance, fields, and license, then packages everything as machine-readable JSON-LD. The backend is a 442-line markdown file. No pipeline code. No DAG. No orchestrator. One structured document telling an agent what to do, plus a single API call to run it. The runbook is public at mlcroissant.jetty.bot/runbook. The repo is at github.com/jettyio/pdf2croissant.

One run returned “completed” but produced no files.

The runbook was correct. The plumbing was broken. I couldn’t see which until I had the trajectory data showing exactly where the agent stopped. That failure taught me more about writing reliable agent instructions than any blog post about agent reliability ever has.

I’ve written before about runbooks as the missing layer between “call this API” and “accomplish this outcome” — the structured document you’d write for a competent new hire who needs to run your pipeline while you’re on vacation. This is what that looks like when you actually build something with one, failures included.

353 lines was not enough

The first version ran to 353 lines. The agent would read the paper, sometimes generate valid JSON-LD, and declare victory. One output file out of three. Sometimes zero. Status: completed.

Iteration 1: I added a verification script. After generating the Croissant file, the agent runs bash that checks: does the file exist? Is it valid JSON? Does it schema-validate? Each check prints PASS or FAIL.

The agent would check its work, see FAIL printed right there, and declare success anyway.
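For readers who want to see the shape of that verification step, here is a minimal sketch in Python. The runbook’s actual script is bash, and this is not it. `verify_output`, `report`, and `REQUIRED_KEYS` are hypothetical names, and the required-keys check is only an assumed stand-in for real Croissant schema validation.

```python
import json
import tempfile
from pathlib import Path

# Assumption: a few top-level JSON-LD keys stand in for full schema validation.
REQUIRED_KEYS = ("@context", "@type", "name")


def verify_output(path: Path) -> list[tuple[str, bool]]:
    """Run the three checks in order and return (check name, passed) pairs."""
    exists = path.is_file()
    parsed_ok, doc = False, None
    if exists:
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
            parsed_ok = True
        except ValueError:  # covers invalid JSON and undecodable bytes
            pass
    has_keys = isinstance(doc, dict) and all(key in doc for key in REQUIRED_KEYS)
    return [
        ("output file exists", exists),
        ("output is valid JSON", parsed_ok),
        ("required keys present", has_keys),
    ]


def report(results: list[tuple[str, bool]]) -> int:
    """Print PASS or FAIL per check and return a process-style exit code."""
    for name, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
    return 0 if all(ok for _, ok in results) else 1


with tempfile.TemporaryDirectory() as tmp:
    incomplete = Path(tmp) / "croissant.json"
    incomplete.write_text(json.dumps({"@type": "sc:Dataset"}), encoding="utf-8")
    exit_code = report(verify_output(incomplete))

print(exit_code)
```

On the incomplete file the first two checks pass and the last one fails. Returning a non-zero exit code is a design choice in this sketch: it gives whatever runs the agent something to gate on that does not depend on the agent choosing to respect a printed FAIL line.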

Iteration 2 felt undignified but worked. I added aggressive mandate language: “MANDATORY — do not skip. You MUST produce all output files. No exceptions.” This sounds like yelling at a computer. Mechanically, though, it’s not about tone — it’s about probability mass. Agents sample from a distribution of reasonable next actions. Stronger language shifts the distribution. “Declare victory despite the FAIL line” becomes less probable; “fix the problem” becomes more probable.

Iteration 3: converted prose parameter descriptions to structured tables. The first version had a paragraph describing what a Croissant file should contain. The agent would fill in what it could find and stop when the description got vague. The table version listed every field explicitly. No ambiguity left to exploit.

Here’s an example output from the runbook transforming CoralVQA into Croissant. The task didn’t get harder; the instructions got more precise.

The meta-loop

After enough runs to see failure patterns, I started using Jetty’s /optimize-runbook command to accelerate the cycle.

/optimize-runbook reads execution trajectories from previous runs and proposes targeted changes to the runbook — not abstract suggestions but specific findings: “4 out of 7 runs failed at schema validation because the agent omitted containedIn on FileSet objects. Add it to the Common Fixes table.” Or: “There’s a nested shell quoting bug in the verification script. The agent’s jq call fails when the dataset name contains a comma.”

The MLCommons Croissant Runbook—taking PDFs and turning them into JSON

Instead of my noodling through logs for thirty minutes, the skill found the bug in the trajectory data in seconds.

What struck me is that this loop is structurally identical to what the runbook asks the agent to do: run, evaluate against criteria, identify the weakest point, make targeted fixes, run again. The difference is just which landscape you’re on. The agent is hill-climbing the output quality. I’m hill-climbing the instructions.

Same algorithm, one level up.

The responsible AI angle

Here’s why I think this matters beyond the exercise of building a web app with a markdown backend.

Most ML datasets ship with a PDF and maybe a HuggingFace card. Human-readable. Not machine-readable. If you want to audit what a model was trained on — provenance, licensing, known limitations, bias documentation — and the answer lives in hundreds of PDFs in different formats, you can’t do that audit at scale. You can gesture at it.

The EU AI Act requires documentation of training data for high-risk AI systems. The NIST AI RMF points in the same direction. Structured dataset documentation is becoming a compliance requirement, not just good practice. Croissant is the format that makes that compliance automatable rather than a documentation project someone has to own and maintain by hand.

The gap is real. Tens of thousands of dataset papers. Almost none of them have Croissant files. The research exists; the structured representation of it mostly doesn’t. As I’ve argued in my post on CI for AI, the infrastructure for treating AI systems rigorously has to exist before you can use it — and machine-readable dataset metadata is about as foundational as it gets.

The scale question

Now I’m wondering how we can efficiently process entire conference proceedings, so that 3,000+ academic papers and datasets become a single data point for structured question answering. That’s the next experiment I’m thinking about.

The Model Arbitrage Opportunity

The final, and perhaps most valuable, angle to this approach is model arbitrage. By defining the agent’s task in a highly structured, portable runbook (a markdown file, in this case), we decouple the instruction from the specific agent or model that executes it.

The RUNBOOK.md becomes the singular, high-quality asset. I used a capable, often more expensive model to write and optimize the runbook—leveraging its superior reasoning to find failure patterns and refine the instructions over dozens of runs. This is the high-cost, high-value authoring phase.

Once the runbook is reliable, as our 442-line version is, the execution cost plummets. The refined runbook can be tested, evaluated, and executed by a smaller, cheaper, and faster model or agent that can simply follow the explicit, crystal-clear instructions. The runbook’s aggressive mandate language, structured tables, and explicit verification steps remove the need for constant high-level reasoning. The expensive model creates the reliable path; the cheaper model walks it.

This enables a clear model arbitrage strategy: pay for peak intelligence once to create the robust instructional layer, and then deploy for a fraction of the cost across thousands of execution runs, achieving reliability without the continuous expense of a top-tier model. It turns the runbook into a universal instruction set, allowing us to swap the underlying agent—the model—as new, cheaper, or more specialized options become available.

The Jagged Frontier Is an Evaluation Problem

Jonathan Lebensold — Mon, 13 Apr 2026 21:14:50 GMT

I was listening to The Economist’s Boss Class podcast a few months ago when I heard Ethan Mollick describe something I’d seen a dozen times but never had a name for. He called it the jagged frontier: the idea that AI capabilities don’t advance in a smooth, predictable line. A system that can write a compelling 2,000-word analysis of a balance sheet might fail to add up a column of five numbers. Not because it’s generally capable or generally incapable. Because its competence boundary is jagged. Full of unexpected peaks and cliffs.

I immediately thought of a team I’d worked with that built a financial analysis tool. Impressive thing — it could compare revenue multiples across a peer group, flag accounting anomalies, summarize MD&A sections in plain English. Genuinely useful. Then someone asked it to add up the quarterly earnings figures it had just pulled. It got the wrong answer. Not by a little. By a lot. The model had no idea why this was embarrassing.

Smooth vs. jagged

When humans have expertise, that expertise tends to be correlated. An accountant who’s excellent at financial analysis is probably decent at arithmetic. Not because accounting requires arithmetic (it does), but because the skills develop together, from the same training, in the same career. Competence in one area is a reliable signal of competence in adjacent areas.

AI systems don’t work this way. They have no such correlation. A model trained on vast amounts of financial text learns comparative analysis from millions of examples of comparative analysis. Whether it can add numbers depends entirely on whether the training process developed that particular capability — which is a separate question, running on a separate axis.

This is counterintuitive in a specific, dangerous way. When you hire a financial analyst and they produce excellent comparative analysis, you trust their arithmetic. That trust is almost always warranted. When you deploy an AI system that produces excellent comparative analysis, you might make the same inference. That inference is not warranted. And you won’t know it isn’t warranted until something goes wrong.

The BCG study

In 2023, Ethan Mollick and his colleagues at Harvard ran an experiment with 758 BCG consultants. Pre-registered, large sample — the kind of study that HN usually dismisses and couldn’t in this case. Consultants were assigned tasks either inside or outside AI’s frontier.

Inside the frontier: AI-assisted consultants produced work that was about 40% higher quality. They finished 25% faster. The productivity gains were real and substantial.

Outside the frontier, on tasks that looked superficially similar but fell outside what the AI could actually do well, it was a different story. AI-assisted consultants were 19 percentage points less likely to produce a correct answer than their unassisted counterparts. Despite working faster. The tool was confidently wrong in ways the consultants didn’t catch, because the outputs looked like outputs that should be right.

That’s the jagged frontier in a controlled experiment. The productivity gain and the quality loss exist simultaneously, for different tasks, in the same system.

Three zones, not two

The intuitive mental model for AI deployment is binary: tasks where AI helps, tasks where it doesn’t. Deploy where it helps, don’t deploy where it doesn’t. But the BCG study points to a third zone that’s more dangerous than either. Not “AI outperforms humans” and not “humans outperform AI” — there’s a third zone of tasks nobody thought to measure, where AI fails in ways nobody anticipated. Tasks that feel similar to the ones AI is good at. Tasks the team didn’t benchmark because they assumed they were covered. Tasks that only surface as failures once the system is in production.

This is the zone that catches teams off guard. Not the obvious failures. The non-obvious ones — the tasks that sit adjacent to your evals, just outside the mapped territory, invisible until a user finds them.

I think about the car wash story that circulated a while back. Someone asked an AI assistant to navigate them to the nearest car wash, and the model gave walking directions. Fifty meters. The model couldn’t reason about the context — that someone asking about a car wash is almost certainly in a car, and nobody walks their car to a car wash. To the model, it was a navigation question. It answered the navigation question. It was completely wrong about what the user actually needed.

It’s funny as an anecdote. It’s less funny when your customer support agent, your document processor, or your financial analysis tool does the same thing to real users.

Why this is an evaluation problem

Here’s the structural issue. The third zone — the unmapped dangerous zone — is invisible to standard evaluation because you don’t know how to test it.

Most teams build their eval sets before deployment. You think about what the system should do, you write test cases, you run them against the system, you measure quality. What you’re measuring is performance on the tasks you thought of. The jagged frontier bites you on the tasks you didn’t think of.

This is another reason why static gold datasets fail — I’ve written about this before. Gold datasets can only cover the frontier as it existed when you built them. But the frontier shifts with every model update. A task that was inside your capability boundary with GPT-4 might be outside it with GPT-4o. Or vice versa. Your static benchmarks tell you nothing about the new cliffs.

And it gets worse with long-running workflows. Atomic tasks are easier to evaluate — you get an output, you check it. Multi-step workflows can look fine at every intermediate stage and fail at the aggregate. The jagged frontier can bite you at step three of a six-step process, and you won’t see it in your step-level metrics. You’ll see it in your support tickets.
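One way to see how healthy-looking steps can add up to an unhealthy workflow is a back-of-the-envelope calculation. This is a minimal sketch, assuming each step succeeds independently and using a made-up 95% per-step success rate. Real workflows have correlated failures, so treat it as intuition, not a model.

```python
def end_to_end_success(step_success_rates):
    """Probability that every step succeeds, assuming steps fail independently."""
    total = 1.0
    for rate in step_success_rates:
        total *= rate
    return total

# Hypothetical six-step workflow where every step looks fine on its own dashboard.
steps = [0.95] * 6
print(f"{end_to_end_success(steps):.3f}")  # prints 0.735
```

Under those assumptions, roughly one run in four fails somewhere, even though no single step-level metric looks alarming.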

Example: call centers

Call centers have been dealing with a version of this problem for decades. Not AI, but agents — human ones — whose competence is inconsistent and whose performance needs to be measured task by task, not just on average.

Good call center operations don’t measure “are our agents generally good.” They measure resolution rate per call type, handle time per call type, customer satisfaction per call type, escalation rate per call type. Per agent. Per shift. Per product line. They know, with specificity, which agents handle billing disputes well and which ones need support for technical escalations. The evaluation infrastructure is granular enough to find the edges.

This is what good AI evaluation looks like. Not “our system is generally performing at 87% quality” — that number hides the jagged frontier. You need quality broken down by task type, by input category, by workflow stage. You need to know where the peaks are and where the cliffs are. And you need to update that map every time the system changes, because the frontier shifts.
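To make the point about aggregates concrete, here is a small sketch of a per-task breakdown. The task names and counts are invented for illustration; the only thing taken from the text is the 87% headline number.

```python
from collections import defaultdict

# Hypothetical eval results as (task_type, passed) pairs. All counts are made up.
results = (
    [("billing_faq", True)] * 58 + [("billing_faq", False)] * 2
    + [("refund_policy", True)] * 27 + [("refund_policy", False)] * 3
    + [("contract_edge_case", True)] * 2 + [("contract_edge_case", False)] * 8
)

def pass_rate_by_task(rows):
    """Return {task_type: pass_rate} from (task_type, passed) pairs."""
    counts = defaultdict(lambda: [0, 0])  # task_type -> [passed, total]
    for task_type, passed in rows:
        counts[task_type][0] += int(passed)
        counts[task_type][1] += 1
    return {task: p / n for task, (p, n) in counts.items()}

overall = sum(passed for _, passed in results) / len(results)
print(f"overall: {overall:.0%}")
# List the weakest task types first, since those are the cliffs.
for task, rate in sorted(pass_rate_by_task(results).items(), key=lambda kv: kv[1]):
    print(f"{task}: {rate:.0%}")
```

The headline says 87%. The breakdown says one task type passes only two times out of ten, which is the number you actually needed.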

The challenge is that most teams don’t have this infrastructure. They have aggregate metrics. Maybe a few spot checks. An eval set that hasn’t been refreshed in three months. When a failure surfaces, they discover it the way call centers discovered problems before they had analytics — through a supervisor listening in on a bad call, or through a customer who complained loudly enough.

What grounded evaluation requires

Mollick uses the phrase “grounded quality definitions” in his work on management as an AI superpower, and it’s worth unpacking. A grounded definition ties quality to real outcomes, not to the output’s surface characteristics. Not “did the response sound confident” or “was the response coherent” but “did the user accomplish what they were trying to accomplish” or “was the calculation correct.”

This turns out to be harder than it sounds, and it’s where most eval infrastructure falls down. It’s easy to build evals that measure proxy metrics — coherence, length, similarity to a reference answer. It’s harder to build evals that measure whether the system is actually doing the right thing.

Two requirements that I keep coming back to. Quality has to be grounded: tied to real outcomes, not proxy signals. And task completion has to be verifiable: you can actually check whether it happened, independent of the system’s confidence. A financial analysis is verifiable if you can check the arithmetic. A navigation response is verifiable if you can test whether the route makes sense for someone in a car. A customer support response is verifiable if you can check whether the underlying issue got resolved.

Where these two conditions hold, you can build real evaluation infrastructure. Where they don’t — where quality is subjective or completion is hard to verify — you’re flying blind, and the jagged frontier will find the edges of your understanding before your evals do.

The frontier shifts

One more thing that makes this hard: the frontier isn’t static.

Every model update moves it. Some capabilities improve. Some degrade. Some new failure modes appear that didn’t exist before. The BCG study was run at a specific point in time with specific models. The specific numbers (40% quality improvement, 19% more errors) are already dated. The structural insight — that AI performance is jagged, with benefits and risks that aren’t correlated — will remain true for the foreseeable future, even as the specific contours of the frontier change.

This is why evaluation can’t be a one-time activity. You can’t benchmark your system in January, deploy it, and assume the frontier stays put. It doesn’t. Model updates are the obvious trigger, but user behavior drifts too — the distribution of inputs your system sees in month six is different from what it saw in month one. The task mix shifts. New use cases get discovered. What was inside the frontier may now be outside it, and vice versa.

With Jetty, we’re tackling this with agent runbooks, but we’re not alone in trying to build agentic hill-climbing as a service.

The “fix everything and measure once” assumption is exactly the wrong model. The right model is continuous: define quality, measure it, improve it, measure again. A living process, not a checkpoint.

The question worth sitting with: if the jagged frontier is real, and measurable, and shifts with every model update — what would it actually take to map it before your users discover the cliffs for you?

I don’t think the answer is more dashboards. Dashboards show you aggregate metrics on the tasks you’re already measuring. The dangerous zone is the unmapped territory. Finding it requires deliberately probing the edges — constructing evals for tasks you think are covered but haven’t tested, running them when models update, building the granular per-task quality infrastructure that call centers take for granted.

It’s boring work. It’s not a new feature. Nobody’s going to celebrate the eval suite you built. But it’s the difference between knowing your system’s frontier and hoping you’ve been lucky about where the cliffs are.

Visual workflows are procedural programming in a costume

Jonathan Lebensold — Tue, 31 Mar 2026 13:56:45 GMT

I’m always blown away looking at an agent described as a visual workflow. The boxes and lines expanding and collapsing are a joy to witness, and they scratch the same part of my brain that obsessed over tech trees in real-time strategy games. A system that answers questions is defined as dozens of lines and boxes that culminate in a summary in a markdown file.

By the time it’s built, the canvas looks like a circuit board designed by someone having a bad day. Forty-something nodes. Arrows crossing arrows.

You can see everything while simultaneously understanding almost nothing.

Boxes and arrows are just code in a costume

Here’s something that took me too long to realize about visual AI workflow builders. Strip away the drag-and-drop interface and what’s underneath is a control flow graph. Boxes are functions. Arrows are return values. Conditionals are if-statements. Retry loops are while-loops.

It’s procedural programming. The same paradigm as Python or TypeScript, just rendered as a diagram instead of text.
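Here is the mapping written out, as a rough sketch. The `generate` and `score` callables are hypothetical stand-ins for a prompt node and an evaluation node; nothing here is any particular tool’s API.

```python
def run_with_retry(generate, score, threshold=0.8, max_attempts=3):
    """The control flow a node graph encodes, written as an ordinary loop."""
    attempts = 0
    while attempts < max_attempts:        # retry loop is a while-loop
        attempts += 1
        output = generate(attempts)       # a box is a function call
        if score(output) >= threshold:    # a conditional node is an if-statement
            return output, attempts       # an arrow out is a return value
    return None, attempts                 # the "retries exhausted" branch

# Stand-ins so the sketch runs; a real system would call a model here.
def fake_generate(attempt):
    return f"draft {attempt}"

def fake_score(text):
    return 0.9 if text == "draft 2" else 0.5

print(run_with_retry(fake_generate, fake_score))  # prints ('draft 2', 2)
```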

This matters because the visual layer doesn’t change the paradigm. It changes the representation. And the new representation is worse in almost every way that matters for production systems: harder to version control, harder to diff, harder to review, harder to refactor, harder to compose.

Over 25 years ago, L. Peter Deutsch watched a talk on visual programming and made an observation that became known as the Deutsch Limit: “The problem with visual programming is that you can’t have more than 50 visual primitives on the screen at the same time. How are you going to write an operating system?”

He was talking about general-purpose software. AI workflows hit that limit much faster. A moderately complex agent with tool calls, branching logic, evaluation steps, retry loops, and error handling blows past 50 nodes before you’ve handled the happy path. The visual representation that was supposed to make the system legible makes it less legible. The diagram becomes the thing you need a diagram to explain.

The shift nobody is talking about

The software industry has been through this exact transition before. More than once.

When you write SELECT * FROM orders WHERE total > 100, you don’t specify which index to use. You don’t tell the database whether to do a sequential scan or a hash join. You don’t manage memory allocation or disk I/O. You describe the result you want, and the query optimizer — decades of engineering condensed into a planner — figures out the execution path. This is what makes SQL so durable. The same query runs on SQLite and on a distributed Snowflake cluster. The what stays the same. The how adapts to the context.

Terraform did the same thing for infrastructure. Before Terraform, deploying infrastructure meant writing imperative scripts: create this server, configure this load balancer, attach this security group, in this order, and hope nothing fails halfway through. Terraform replaced all of that with a declaration: “I want 5 instances behind a load balancer in us-east-1.” The system reads your desired state, compares it to reality, and converges.

Kubernetes did it for container orchestration. “I want 3 replicas of this service, always.” Not “launch a container, check if it’s healthy, restart it if it crashes, scale up if load increases.” You declare the outcome. The system maintains it.

Every one of these transitions followed the same arc: procedural tools that worked fine at small scale became unmanageable as complexity grew. Declarative tools replaced them not by doing the same thing with a nicer interface, but by operating at a different level of abstraction.

What’s the word for the version of this shift in AI evaluation? I think it’s happening right now, and most people are building on the wrong side of it.

The plumbing problem

Now look at how visual AI workflow tools handle evaluation. Take a concrete example: Vellum’s recommended architecture for a RAG evaluation loop. You wire a Prompt Node to a Guardrail Node that runs an evaluation metric at runtime — say, Ragas Faithfulness. If the score is below your threshold, you route the failure path through a Conditional Node back to the Prompt Node for another attempt. You need a Templating Node to track the Prompt Node’s execution counter so the loop doesn’t run forever. You add a second Conditional branch for when retries are exhausted. You add a Try adornment on the Prompt Node to expose an error output for non-deterministic failures. You wire the success path through a Merge Node to a Final Output Node.

Example of an N8N workflow for scheduling a calendar invite.

That’s seven node types — Prompt, Guardrail, Conditional, Templating, Merge, Final Output, and Error — plus adornments, for a single evaluate-and-retry loop on a single output. And this is the recommended pattern. If you want to evaluate multiple dimensions (factual accuracy and tone and completeness), each dimension needs its own Guardrail Node, its own branch, its own merge. The graph doesn’t grow linearly. It grows combinatorially.

Other tools have the same problem. On one community forum, a user asked how to implement a conditional while loop — a basic evaluation retry. The answer: chain a Loop node to an IF node to a Wait node, wire the true path back to the loop, and “be very careful on trigger infinite loop.” That’s the level of primitive you’re working with.

You’ve specified the procedure, step by excruciating step. If you want to change the evaluation criteria, you’re re-wiring nodes. If you want to add a rubric dimension, you’re adding boxes and arrows. If you want to change the retry strategy, you’re restructuring the graph.

That’s not evaluation. That’s plumbing. And just like imperative deployment scripts before Terraform, the plumbing obscures the thing that actually matters: what does “good” look like?

What declarative evaluation looks like

Here’s an evaluation specification from a Jetty runbook I use in production:

The output must contain these 7 files. Each file must pass schema validation. The narrative summary must score 7+ against these rubric dimensions: factual accuracy, clinical relevance, actionable recommendations, appropriate tone. If it scores below 7, identify the weakest dimension, revise, and re-evaluate. Maximum 3 iterations.

That’s the entire evaluation logic. Twelve lines of markdown. No boxes. No arrows. No retry-loop wiring. No conditional branches.

The agent reads this and does what a competent human would: produces the output, checks it against the criteria, fixes what’s weak, and iterates until the bar is met or the budget is exhausted. The what is specified precisely. The how is left to the executor.

Compare this to the visual workflow version of the same logic. I counted the nodes in a visual implementation of this evaluation loop: 23 nodes, 31 connections, and I still hadn’t handled the case where schema validation fails on a different file than the one the rubric scored lowest.

The declarative version is not just shorter. It’s a different kind of thing. It’s a specification, not a procedure. And specifications have properties that procedures don’t.

Specifications compose. Procedures tangle.

Want to add a new rubric dimension to a declarative evaluation? Add a line. Want to change the iteration limit? Change a number. Want to swap the underlying model? Change a parameter. Want to apply the same evaluation criteria to a different pipeline? Copy the specification and change the input path.

Try any of these in a visual workflow builder. Adding a rubric dimension means adding evaluation nodes, re-wiring branches, and testing the new graph. Changing the iteration limit means restructuring the retry loop. Swapping the model might mean replacing nodes that have different input/output schemas. Applying the evaluation to a different pipeline means... rebuilding the workflow from scratch, because the node graph doesn’t separate the evaluation logic from the execution context.
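The one-line edits described above can be sketched by treating a specification as plain data. This is an illustration only: the `EvalSpec` class, its field names, the paths, and the model names are all hypothetical, and a runbook expresses the same thing in markdown.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical structured form of a declarative evaluation spec."""
    input_path: str
    rubric_dimensions: tuple
    pass_score: int = 7
    max_iterations: int = 3
    model: str = "model-a"  # placeholder name, not a real model id

base = EvalSpec(
    input_path="outputs/clinic_summary/",
    rubric_dimensions=(
        "factual accuracy",
        "clinical relevance",
        "actionable recommendations",
        "appropriate tone",
    ),
)

# Each change is a one-field edit; nothing else in the spec moves.
with_completeness = replace(
    base, rubric_dimensions=base.rubric_dimensions + ("completeness",)
)
more_patient = replace(base, max_iterations=5)
other_model = replace(base, model="model-b")
other_pipeline = replace(base, input_path="outputs/discharge_notes/")

print(with_completeness.rubric_dimensions[-1], more_patient.max_iterations)
```

Because the quality bar is separate from the execution context, reusing it for another pipeline is a copy with one field changed.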

This is why nobody writes Terraform in a drag-and-drop GUI. The text representation is more powerful, not less. Not because text is inherently better than graphics — but because declarative text separates what from how in a way that visual procedures cannot.

SQL views compose because each one is a self-contained query with declared dependencies. Terraform modules compose because each one is a self-contained state declaration. Kubernetes manifests compose because each one is a self-contained desired-state spec. Evaluation specifications compose because each one is a self-contained quality bar.

Visual workflow nodes don’t compose. They connect. And connections are fragile. Move one node and three arrows break.

The version control problem is fatal

Here’s the tell that the visual approach has a fundamental problem: every serious visual workflow tool eventually builds a text-based SDK.

One platform launched a Workflows SDK with a CLI that does “bi-directional syncing” between the canvas and code. Another open-sourced a text format alongside its visual editor. A third supports workflow-as-code. They all end up in the same place: acknowledging that the visual representation isn’t the source of truth. It’s a rendering of the source of truth. And the source of truth is text.

The reason is version control. Visual workflows serialize to JSON blobs — hundreds or thousands of lines of auto-generated coordinates, node IDs, and connection metadata. You cannot meaningfully diff them. You cannot code review them. You cannot grep for the line where the evaluation threshold is defined. When two people edit the same workflow, you cannot merge their changes.
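A toy comparison shows the shape of the problem. The canvas export format below is invented for illustration and is far smaller than a real one; the sketch only assumes that layout coordinates live in the same file as the logic.

```python
import difflib
import json

def changed_lines(before, after):
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

# Hypothetical canvas export: layout metadata mixed in with the logic.
canvas_v1 = {"nodes": [
    {"id": "n1", "type": "prompt", "x": 120, "y": 80},
    {"id": "n2", "type": "guardrail", "x": 340, "y": 80, "threshold": 0.7},
    {"id": "n3", "type": "output", "x": 560, "y": 80},
]}
# The edit a reviewer cares about (threshold 0.7 to 0.8), plus nodes dragged around.
canvas_v2 = {"nodes": [
    {"id": "n1", "type": "prompt", "x": 100, "y": 60},
    {"id": "n2", "type": "guardrail", "x": 310, "y": 140, "threshold": 0.8},
    {"id": "n3", "type": "output", "x": 590, "y": 95},
]}

spec_v1 = "Faithfulness must score 0.7 or higher.\nMaximum 3 iterations."
spec_v2 = "Faithfulness must score 0.8 or higher.\nMaximum 3 iterations."

canvas_changes = changed_lines(
    json.dumps(canvas_v1, indent=2), json.dumps(canvas_v2, indent=2)
)
spec_changes = changed_lines(spec_v1, spec_v2)
print(len(canvas_changes), "changed lines in the canvas export")
print(len(spec_changes), "changed lines in the text spec")
```

The threshold change is in both diffs. In the canvas export it is buried among coordinate changes that mean nothing to a reviewer.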

The right abstraction to visualize is outcomes: what does the evaluation measure, what’s the quality bar, and what are the iteration bounds. The procedure is an implementation detail the system should handle, the same way a SQL query optimizer handles join ordering.

Promptfoo seems to get this. Its evaluation framework uses declarative YAML — you specify assertions (string matching, LLM-as-judge rubrics, schema validation) and it handles execution. It’s closer to a testing DSL than a visual workflow. That’s the direction the ecosystem should be heading.

The parallel to testing

This connects to something I’ve been writing about for a while. CI/CD worked because it made every change small, testable, and reviewable. The CI loop depends on artifacts that diff cleanly: code in text files, tests in text files, configuration in text files.

Visual workflow builders fight this at every level. Changes don’t diff. Reviews are “look at my screen and tell me if this graph looks right.” Rollbacks mean “restore the previous version of an opaque JSON blob.” The very properties that made CI/CD transformative — small changes, clean diffs, automated testing, code review — are the ones that visual workflows undermine.

Declarative evaluation aligns with CI because it produces the same kind of artifacts. A runbook is a text file. It diffs. It reviews. It lives in a PR. When you change the evaluation criteria, the diff shows exactly what changed and nothing else. When you add a rubric dimension, the reviewer can assess whether it makes sense without understanding the execution plumbing.

This is the same advantage Terraform has over bash deployment scripts, and SQL has over hand-rolled data-processing code. The artifact is reviewable because it describes intent, not mechanism.

Procedural visual tools will always address an important segment of the market. Some tasks genuinely need step-by-step control — data plumbing, API integrations, deterministic pipelines where every step is known in advance. That’s real, and visual builders serve it well.

But most AI tasks aren’t like that. Most AI tasks are measured by their outcome, not by whether you followed the right steps. Did the summary capture the key points? Did the generated image match the brand guidelines? Did the evaluation catch the regression? These are outcome specifications, not procedure specifications. It’s the difference between listing the ingredients for a dish and prescribing exactly which store to visit, which aisle to walk down, and which hand to reach with.

The most sophisticated agent orchestration I’ve seen isn’t a graph. It’s a markdown file with a rubric, an output manifest, and a verification script. It version-controls perfectly. It diffs in a PR. It’s human-readable and machine-executable. It composes by copy-paste and edit, the way SQL views do.

The format sounds primitive, but so did SELECT * FROM when compared to a visual query builder. The power was never in the interface. It was in the abstraction.

Where visual workflows make sense

It’s not all bad! There are some cases where a visual workflow tool can be a huge productivity gain. MaxMSP is a fantastic example of how a visual workflow tool can actually increase visibility into what’s going on under the hood. Tools like After Effects, Blender and many shader mapping interfaces for game designers are also good examples. But I think in each of these instances, an expert has to be prepared to learn and adopt the visual programming environment wholesale—and they still face an upper bound in terms of how much complexity any of these visualizations can communicate to humans.

Runbooks: what agents need to hill-climb

Jonathan Lebensold — Fri, 27 Mar 2026 01:59:33 GMT

Last month I watched an agent run a six-step evaluation pipeline. It called the right APIs. It generated SQL that was mostly correct. It even caught a schema error and fixed it on the second try. Then it wrote a summary, declared the task complete, and stopped.

It had skipped two of the six steps entirely. The output directory was missing three of five required files. The summary confidently described results from steps that never ran.

This is the failure mode nobody talks about. Not “the agent can’t do the task.” The agent can do the task. It just doesn’t finish it. It encounters an error on step four, routes around it, produces whatever it can, and wraps up with the confidence of someone who definitely completed all the work.

If you’ve built anything non-trivial with coding agents, you’ve seen this. The agent is capable but unreliable. It needs something between a one-line instruction and a shell script. It needs what I’ve started calling a runbook.

Skills, workflows, and the gap between them

Most agent tooling falls into two buckets.

Skills are single-turn instructions. “Here’s how to call the Jetty API.” “Here’s how to query Snowflake.” They’re reference cards. Useful, but they don’t handle multi-step processes where the output of step three determines what you do in step four.

Workflows are fixed pipelines. Step A feeds step B feeds step C. They’re deterministic, which is their strength and their limitation. When the task requires judgment — “is this SQL output correct?” or “does this image match the brand guidelines?” — a workflow can’t adapt.

The pipeline I opened with is neither. It’s not a skill. It’s not a workflow. It’s a process that requires both execution and judgment, with the ability to recover when things go wrong.

What a runbook actually is

A runbook is a structured markdown document that tells an agent how to accomplish an outcome. Not a procedure to follow blindly. An outcome to achieve, with enough guidance to get there.

The distinction matters. A shell script says “run these commands in this order.” A runbook says “here’s what must be true when you’re done, here’s how to evaluate whether you’re there, and here’s what to do when you’re not.”

The critical sections, in order:

An objective. Two to five sentences that answer: what am I doing, what am I producing, and for whom? The agent should be able to read this in ten seconds and orient.

An output manifest. The exact files that must exist when the task is complete. This is deliberately aggressive:

You MUST write all of the following files. The task is NOT complete until every file exists and is non-empty. No exceptions.

That tone exists for a reason. Agents are polite. They want to wrap up gracefully even when they haven’t finished. The manifest is a forcing function against the premature completion problem I described at the top.

Evaluation criteria. How the agent knows whether its output is good enough. This is the section that separates a runbook from a to-do list.

An iteration loop. What to do when evaluation fails. Try again, but differently, and with a cap on how many times.

A final checklist with a verification script. A bash script that checks every output file exists and is non-empty, plus a prose checklist the agent walks through before declaring completion.

That last part — the script — is the only reliable way to prevent the failure I opened with. Without it, the agent will skip steps and tell you everything went great.
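The check itself is small. My runbooks use a bash script for it; the sketch below does the same job in Python so it can sit alongside the other examples. The file names are hypothetical.

```python
from pathlib import Path
import tempfile

def missing_or_empty(output_dir, required_files):
    """Return the required files that are absent or zero bytes."""
    base = Path(output_dir)
    problems = []
    for name in required_files:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems

# Hypothetical manifest for illustration.
REQUIRED = ["summary.md", "results.json", "queries.sql"]

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "summary.md").write_text("# Summary\n")
    (Path(tmp) / "results.json").write_text("")  # exists, but empty
    problems = missing_or_empty(tmp, REQUIRED)

print(problems)  # prints ['results.json', 'queries.sql']
```

In a real runbook the script would exit non-zero when the list is non-empty, so the agent can’t declare success without noticing.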

The hill-climbing loop

Every runbook contains at least one judge-refine-rejudge cycle. The agent produces output, evaluates it against criteria, and iterates if it falls short.

This is the same hill-climbing pattern that works in optimization: define a quality bar, measure against it, improve the weakest dimension, measure again. The runbook just makes it explicit and bounded.

Bounded is the key word. Without a cap, agents will iterate forever or give up after one attempt. Three rounds is the sweet spot I’ve landed on. Enough to converge on most issues, not enough to burn through your API budget on a lost cause. The runbook specifies what happens when you hit the ceiling: keep the best attempt, flag for human review, or both.
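A minimal sketch of that bounded loop, assuming the agent’s behavior can be reduced to three hypothetical callables: `produce`, `score`, and `improve`. The toy scores are invented so the example hits the ceiling.

```python
def bounded_refine(produce, score, improve, pass_score, max_rounds=3):
    """Judge, refine, rejudge, with a hard cap and a fallback to the best attempt."""
    candidate = produce()
    best, best_score = candidate, float("-inf")
    for round_number in range(1, max_rounds + 1):
        current = score(candidate)
        if current > best_score:
            best, best_score = candidate, current
        if current >= pass_score:
            return {"output": candidate, "score": current,
                    "rounds": round_number, "needs_review": False}
        if round_number < max_rounds:
            candidate = improve(candidate)
    # Ceiling reached: keep the best attempt and flag it for a human.
    return {"output": best, "score": best_score,
            "rounds": max_rounds, "needs_review": True}

# Toy stand-ins: attempts are indexes into a list of made-up scores.
TOY_SCORES = [2.0, 3.5, 3.0]
result = bounded_refine(
    produce=lambda: 0,
    score=lambda i: TOY_SCORES[i],
    improve=lambda i: i + 1,
    pass_score=4.0,
)
print(result)
```

Here the third attempt is worse than the second, so the loop returns attempt 1 (score 3.5) with `needs_review` set, instead of whatever happened to come last.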

Two evaluation patterns cover almost everything:

Programmatic validation for structured output. Does the JSON schema-validate? Does the SQL execute? Do the tests pass? Error messages are specific and actionable, so the agent converges in one or two rounds.

Rubric-based judgment for creative or complex output. Score against multiple criteria on a 1-5 scale, with a pass threshold (like “overall >= 4.0, no criterion below 3”). The agent identifies the weakest criterion and makes targeted improvements. A “Common Fixes” table maps failures to concrete actions — this is where you encode the domain expertise that prevents the agent from thrashing.

The pattern you choose depends on the output. Don’t rubric-score a JSON file. Don’t schema-validate a marketing graphic.
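The rubric threshold is easy to get subtly wrong, so here is a sketch of one reading of it. I’m assuming “overall” means the unweighted mean of the criteria; the criterion names and the fixes table are made up.

```python
def rubric_verdict(scores, overall_min=4.0, criterion_min=3):
    """Apply 'overall >= 4.0, no criterion below 3' to 1-5 scores."""
    overall = sum(scores.values()) / len(scores)
    weakest = min(scores, key=scores.get)
    passed = overall >= overall_min and scores[weakest] >= criterion_min
    return {"passed": passed, "overall": overall, "weakest": weakest}

# Hypothetical "Common Fixes" table mapping a weak criterion to a concrete action.
COMMON_FIXES = {
    "brand fit": "Re-check the palette and logo placement against the guidelines.",
    "clarity": "Cut the headline to one idea.",
}

scores = {"clarity": 5, "accuracy": 5, "tone": 5, "brand fit": 2}
verdict = rubric_verdict(scores)
print(verdict["passed"], verdict["overall"], verdict["weakest"])
print(COMMON_FIXES.get(verdict["weakest"], "No canned fix; revise the weakest criterion."))
```

The mean here is 4.25, which clears the overall bar, but the output still fails because one criterion is below 3. That floor is what stops a strong average from hiding a broken dimension.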

The new-hire test

Here’s the mental model I use. A runbook is what you’d write for a competent new hire who needs to run your pipeline while you’re on vacation.

You wouldn’t write a shell script. Too brittle — the first unexpected error kills it. You wouldn’t just say “figure it out.” Too vague — they’ll make assumptions you’d never make. You’d write something in between: the process with enough detail to recover from common failures and enough latitude to adapt when something unexpected happens.

You’d include the API calls they’ll need, with examples. You’d describe what good output looks like. You’d list the things that commonly go wrong and how to fix them. You’d tell them exactly which files to produce and how to verify they’re correct before calling it done.

That’s a runbook. The agent is the new hire. The markdown is the document you leave behind.

Tips are earned, not invented

The last section of every runbook is “Tips.” These aren’t generic best practices. They’re hard-won operational knowledge from watching agents actually run the process and fail.

Things like: “Langfuse auth uses HTTP Basic, not Bearer — agents default to Bearer and get a confusing 401.” Or: “Snowflake function names differ from Spark. If the SQL references ARRAY_AGG, the agent will need to use ARRAY_CONSTRUCT instead.”

These accumulate over time. Each failed run teaches you something the next version of the runbook should encode. The tips section is the runbook’s institutional memory — the things you’d tell the new hire over coffee that aren’t in any documentation.

Try this now

This isn’t theoretical. The Jetty agent-skill ships with tooling for running and validating runbooks.

validate-runbook.sh checks structural completeness without executing anything. It tells you whether your runbook has all the required sections, whether your template variables are declared, whether your evaluation criteria exist. Think of it as a linter for operational documents.

run-runbook.sh reads a parameters JSON, injects template variables, and invokes the agent with the runbook as its instruction set. It supports a --dry-run mode where the agent reads the runbook and produces an execution plan without making any API calls — useful when the pipeline involves expensive operations like image generation or database queries.
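To give a feel for what a structural check involves, here is a rough sketch of a section linter. It is not the actual validate-runbook.sh: the section names come from this post, and the assumption that each section is a markdown `#` heading with exactly that title is mine.

```python
import re

# Section names taken from this post; the heading convention is an assumption.
REQUIRED_SECTIONS = [
    "Objective", "Output Manifest", "Evaluation Criteria",
    "Iteration Loop", "Final Checklist", "Tips",
]

def missing_sections(markdown_text, required=REQUIRED_SECTIONS):
    """Return required section names that have no matching markdown heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
        )
    }
    return [name for name in required if name.lower() not in headings]

draft = """# Nightly eval runbook
## Objective
Score the latest batch.
## Output Manifest
You MUST write report.md.
## Tips
Check auth first.
"""
print(missing_sections(draft))
# prints ['Evaluation Criteria', 'Iteration Loop', 'Final Checklist']
```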

The barrier to entry is: write a markdown file with the sections above, validate it, and run it. If you’ve already got a process you run manually or a pipeline that an agent keeps botching, that’s your first runbook.

The contrarian bet

There’s an irony here. While the industry is building increasingly complex agent frameworks — tool chains, memory systems, multi-agent orchestration, graph-based planners — the most reliable guidance mechanism I’ve found is a well-structured markdown file with a verification script at the bottom.

Markdown is plain text. It works with every agent. It’s version-controlled. It’s diffable. It’s readable by humans and machines. It doesn’t require a runtime, a framework, or a dependency. You can review it in a PR.

The sophistication isn’t in the format. It’s in what the document encodes: clear evaluation criteria, bounded iteration, concrete output requirements, and operational knowledge from real failures. That’s what makes an agent reliable. Not the orchestration layer. The quality of the instructions.

I suspect this will seem obvious in retrospect. The same way “just write tests” seems obvious now but was a hard sell in 2005. The discipline is in writing down what “done” looks like before you start — and giving the agent a way to check its own work.

What’s less obvious is where the ceiling is. How complex can the task be before a single markdown file stops being sufficient? I don’t know yet. But the tasks I’ve thrown at runbooks — evaluation pipelines, data ingestion, brand compliance checking, regression testing — keep working. The format scales further than I expected.

The question I’m sitting with: if the best agent orchestration is a document with a rubric and a bash script, what does that tell us about where the real leverage is in AI systems?

Generation Got Cheap. Verification Didn't.

Jonathan Lebensold — Wed, 11 Mar 2026 17:06:50 GMT

Last spring, shipping with GPT-4o meant budgeting $5 per million input tokens. Enterprise teams paying for Claude Opus were spending $15 input, $75 output. Twelve months later, GPT-5.2 sells for $1.75 input. Anthropic slashed Opus by 67%. Google’s Flash-Lite is down to $0.075 per million tokens. DeepSeek cut prices by half and now lists $0.28 input.

The MTok crash is real. Across every weight class, generation costs fell 30–80% in a year.

But it misses the harder question. If generating output is approaching commodity pricing, what’s still expensive?

The cost nobody talks about

A team I work with runs a support chatbot. Last year it cost them about $18,000 a month on GPT-4o. When GPT-5.2 dropped at a third of the price, they migrated. The inference bill fell to roughly $6,000.

What didn’t change: the three engineers who review escalated cases. The QA process that spot-checks 2% of responses. The weekly meeting where someone eyeballs a dashboard and says “looks fine.” The verification layer stayed exactly the same size while the output tripled.

They didn’t save $12,000. They took $12,000 worth of increased risk.

This is happening everywhere. A recent paper from MIT formalizes what practitioners already feel. Christian Catalini, Xiang Hui, and Jane Wu model the AI transition as a collision between two cost curves:

The cost to generate (what they call c_A) is driven by compute and accumulated knowledge. As both scale, the cost drops exponentially. That’s the MTok crash in the table above.

The cost to verify (c_H) is driven by human time, feedback latency, and expertise. It’s bounded by biology. An experienced engineer can review traces faster than a junior one, but experienced engineers are scarce, and their wages rise with scarcity. The paper calls this “verification cost disease”: even when experts get more efficient, verification gets more expensive because the demand for their judgment grows faster than the supply.

These curves are diverging. Generation costs collapse. Verification costs stay flat or rise. The gap between them is widening.

The gap has a name

Catalini et al. call it the Measurability Gap: the growing share of tasks where machines can cheaply generate output that humans cannot affordably verify. Their argument is that this gap, not the price of tokens, is the binding constraint on productive AI deployment.

Think about it in terms of your own system. Cheap tokens don’t just save money. They change what’s economically viable to automate. When GPT-4o cost $5 per million input tokens, teams were selective about which tasks they automated. They routed FAQ queries to the model and left complex cases to humans. The generation cost acted as a natural filter.

At $1.75 per million tokens, the filter dissolves. Teams start routing everything through the model. Not just FAQs but nuanced customer complaints, refund decisions, edge cases that used to get flagged for review. Each individual decision is defensible: the model handles it well enough, and it’s so cheap there’s no cost argument against it.

But “handles it well enough” is an impression, not a measurement. Nobody increased the verification budget to match the expanded scope. The team still reviews the same 2% sample. The same three engineers attend the same weekly meeting. And the fraction of output that’s actually verified shrinks with every new task the model picks up.
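A rough way to see the shrinkage is to hold review capacity fixed and let volume grow. The numbers below are made up for illustration; `REVIEW_CAPACITY` and the weekly volumes are my assumptions, not figures from the paper or from any real team.

```python
def verified_fraction(reviewed_per_week: int, outputs_per_week: int) -> float:
    """Share of model output a human actually looks at, capped at 100%."""
    if outputs_per_week <= 0:
        raise ValueError("outputs_per_week must be positive")
    return min(1.0, reviewed_per_week / outputs_per_week)


# Hypothetical: review capacity stays at 200 items a week while scope expands.
REVIEW_CAPACITY = 200
for volume in (10_000, 30_000, 90_000):
    share = verified_fraction(REVIEW_CAPACITY, volume)
    print(f"{volume:>6} outputs/week -> {share:.2%} verified")
```

The reviewers never slowed down. The denominator grew.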

The paper formalizes this as four zones:

Safe Industrial. Cheap to automate, cheap to verify. This is where AI success stories live. Chatbots answering FAQs. Classification tasks with clear ground truth. Code formatting. The output is easy to check, so automation works.

Runaway Risk. Cheap to automate, expensive to verify. This is the zone that expands as token costs crash. The model can generate the output, but proving it’s correct requires human expertise that doesn’t scale with compute. Legal summaries. Medical triage suggestions. Financial recommendations. Content moderation at volume.

Human Artisan. Expensive to automate, cheap to verify. Humans still do it better, and you can tell when they do. This zone is shrinking as models improve, but it’s where craftspeople live today.

Pure Tacit. Expensive to automate, expensive to verify. Strategy. Judgment under uncertainty. The work that’s hard to even define, let alone evaluate.

The Runaway Risk zone is the one that should worry you. Every time a model gets cheaper, more tasks cross the automation threshold. But the verification threshold doesn’t move. The zone grows.
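To make the four zones concrete, here is a minimal sketch that places a task in a zone by comparing two costs to two thresholds. The thresholds and the example costs are hypothetical, and the hard cutoffs are my simplification, not the paper's formal model.

```python
from enum import Enum


class Zone(Enum):
    SAFE_INDUSTRIAL = "Safe Industrial"
    RUNAWAY_RISK = "Runaway Risk"
    HUMAN_ARTISAN = "Human Artisan"
    PURE_TACIT = "Pure Tacit"


def classify(automation_cost: float, verification_cost: float,
             automation_threshold: float, verification_threshold: float) -> Zone:
    """Place a task in one of the four zones by comparing costs to thresholds."""
    cheap_to_automate = automation_cost <= automation_threshold
    cheap_to_verify = verification_cost <= verification_threshold
    if cheap_to_automate:
        return Zone.SAFE_INDUSTRIAL if cheap_to_verify else Zone.RUNAWAY_RISK
    return Zone.HUMAN_ARTISAN if cheap_to_verify else Zone.PURE_TACIT


# Hypothetical task: a legal summary that is costly to check by hand.
before = classify(automation_cost=8.0, verification_cost=40.0,
                  automation_threshold=5.0, verification_threshold=10.0)
# A price cut drops the automation cost; the verification cost is unchanged.
after = classify(automation_cost=2.0, verification_cost=40.0,
                 automation_threshold=5.0, verification_threshold=10.0)
print(before.value, "->", after.value)
```

Only one input changed between the two calls. The task moved zones anyway, and it moved into the risky one.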

What happens in the gap

I’ve watched this play out in specific ways.

A marketing team I know went from producing 50 assets per campaign to over 4,000. Same team size. Same approval process, in theory. Three people can’t review 4,000 assets. They spot-checked, trusted the prompts, and shipped. When an off-brand image made it into a campaign, nobody could trace it back to the generation run that produced it.

That’s the Runaway Risk zone in action. Generation cost dropped to near zero. Verification didn’t scale to match.

Or take model swaps, the most common optimization move. A new model is cheaper. It scores higher on benchmarks. An engineer tests it against a handful of examples, everything looks good, they ship it. Three weeks later a support ticket arrives for a failure mode the old model handled correctly. Nobody connects the ticket to the swap because it doesn’t look like a regression. It looks like a new bug.

I wrote about this pattern in an earlier piece. Teams optimize individual steps without measuring the whole system. Each change is defensible in isolation. In aggregate, the system isn’t meaningfully different from where it started. Cheaper tokens accelerate this cycle. More model options, more swaps, more lateral moves disguised as progress.

The Catalini paper has a blunt name for unverified output: a “Trojan Horse” externality. It looks like productive work. It satisfies the metrics you’re tracking. But it accumulates hidden risk in the gap between what you can measure and what you can’t.

Where the savings should go

Here’s the math most teams don’t do.

That support bot dropped from $18,000 to $6,000 a month. The $12,000 savings is real. The question is where it goes. Most teams take it as margin. Finance is happy. The line item went down.

But the team also expanded the bot’s scope. It handles 3x more query types. The generation budget decreased while the verification burden increased. If none of that $12,000 flows into evaluation infrastructure, the team hasn’t saved money. They’ve converted an explicit cost (tokens) into an implicit one (undetected failures).

What does reinvesting look like in practice?

Eval suites on production traces. Not the handful of test cases assembled six months ago. Evaluation sets built from what the system actually encounters, refreshed continuously, covering the edge cases that model swaps introduce. If your eval set hasn’t changed since the last model migration, it’s lying to you.

LLM judges for tasks humans can’t review at volume. A second model scoring the first against a written rubric. Does this response match our tone? Did the summary preserve the key facts? Is this classification consistent with how we’ve handled similar cases? Judges don’t replace human review, but they extend it. They catch the mechanical failures before a human ever has to look.

Automated pipelines that flag regressions before users do. The loop I keep coming back to: ingest traces from production, run evaluations against them, surface the findings as concrete changes an engineer can review. Not a dashboard. A diff. Something that lives in the workflow you already have.
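The regression check at the heart of that loop can be small. This is a sketch under stated assumptions, not Jetty's implementation: scores are per-trace numbers between 0 and 1 from whatever evaluator you run, and `find_regressions`, the trace ids, and the 0.05 tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    trace_id: str
    baseline_score: float
    candidate_score: float


def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.05) -> list[Regression]:
    """Return traces whose score dropped by more than `tolerance`."""
    regressions = []
    for trace_id, old_score in baseline.items():
        new_score = candidate.get(trace_id)
        if new_score is None:
            continue  # trace was not re-evaluated; nothing to compare
        if old_score - new_score > tolerance:
            regressions.append(Regression(trace_id, old_score, new_score))
    # Worst drops first, so review starts where it matters most.
    return sorted(regressions, key=lambda r: r.candidate_score - r.baseline_score)


# Hypothetical scores for the same three production traces, before and after a swap.
baseline = {"t1": 0.90, "t2": 0.85, "t3": 0.70}
candidate = {"t1": 0.92, "t2": 0.60, "t3": 0.68}
for r in find_regressions(baseline, candidate):
    print(f"{r.trace_id}: {r.baseline_score:.2f} -> {r.candidate_score:.2f}")
```

The output is a short list of specific traces an engineer can open, which is closer to a diff than to a dashboard.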

This is the work we do at Jetty. We ingest traces from observability platforms, run evals, and produce pull requests with verified improvements. The MTok crash makes this more urgent, not less. When generation was expensive, the verification gap was narrow. Teams automated selectively and could review what they shipped. As generation approaches commodity pricing, the gap widens. The teams that invest in closing it will capture the surplus from every price cut. The teams that don’t will race to the bottom on inference costs and wonder why their systems aren’t getting better.

The question worth asking

The MTok crash didn’t make AI cheap. It made generation cheap. Those are different things.

Verification is the new scarce input. The ability to prove your system is working, to catch regressions when you swap models, to evaluate at the scale of your output rather than the scale of your team. That’s what’s still expensive. And unlike tokens, it doesn’t get cheaper by waiting for the next model release.

Every team I talk to has a version of the same plan: “We’ll optimize our AI stack once we have time.” The MTok crash gives them the budget. But budget without verification infrastructure is just faster generation of output nobody’s checking.

The question isn’t “which model should we use now that everything’s cheaper?” It’s “do we have any way to know if our system is working at the scale we’re running it?”

If the answer is no, cheaper tokens just means you’ll be wrong faster.

How to Reliably Generate Content

Jonathan Lebensold — Thu, 05 Mar 2026 17:56:40 GMT

A marketing team I know used to produce about 50 assets per campaign. Social cards, email variations, ad units. A designer, a copywriter, and a brand manager reviewed every piece. Slow, but reliable.

Last quarter they generated over 4,000 assets.

Same team size. Same approval process — in theory. In practice, three people can’t review 4,000 assets. They spot-checked, trusted the prompts, and shipped.

This is everywhere now. Coca-Cola’s Fizzion platform produces personalized imagery across 100+ markets. Moët Hennessy scaled to 3 million content variations globally. The generation capacity is unbounded. The review capacity hasn’t changed at all.

What happens when nobody’s checking

We don’t have to imagine this. Coca-Cola’s AI-generated holiday campaign drew immediate backlash — “soulless,” “creepy,” characters that looked almost human but not quite. Google Gemini generated historically inaccurate images. Meta launched AI-generated fake profiles and retracted them within days.

These aren’t startups. These are companies with massive brand teams and explicit guidelines. They shipped anyway, because the speed of generation overwhelmed the humans who used to catch these problems.

And a 2025 study from the Nuremberg Institute found that just labeling content as AI-generated makes people view it as less trustworthy. Bad AI content doesn’t just look bad — it actively erodes brand trust.

LLM judges are the only path that scales

If humans can’t review 4,000 assets, the only option is another AI. Uncomfortable conclusion, but it’s arithmetic.

Coca-Cola got here first. Fizzion encodes 140 years of brand rules — the red hue, logo spacing, typography — into machine-readable metadata called StyleID. The AI that generates content is constrained by the AI that evaluates it. It’s now mandatory for all their agency partners.

Jasper’s Brand IQ does something similar: an LLM judge that flags violations and suggests on-brand replacements in real time. Acrolinx scores content against your style guide and tone of voice. These aren’t research projects. They’re production systems.

The pattern is LLM-as-judge: define criteria, feed the output to an evaluator, get a structured score. Does this copy match our voice? Does this image follow our guidelines? Is the tone right for this market? Even if you heavily discount the vendor claims, an automated judge that catches 80% of violations beats the current reality of reviewing 2% and hoping.
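Here is roughly what "define criteria, feed the output to an evaluator, get a structured score" looks like in code. It is a sketch, not any vendor's API: `call_model` is a placeholder for whichever evaluator you use, the JSON reply format is my assumption, and `fake_model` exists only so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResult:
    score: int
    reason: str


def build_judge_prompt(criteria: list[str], content: str) -> str:
    """Turn written criteria into a prompt that asks for a JSON verdict."""
    rules = "\n".join(f"- {c}" for c in criteria)
    return (
        "Score the content from 1 to 10 against these criteria:\n"
        f"{rules}\n\n"
        f"Content:\n{content}\n\n"
        'Reply with JSON only: {"score": <int>, "reason": "<text>"}'
    )


def judge(content: str, criteria: list[str],
          call_model: Callable[[str], str]) -> JudgeResult:
    """Ask the evaluator model for a verdict and validate what comes back."""
    raw = call_model(build_judge_prompt(criteria, content))
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return JudgeResult(score=score, reason=str(data["reason"]))


# Stand-in for a real evaluator call, so the sketch runs offline.
def fake_model(prompt: str) -> str:
    return '{"score": 6, "reason": "Tone is too formal for this market."}'


result = judge("Buy now!", ["Matches our brand voice"], fake_model)
print(result.score, result.reason)
```

Validating the reply matters as much as the prompt. A judge that returns free text is just another thing a human has to read.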

Traceability is the other shoe

There’s a second problem beyond quality: provenance. When you produce 4,000 assets from a base concept, you need to trace any one of them back to the prompt that generated it, the model version, the brand rules, and the judge score it received.

The EU AI Act’s transparency obligations, which apply from August 2026, require AI-generated content to be machine-readably marked and disclosed. Penalties: up to 15 million euros or 3% of global turnover. Google’s SynthID has already watermarked over 20 billion pieces of AI content. The C2PA standard embeds provenance metadata directly into files.

But watermarking solves attribution, not compliance. The full audit trail needs to connect generation to evaluation to deployment, with every decision logged. If you’ve built production AI systems, this should sound familiar — it’s the same observability infrastructure any LLM pipeline needs.
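A minimal version of that audit trail is one record per asset, holding the four things you need to trace back to plus its deployment state. The field names, the placeholder version strings, and the in-memory `audit_log` are assumptions for illustration; a real system would write to durable storage.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AssetRecord:
    asset_id: str
    prompt: str
    model_version: str
    brand_rules_version: str
    judge_score: float
    deployed: bool


def fingerprint(record: AssetRecord) -> str:
    """Stable SHA-256 hash of the record's fields."""
    payload = json.dumps(asdict(record), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


audit_log: dict[str, AssetRecord] = {}


def log_asset(record: AssetRecord) -> str:
    """Store the record by asset id and return its fingerprint."""
    audit_log[record.asset_id] = record
    return fingerprint(record)


record = AssetRecord(
    asset_id="asset-0001",
    prompt="Bold linocut block print of a lighthouse",
    model_version="image-model-v1",        # placeholder name
    brand_rules_version="rules-2026-03",   # placeholder name
    judge_score=8.5,
    deployed=False,
)
digest = log_asset(record)
print(audit_log["asset-0001"].model_version, digest[:12])
```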

The hard question

Can LLM judges reliably catch the brand violations that actually matter?

Wrong logo, off-palette colours — easy. Any prompted evaluator catches those. The hard stuff is a headline that’s technically on-brand but tonally wrong for the British market. An image that follows every guideline but feels cheap. Copy that sounds robotic despite using approved terminology.

These are judgment calls from years of accumulated context — exactly the kind of tacit knowledge that’s hardest to encode, and exactly the kind of failure that does the most damage.

The teams doing this well use judges as a filter, not a replacement. Catch the mechanical violations automatically. Route flagged assets to humans with specific notes. The judge says: “This scored 6/10 on brand voice. Here’s why.” Automated tests catch regressions; code review catches design issues. You don’t skip either one, and you don’t run the automated tests by hand.
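The filter itself is a few lines. In this sketch the cutoff of 8 is an arbitrary assumption you would tune against your own review data, and `Verdict` and `route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    asset_id: str
    score: int   # 1-10, as returned by the judge
    notes: str


def route(verdict: Verdict, approve_at: int = 8) -> str:
    """Auto-approve high scores; send the rest to a person with the judge's notes."""
    if verdict.score >= approve_at:
        return f"{verdict.asset_id}: auto-approved"
    return (f"{verdict.asset_id}: human review "
            f"(scored {verdict.score}/10 - {verdict.notes})")


print(route(Verdict("ad-17", 9, "On palette, on voice.")))
print(route(Verdict("ad-18", 6, "Brand voice reads too formal.")))
```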

This is what we build at Jetty. You define your brand rules as judge criteria, point it at your content pipeline, and the LLM judge scores every asset — then hill-climbs to improve the ones that fall short. Minutes to set up, not months.

The content industry built incredible generation capacity and forgot to build the CI pipeline. The teams that close that gap first won’t just generate better content — they’ll be able to prove it’s on-brand at a volume where proof actually matters.

The rest will keep spot-checking 2% and waiting for the next PR crisis.

How LLM Judges Make AI Stop Looking Generic

Jonathan Lebensold — Mon, 02 Mar 2026 12:41:30 GMT

I needed brand-consistent illustrations. Not stock photos. Not whatever Midjourney feels like on a given day. Images that match a specific style guide: navy-and-cream linocut prints, or flat vector pelicans on dark blue. Visual consistency that makes a brand feel intentional.

The problem is obvious to anyone who’s tried. Without guardrails, every generation is a roll of the dice. One image comes back photorealistic. The next is cartoonish. A third nails the palette but the composition is wrong. You’re eyeballing each one, and eyeballing doesn’t scale. Worse, it doesn’t transfer. The person who finally cracked the right prompt for last week’s image isn’t around when someone else needs one this week.

This is the “prompt wizard” problem that every organization using generative AI runs into eventually. Someone develops an intuition for how to coax the right output from a model. That knowledge lives in their head. They become the spell-caster, and the rest of the team waits in line. It’s artisanal in the worst sense: unrepeatable, unverifiable, and fragile.

So I built a two-step workflow: generate an image, then judge it against my brand style guide. The loop takes about 30 seconds. Three iterations took me from 2/10 to 9/10. More importantly, the rubric means anyone can run it.

The setup

Step one: generate an image with Gemini.

Step two: send it to GPT-4o with a scoring rubric that describes my brand.

The rubric:

Rate how well this image matches the Pelican Brand style: minimalist flat vector illustration OR vintage hand-drawn printmaking sardine tin art. Key criteria: (1) limited color palette — navy, white, gold for flat vector OR 1-2 muted colors on cream for linocut, (2) correct style, (3) simple composition with the subject filling the frame, (4) NO photorealism, NO cartoon style, NO busy backgrounds. Score 1-10.

One generation model, one judge model, one rubric. The judge returns a score and an explanation of what’s wrong.

Run 1: the raw prompt (2/10)

I started with the kind of prompt you’d write for any image generator:

“A close-up of a magnifying glass held over a printed illustration. Through the lens, the image appears sharper and more defined. Warm studio lighting, shallow depth of field.”

The result looked like this:

Technically impressive. Completely off-brand. The judge scored it 2/10: “Photorealistic with complex shading and depth of field. Does not match either the flat vector or printmaking style.”

I knew this would fail. The point is that the judge articulates why in terms I can act on: too realistic, wrong style, too much detail.

Run 2: adapted to brand style (8.5/10)

I rewrote the prompt using a template I’d developed for the linocut style:

“Bold linocut block print illustration on cream paper. A circular feedback loop with four stations: a pencil, a picture frame, a magnifying glass, and a gauge. Thick arrows connect them in a cycle. Hand-carved woodcut style, imperfect ink edges, heavy line weight. Single color dark navy blue ink on cream paper. No text. Vintage printmaking aesthetic.”

Same concept. Completely different constraints:

The judge scored it 8.5/10: “Closely matches the vintage hand-drawn printmaking style. Limited color palette with navy on cream. Simple composition with the subject filling the frame.”

From 2 to 8.5 in one rewrite.

Run 3: refined from feedback (9/10)

The judge’s only complaint: composition could be tighter. I added “subject enclosed in a single bold circle, tightly cropped” and strengthened the ink texture language:

Score: 9/10. Three runs. Five minutes of prompt editing. The judge did the hard work of evaluating consistency.

Without the judge vs. with it

To make the difference concrete, I ran five short nautical prompts through both pipelines. Same concepts, same image model. The only difference: one has the brand judge, one doesn’t.

Without the judge (five short prompts, no guardrails):

Anchor in rough seas:

Lighthouse at night:

Ship in a bottle:

Sailor tying a knot:

Five prompts, five completely different visual styles. Photorealism, nature photography, portraiture, still life.

The prompt templates do the heavy lifting. Once I dialed them in with the judge’s feedback, producing a new on-brand image is a one-liner: swap in the concept, run the pipeline. No wizardry required.

The pattern

This isn’t really about images. It’s the same loop I keep running into with AI systems: close the gap between generation and evaluation.

Generate with whatever model you want
Judge against a written rubric using a different model
Read the feedback and adjust
Repeat until the score converges

The architecture is two API calls and a scoring prompt. You could wire this up with a script, but the reason I built it as a Jetty pipeline is that the same workflow runs for every image, every time, without someone babysitting it. Define the rubric once, and anyone on the team can generate on-brand images without becoming a prompt expert. The insight is that LLM-as-judge works for images, not just text. A written rubric can encode brand guidelines well enough to automate taste.
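Stripped of the model calls, the loop looks something like the sketch below. This is not the Jetty pipeline itself: `generate`, `judge`, and `revise` are placeholders you would back with real API calls (or, for `revise`, a human reading the feedback), and the fake versions exist only so the example runs offline.

```python
from typing import Callable


def refine(prompt: str,
           generate: Callable[[str], str],
           judge: Callable[[str], tuple[float, str]],
           revise: Callable[[str, str], str],
           target: float = 9.0,
           max_rounds: int = 5) -> tuple[str, float, int]:
    """Generate, judge, and revise until the score reaches `target`.

    Returns the last prompt tried, its score, and the number of rounds used.
    """
    for round_number in range(1, max_rounds + 1):
        output = generate(prompt)
        score, feedback = judge(output)
        if score >= target or round_number == max_rounds:
            return prompt, score, round_number
        prompt = revise(prompt, feedback)
    raise ValueError("max_rounds must be at least 1")


# Offline stand-ins: the "image" is just the prompt, and the judge rewards
# prompts that name the style and ask for a tight crop.
def fake_generate(prompt: str) -> str:
    return prompt


def fake_judge(output: str) -> tuple[float, str]:
    if "linocut" not in output:
        return 2.0, "add: linocut print on cream paper"
    if "tightly cropped" not in output:
        return 8.5, "add: tightly cropped"
    return 9.0, "on brand"


def fake_revise(prompt: str, feedback: str) -> str:
    return f"{prompt}, {feedback.removeprefix('add: ')}"


final_prompt, final_score, rounds = refine(
    "A magnifying glass over a print", fake_generate, fake_judge, fake_revise)
print(rounds, final_score, final_prompt)
```

The `max_rounds` cap is a deliberate choice. A judge and a generator that disagree forever should cost you a bounded number of API calls.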

This is what kills the spell-caster problem. The wizard’s intuition about “what works” gets externalized into a rubric that anyone can run. No one needs to know which magic words make Gemini produce a good linocut. The judge tells you what’s wrong, and the fix is usually obvious from the feedback. The taste lives in the rubric, not in someone’s head.

Every image I generate without this loop is a coin flip that either reinforces my brand or dilutes it. With the loop, the images converge. After a few rounds, you develop reusable prompt templates that score 8+ on the first try.

Try it

The whole pipeline is one workflow definition and a rubric. I built it by asking Jetty’s CLI to create a two-step pipeline: generate an image, then judge it. That’s it. The rubric is a text file. The prompt templates are text files. Once you have a look and feel that works, every new image is a one-line prompt with the concept swapped in.

If you want to adjust the judge, you don’t need to touch code. Update the rubric text and run it again. I’ve tweaked the scoring criteria three times since I started, each time just by editing what “on brand” means in plain English.

Fork the repo, swap in your own style guide, and run it. The loop works for any brand style, not just mine. Write a rubric that describes what “on brand” means for you. Be specific: name the colors, the style, what to avoid. The judge will tell you what’s wrong faster and more consistently than you can eyeball it.

If your brand images look different every time, the problem isn’t the image model. It’s the absence of a feedback loop.

Meter Before You Manage

Jonathan Lebensold — Sun, 01 Mar 2026 16:20:16 GMT

Your CFO pings you about a $47,000 “OpenAI API” line item. You know it’s high. You’ve known for months. But when they ask which features drive the spend, you can’t answer. Not because you’re hiding something. Because you don’t know.

Which endpoint? Which model? Which user flow? The monthly invoice is a single opaque line, and every conversation about it ends the same way: “We’ll optimize later.”

Later never comes. Optimization without instrumentation is guessing, and guessing doesn’t get sprint tickets.

The restaurant with no menu prices

Imagine running a restaurant where you can see total food cost each month but not which dishes drive it. You spent $40,000 on ingredients. Is the wagyu steak killing you, or is it the house salad that secretly costs twice what you charge? Cut portion sizes? Switch suppliers? Drop a menu item? Without per-dish cost data, every move is a guess.

The generic LLM cost advice floating around is the same kind of guessing. “Use smaller models.” For which calls? “Cache repeated queries.” Which queries repeat? “Shorten your prompts.” Which ones, and by how much?

Good strategies, all of them. None are actionable without metering.

What metering actually means

Metering isn’t observability. You might already have traces flowing into Langfuse or Arize. Good. But traces tell you what happened on individual requests. Metering tells you what your system costs in aggregate, broken down by dimensions you can act on.

Three layers, each building on the last.

Layer 1: The toll booth. Every LLM call goes through a proxy that records which model it hit, how many tokens it consumed, what it cost, and who requested it. LiteLLM is the most common choice here. One proxy, every provider, every call metered. No more reconciling your OpenAI bill against your Anthropic bill against your Azure bill in a spreadsheet. One ledger.

If your monthly cost conversation starts with someone logging into three different provider dashboards and adding numbers up manually, you’re at layer zero. Setting up a proxy is a day of work, maybe two. Per-team API keys give you basic allocation immediately.

Layer 2: Attribution. Raw metering gives you totals by model and endpoint. Attribution gives you totals by business dimension. “The FAQ chatbot costs $18,000 a month.” “The document extraction pipeline is 60% of our spend.” “Team X’s experimental feature costs more than the core product.”

This is where Langfuse traces become powerful. Not as individual request debuggers, but as the data source for cost attribution. Tag traces by feature, team, and customer tier. Aggregate. Now you have a menu with prices.

Layer 3: Optimization with evidence. Once you can see that the FAQ chatbot costs $18,000 a month, you can ask the right questions. Why so much? It handles 500,000 requests a month at 1,500 tokens average, mostly system prompt, on GPT-4o. Someone picked that model eight months ago and nobody revisited.

Now “use a smaller model” is actionable. Route FAQ classification to GPT-4o-mini and that $18,000 drops to under $3,000. Run the experiment. Measure quality. Show your CFO the before-and-after.

Without layers 1 and 2, layer 3 is a fantasy. Every optimization tip on the internet assumes you’ve done the metering work. Almost nobody has.
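Layers 1 and 2 reduce to a ledger and a group-by. The sketch below does not use LiteLLM or Langfuse; it only shows the shape of the data. The model names and per-token prices are placeholders, and `CallRecord` and `cost_by` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices in dollars per million tokens; substitute your providers' rates.
PRICE_PER_MTOK = {"big-model": 10.00, "small-model": 0.50}


@dataclass
class CallRecord:
    model: str
    tokens: int
    feature: str   # business tag used for attribution
    team: str

    @property
    def cost(self) -> float:
        return self.tokens / 1_000_000 * PRICE_PER_MTOK[self.model]


def cost_by(records: list[CallRecord], dimension: str) -> dict[str, float]:
    """Layer 2: roll the ledger up by a dimension such as 'feature' or 'team'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, dimension)] += record.cost
    return dict(totals)


# Layer 1: every call appends one row to a single ledger.
ledger = [
    CallRecord("big-model", 2_000_000, feature="faq-bot", team="support"),
    CallRecord("big-model", 1_000_000, feature="doc-extraction", team="ops"),
    CallRecord("small-model", 4_000_000, feature="faq-bot", team="support"),
]
print(cost_by(ledger, "feature"))
print(cost_by(ledger, "model"))
```

The same ledger answers both the provider question (by model) and the business question (by feature). That second view is the menu with prices.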

The scale of what’s hiding

Once you have metering, the waste is hard to miss.

Roughly a third of LLM queries are semantically similar to previous ones. A third of your spend, going to questions you’ve already answered. Semantic caching cuts 60-70% of those costs and drops latency from hundreds of milliseconds to double digits. But you can’t build a cache policy without knowing which queries repeat.

The price gap between model tiers is 20-30x, not 2x. Flagship models like GPT-4o run $2.50-10 per million tokens. Budget-tier alternatives like GPT-4o-mini or DeepSeek run a fraction of that. For classification, routing, and formatting tasks, the cheaper model often matches the expensive one. But “often” isn’t “always,” and you need per-task quality metrics to know which calls can safely move down.
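With per-task quality numbers in hand, deciding which calls can move down is a simple comparison. The scores below are invented, and the 0.02 allowance is an assumption you would set per task.

```python
def safe_to_downgrade(quality: dict[str, dict[str, float]],
                      max_drop: float = 0.02) -> list[str]:
    """List tasks where the budget model is within `max_drop` of the flagship."""
    movable = []
    for task, scores in quality.items():
        if scores["flagship"] - scores["budget"] <= max_drop:
            movable.append(task)
    return sorted(movable)


# Hypothetical eval results (share of cases judged correct) per task and tier.
quality = {
    "classification": {"flagship": 0.96, "budget": 0.95},
    "routing": {"flagship": 0.98, "budget": 0.98},
    "refund-decisions": {"flagship": 0.93, "budget": 0.81},
}
print(safe_to_downgrade(quality))
```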

These aren’t exotic findings. They’re the baseline state of most production AI systems. The waste is there whether you can see it or not. Metering just makes it visible.

Why “we’ll optimize later” never happens

I’ve talked to dozens of engineering leads who have “LLM cost optimization” on their roadmap. Not one has a sprint ticket for it.

The problem is activation energy. Without metering, optimization has no clear starting point. You’d need to audit every LLM call, figure out which models each endpoint uses, estimate traffic per route, benchmark alternatives, and build evaluation infrastructure to measure quality impact. That’s weeks of work with uncertain payoff spread across dozens of endpoints.

The mental math writes itself: “Two weeks analyzing costs for maybe 30% savings, or two weeks shipping a feature customers are asking for.” The feature wins every time.

Metering collapses this. When you can see that one chatbot feature costs $18,000/month on an overpowered model, the optimization becomes a two-hour task, not a two-week audit. The gap between “massive research project” and “swap a model string and validate” is just visibility.

The teams that optimize aren’t more disciplined. They instrumented earlier.

The uncomfortable parallel

Every company that’s been through cloud cost optimization knows this pattern. The early days of AWS were identical: teams spun up instances, ran workloads, and got a single bill at the end of the month. “Cloud spending” was one line item. It took years of tooling before teams could manage what they were spending. Cost allocation tags. Reserved instance planning. Right-sizing recommendations. Each layer made the next optimization possible.

LLM costs are at that same inflection point. The difference is speed. Cloud cost maturity took a decade. Enterprise LLM spending more than doubled in six months, from $3.5 billion to $8.4 billion. The scrutiny is coming faster than the tooling.

The teams that instrument now will be ready. The teams that don’t will keep promising finance they’ll optimize “next sprint,” knowing they can’t start what they can’t see.

AI Optimization Is a Game of Whack-a-Mole

Jonathan Lebensold — Thu, 26 Feb 2026 03:59:42 GMT

I watched a team spend three months optimizing their AI pipeline. Good engineers, plenty of budget. They’d fix one thing and something else would break. Latency improved but quality dropped. Quality recovered but costs crept back up. They’d tune the retrieval step and the summarization step would start hallucinating.

After three months, I asked them to show me the net improvement. They couldn’t. Not because it was negative, but because it was unmeasurable. They had no baseline from before the work started, no way to tell whether the system was better than it was ninety days ago.

They’d been playing whack-a-mole.

The pattern

Here’s how it usually goes.

A team ships an AI feature. It works. Users like it. Leadership wants it faster, cheaper, more reliable. So an engineer starts optimizing.

They look at costs. GPT-5 looks expensive for a classification step, so they swap that step to GPT-4o-mini. It’s cheaper, it’s fast, and for classification the quality should be comparable. They test against a handful of examples. Everything looks good.

A week later, a support ticket comes in. The classifier is miscategorizing a specific type of input that the old model handled correctly. Nobody connects the ticket to the model swap because it doesn’t look like a regression. It looks like a new bug.

Meanwhile, another engineer shortens the generation prompt to reduce token costs. Average outputs improve. They’re more concise, less repetitive. But a long tail of edge cases that the verbose prompt used to handle now produce garbled responses. The team doesn’t notice because they’re testing against examples they assembled two months ago, not against the inputs the system actually sees.

A month later, someone upgrades the embedding model to improve retrieval quality. Retrieval gets better. But the downstream summarizer was tuned for the old embedding distribution, and now it sometimes misinterprets what the retrieval step returns. The summarizer didn’t change. The retrieval step didn’t break. The interface between them shifted, and nobody was watching.

Each individual change was defensible. Each one made some metric better. And the system, in aggregate, isn’t meaningfully different from where it started.

Why this happens

The whack-a-mole pattern has a few root causes, and they compound.

No baselines. Most teams don’t snapshot their system’s performance before starting optimization work. They have a vague sense that it costs too much or quality isn’t where it should be, but no structured measurement to compare against. Without a baseline, improvement is an impression, not a fact.

Stale evaluation sets. The examples teams test against are usually assembled once, early in development, and rarely updated. They represent what the system used to see, not what it sees now. Real user inputs drift. New edge cases emerge. The evaluation set becomes a shared fiction that everyone treats as ground truth.

Invisible dependencies. AI pipelines aren’t linear. A change to retrieval affects generation. A change to the prompt affects how the model uses retrieved context. A change to the model affects how it interprets the prompt. These coupling effects are hard to predict and easy to miss when you’re only measuring the step you changed.

And then there’s the one that deserves its own section.

The model swap trap

Many teams I’ve worked with have done some version of this: a new model comes out, it’s cheaper or faster or scores higher on a benchmark, so they swap it in. The public leaderboard says it’s better. The provider’s blog post says it’s better. The handful of test cases they run say it’s better.

It isn’t. Not universally.

Models don’t improve uniformly across all capabilities. A model that scores higher on MMLU might handle tool calls differently. A model that’s faster might be more terse, great for chat but terrible for document generation. A model trained on newer data might have different failure modes than the one you spent weeks tuning prompts for.

The problem isn’t that the new model is worse. “Better” is a distribution, not a scalar. The new model is better on average and worse on a long tail of edge cases your evaluation set doesn’t cover, because it was built before those edge cases existed.

This is why teams end up in whack-a-mole. They make a change that’s net positive on the metrics they track and net negative on metrics they don’t. Then they fix the newly visible problem, which introduces a new invisible one.

We’ve been here before

If you were writing software in the early 2000s, this pattern has a familiar shape.

Before continuous integration, teams developed in isolation for weeks, then merged everything together and prayed. “Works on my machine” was the official status report. Integration day was a reckoning. Bugs appeared at the boundaries between components, and nobody could tell whose change caused which failure.

CI solved this by making integration continuous. Every change was tested against the whole system immediately. You didn’t discover boundary failures at the end. You discovered them when they were introduced, when the context was fresh and the blast radius was small.

AI teams are in the pre-CI era right now. They develop changes in isolation, test against static evaluation sets, and deploy into a system where everything else has also changed. The boundary failures show up days or weeks later, disconnected from the change that caused them.

The fix is the same one. A continuous loop where every change is evaluated against the current state of the whole system, using data that reflects what the system actually encounters in production.

What the exit looks like

The exit from whack-a-mole isn’t working harder or being smarter about which changes to make. It’s building the measurement infrastructure that tells you whether your changes are helping.

Start with a real baseline. Before you optimize anything, measure your system end-to-end against production-representative data. Cost per trace. Quality per step. Error rates by input type. Latency distributions. This is your starting line.

Evaluate the system, not the step. When you swap a model or change a prompt, don’t just test that step in isolation. Run the whole pipeline. The dependencies between steps are where whack-a-mole lives.

Refresh your evaluation data. Your test set should be a living sample of what your system actually sees, not a museum of what it used to see. Pull from production traces. Include the weird inputs, the edge cases, the failures. If your evaluation set hasn’t changed in a month, it’s already lying to you.

Measure before and after every change. Almost nobody does this. Run your evaluation suite, make the change, run it again. If you can’t show a net improvement across the metrics that matter, the change isn’t an improvement. It’s a lateral move.

Track the aggregate. Individual step metrics are useful but insufficient. You need a system-level view: cost per successful outcome, end-to-end quality, cumulative error rates. If step metrics go up but system metrics stay flat, you’re playing whack-a-mole with extra steps.
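Here is a small sketch of a before-and-after check on one system-level metric, cost per successful outcome. The traces are invented, and judging a change on a single metric is a simplification; in practice you would compare several.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    cost: float       # dollars spent on this end-to-end run
    succeeded: bool   # did the whole pipeline produce an acceptable outcome?


def cost_per_success(traces: list[Trace]) -> float:
    """System-level metric: total spend divided by successful outcomes."""
    successes = sum(t.succeeded for t in traces)
    if successes == 0:
        return float("inf")
    return sum(t.cost for t in traces) / successes


def verdict(before: list[Trace], after: list[Trace]) -> str:
    """Compare the same eval suite run before and after a change."""
    old, new = cost_per_success(before), cost_per_success(after)
    if new < old:
        return "improvement"
    return "lateral move or regression"


# Hypothetical runs: the change halves per-run cost but loses half the successes.
before = [Trace(0.50, True), Trace(0.50, True), Trace(0.50, True), Trace(0.50, True)]
after = [Trace(0.25, True), Trace(0.25, False), Trace(0.25, True), Trace(0.25, False)]
print(cost_per_success(before), cost_per_success(after), verdict(before, after))
```

The step metric (cost per run) was cut in half. The system metric did not move. That is whack-a-mole in two lists of numbers.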

The uncomfortable truth

The teams I’ve seen break out of the whack-a-mole cycle aren’t the ones with the best engineers or the most sophisticated models. They’re the ones that invested in measurement before they invested in optimization.

That’s a hard sell. When leadership wants costs down by next quarter, the natural response is to start cutting. Swap a model, shorten a prompt, add a cache. Each change feels productive. The dashboard numbers move. But “productive” and “effective” are different things when you have no infrastructure to tell them apart.

The question worth asking isn’t “what should we optimize next?” It’s “do we have any way to know if our last optimization actually worked?”

If the answer is no, that’s the first thing to fix.

Foundation Models Ship Like Windows 98

Jonathan Lebensold — Tue, 17 Feb 2026 03:30:32 GMT

Early in my career, I watched coworkers dread Oracle upgrades the way you’d dread a root canal. Weeks before a release, the mood would shift. Development froze. Test plans ran to a hundred pages. Someone would pick a Friday night, and the whole team would brace for impact. If the upgrade failed, Saturday was a rollback. If it succeeded, Monday was triage.

We called these Big Bang releases. Months of change compressed into a single explosive moment, with everyone hoping the blast didn’t take production down with it.

The pressure was the part that stayed with me. As release dates approached, you could feel it building across the team. Not excitement. Dread. The kind of tension that comes from knowing you’ve accumulated so much change that nobody can predict what will happen when you flip the switch.

The software industry spent the next fifteen years dismantling this approach. Continuous integration. Automated testing. Feature flags. Deploy-on-merge.

The early adopters made the rest of us look like we were standing still. In 2009, Flickr’s John Allspaw and Paul Hammond got on stage at Velocity and announced they were deploying to production ten or more times a day. That talk kicked off the DevOps movement. Etsy took it further — new engineers shipped code to production on their first day, and by 2014 they were pushing 80+ deploys daily with 150 engineers. Amazon hit a deploy every 11.6 seconds by 2011. Netflix open-sourced Spinnaker and built a deployment culture so confident they’d randomly kill production services just to prove they could recover.

By 2015, a team of two thousand engineers could push hundreds of commits a day, each one tested, deployed, and monitored independently. The Big Bang was dead.

Or so I thought.

Shrink-wrap AI

Watch how foundation model providers ship. OpenAI releases GPT-4, then GPT-4 Turbo, then GPT-4o. Anthropic releases Claude 3, then 3.5, then 4. Each version is a major event with a blog post, a benchmark table, and a wave of developers scrambling to figure out what changed.

This is shrink-wrap software. It’s the same mentality that gave us Oracle 8i and Windows ME: build it behind closed doors, stamp a version number on it, and push it out the door. The release itself is the artifact. Everything that happened between versions is invisible to the people who depend on the product.

For the model providers, this might be unavoidable. Training runs are expensive, and you can’t exactly deploy a half-trained model to production. But for every team building on top of these models, inheriting the shrink-wrap mentality is a choice. And it’s the wrong one.

The second Big Bang

Here’s what I see in practice. A team builds an AI feature against GPT-4. They test it, tune the prompts, get the outputs to an acceptable quality. Ship it. Months pass. OpenAI deprecates the model version. Or a new model comes out that’s cheaper and supposedly better.

Now the team faces a familiar decision: migrate everything at once or fall behind. They pick a weekend. They swap the model. They rerun their evaluation suite, built on a gold dataset they assembled six months ago, and the numbers look fine.

Monday morning, support tickets start rolling in. The new model handles tool calls differently. It’s more verbose in some cases, terser in others. Edge cases that the old model handled gracefully now produce hallucinations. The evaluation suite didn’t catch any of this because it was testing against a frozen snapshot of reality, not reality itself.

This is the Big Bang release, reborn. Different technology, same failure mode.

What CI actually solved

The insight behind continuous integration wasn’t just “deploy more often.” It was that small, incremental changes are fundamentally easier to reason about than large ones.

When you deploy a single commit, you know exactly what changed. If something breaks, you know where to look. When you deploy six months of accumulated changes in one shot, you’re debugging a combinatorial explosion. Every change interacts with every other change, and the failure could be anywhere in the stack.

CI worked because it made each change small enough to understand, test, and roll back independently. It didn’t eliminate risk. It made risk manageable.

AI needs the same loop

The parallel to AI systems is direct, but the loop looks different.

In traditional CI, the cycle is: write code, run tests, deploy, monitor, repeat. The code changes. The tests and infrastructure stay relatively stable.

In AI systems, everything moves. The model changes when providers ship updates. The data changes as users interact with your system. The prompts change as your team iterates. The retrieval corpus changes as new documents get indexed. You can’t hold any of it still long enough to test it the way you’d test a REST endpoint.

So what does CI for AI actually look like?

Grab production data. Not a curated test set from six months ago. Actual inputs and outputs from your running system. This is the raw signal that tells you how your system behaves in the real world.

Sanitize and transform. Strip PII, anonymize where needed, handle compliance requirements. This is non-trivial but it’s infrastructure, not a blocker. Teams that treat privacy as a reason to avoid production data entirely end up flying blind.

Label and categorize. Not everything needs human labeling. Cluster similar inputs. Flag anomalies. Use LLM-as-judge for initial quality assessments. Build a living picture of what your system handles well and where it struggles.

Benchmark against the next iteration. Before you swap a model, change a prompt, or update your retrieval pipeline, run the proposed change against your current production-derived evaluation set. Not a public leaderboard. Not an academic benchmark. Your data, your users, your edge cases.

Deploy incrementally. Don’t swap everything at once. Canary the change. Route 5% of traffic to the new configuration. Compare quality metrics side by side. Expand or roll back based on evidence, not hope.

Feed the results back. The outputs from this cycle become the inputs to the next one. New failure modes get added to the evaluation set. Successful optimizations become the new baseline. The loop tightens with every iteration.

This is the CI/CD loop applied to AI. It’s not a metaphor. It’s the same engineering discipline, adapted for a system where both the code and the data are moving targets.
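The incremental-deploy step is the easiest one to sketch in code. Below is a minimal, assumed implementation of a 5% canary using a stable hash of a request identifier; `in_canary` and the `request_id` format are hypothetical names, and a real rollout would typically live in a feature-flag or routing layer.

```python
import hashlib

def in_canary(request_id: str, percent: float = 5.0) -> bool:
    """Deterministically assign a request to the canary configuration.

    Hashing the ID (rather than calling random()) keeps the assignment stable,
    so the same request or user always sees the same configuration.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100

# Hypothetical request IDs, used only to check the split.
ids = [f"req-{i}" for i in range(20_000)]
share = sum(in_canary(r) for r in ids) / len(ids)
print(f"canary share: {share:.1%}")
```

Expanding the rollout is then a matter of raising `percent`, and rolling back is setting it to zero. Quality metrics for the two groups can be compared side by side because membership is reproducible.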

Why teams resist this

The most common objection is that it’s too much infrastructure for the payoff. “We’re a small team. We just need to ship features.”

I heard the exact same argument against automated testing in 2008. Teams that skipped CI shipped faster for a quarter, then spent the next year debugging integration failures and botched releases. The teams that invested early moved slower at first but compounded their advantage over time.

The second objection is that AI is different. Models are black boxes. You can’t unit test a neural network the way you test a function.

This is true and also beside the point. You don’t need to unit test the model. You need to continuously evaluate the system: the model plus the prompts plus the retrieval plus the post-processing plus the guardrails. The system is what your users interact with. The system is what you can measure and improve.

The compounding advantage

Teams that close this loop gain something that’s hard to replicate: a continuously improving evaluation set derived from their actual production traffic. Every week, their benchmarks get more representative. Every iteration catches more edge cases. Every deployment carries less risk because the safety net is woven from real-world data, not synthetic test cases.

Teams that don’t close the loop are stuck in the 90s. They test against stale datasets. They deploy model swaps as Big Bang releases. They discover problems from support tickets instead of automated analysis. Each deployment is a gamble, and the odds don’t improve with time.

The gap is closing

The tooling for AI CI/CD is maturing fast. Platforms like Langfuse give you the trace data. LLM-as-judge frameworks handle automated quality assessment. Prompt management tools support versioning and A/B testing. The pieces exist.

What’s missing for most teams isn’t tooling. It’s the mindset shift. The recognition that AI systems aren’t something you build, test, and ship. They’re something you continuously operate, measure, and improve. That’s what CI meant for software. It’s what CI needs to mean for AI.

The software industry took a decade to move from Big Bang releases to continuous delivery. AI systems are being deployed to millions of people right now. We don’t have a decade. But we do have the playbook. We’ve done this before.

The question is whether your team is running the 2025 version of the loop, or the 1998 one.

Stop Building Against Gold Datasets

Jonathan Lebensold — Sun, 15 Feb 2026 15:55:13 GMT

In grad school, the first thing you learn in any machine learning course is to find a good dataset. MNIST for vision. SQuAD for reading comprehension. GLUE for language understanding. These are “gold datasets”: carefully curated, cleanly labeled, academically blessed. They exist so you can focus on the algorithm and avoid getting tangled in messy data problems.

This makes sense in a classroom. It makes no sense in production. Gold datasets teach you to think about data as something you obtain once, hold still, and measure against. That assumption breaks the moment your system touches real users.

The dream of separating code from data

Software engineering spent decades pulling code and data apart. FORTRAN had DATA statements that hardcoded values directly into the program. COBOL formalized the separation with its DATA DIVISION. From there, the trajectory was consistent: relational databases, config files, environment variables. The Twelve-Factor App made it doctrine: “strict separation of config from code.”

Then foundation models arrived and broke the assumption entirely.

An LLM’s behavior is inseparable from the data it was trained on, the examples in its prompt, and the documents retrieved by its RAG pipeline. Change any of these and you change what the system does. The data is the code. We’ve come full circle, back to FORTRAN’s DATA statements, except now the data is a few billion parameters and a prompt template that gets rewritten every sprint.

What gold datasets actually measure

Here’s the uncomfortable truth about the benchmarks the ML community treats as ground truth: they’re often wrong.

A 2024 analysis of MMLU, the benchmark most commonly cited when comparing LLM capabilities, found that 6.5% of questions contain errors. In the virology subset, 57% of questions had problems: wrong answers, ambiguous phrasing, missing context. The maximum achievable score isn’t 100%, and nobody agrees on what it actually is.

When researchers added adversarial distractor sentences to SQuAD reading comprehension passages, average model accuracy, measured as F1 score, dropped from 75% to 36%. With ungrammatical adversarial sequences, it fell to 7%. The models weren’t reading. They were pattern-matching against a dataset they’d learned to game.

The deeper problem is Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Models that score 87% on HumanEval for code generation drop to around 30% accuracy on real-world codebases with cross-file dependencies, internal frameworks, and legacy patterns. The benchmark says the model is excellent. Production says otherwise.

The team tension nobody talks about

I watched this dynamic play out repeatedly before and while building Jetty.

The data science team builds a pipeline against a clean CSV from six months ago. The data is tidy. The labels are consistent. The model performs beautifully. Meanwhile, the developer integrating it into production knows that real inputs look nothing like that CSV. Fields are missing. Formats are inconsistent. Users submit things nobody anticipated.

Both sides are acting rationally. The data scientist needs controlled conditions to iterate on the model. The developer needs to ship something that works for actual users. The gold dataset becomes a shared fiction: everyone references it, nobody fully trusts it, and the gap between lab performance and production reality widens with every sprint.

This isn’t a communication problem. It’s a structural one. Gold datasets encode the assumption that you can fix your data, fix your model, and measure once. AI systems break that assumption. The data distributions shift, the models get updated, the retrieval systems change, the code evolves. A frozen dataset tells you how the system performed against a historical snapshot, not how it performs right now.

Data collection is not a phase

The alternative isn’t “better gold datasets.” It’s abandoning the concept entirely and treating data collection as a continuous, live process wired into production.

Shreya Shankar’s research put it bluntly: “We have no idea how models will behave in production until production.” Her interviews with ML engineers across chatbots, autonomous vehicles, and finance found a consistent pattern: the teams that succeed close the loop fastest, continually cycling between data collection, experimentation, staged evaluation, and monitoring.

This is the data flywheel. Your product generates data. You analyze that data to find where the system fails. You fold those failures back into your evaluation sets and training data. Each cycle makes the next one more valuable, because you’re measuring against reality, not a proxy for it.

Hamel Husain makes a practical version of this argument: start with error analysis. Look at your production data. Categorize failures. Write evals that catch the failures you’ve actually seen. He warns against chasing high pass rates. If you’re at 100%, you’re not stress-testing your system. A 70% pass rate on meaningful, production-derived evals tells you more than a 95% pass rate on a benchmark that doesn’t reflect your users.

But what about PII?

The main objection I hear is privacy. “We can’t use production data — it contains PII. Regulatory compliance makes this impossible.”

This is real. But it’s not a reason to default to stale snapshots. It’s a reason to invest in the infrastructure that makes live data collection safe. Anonymization pipelines. Differential privacy. Synthetic data generation from production distributions. None of this is trivial, but the alternative is worse. You’re not protecting users by ignoring how your system behaves in the wild. You’re deferring the risk.

The privacy experts I’ve worked with are far more worried about uninstrumented systems that can’t detect when they fail on a vulnerable population than about well-designed systems that safely collect production signals. Avoiding production data entirely isn’t caution. It’s the opposite of it.

The first step

If your team is still building against a gold dataset, here’s where to start: pick one pipeline that matters, instrument it to capture real inputs and outputs, and run your existing evals against production data instead of your test set. You will be surprised by the gap. Production is weirder, messier, and more varied than any curated dataset can capture. That gap is the information you need. Gold datasets hide it. Live data reveals it.
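A toy illustration of that gap, under loud assumptions: `parse_date` stands in for the system under test, and both example sets are invented. The point is only the shape of the comparison, the same check run against a curated set and against a production sample.

```python
from datetime import datetime

def parse_date(text):
    """Toy stand-in for the system under test: accepts ISO dates only."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d").date().isoformat()
    except ValueError:
        return None

def pass_rate(examples, system):
    """Fraction of (input, expected) pairs the system gets exactly right."""
    passed = sum(1 for inp, expected in examples if system(inp) == expected)
    return passed / len(examples)

# Hypothetical curated set: tidy inputs only.
gold = [("2024-01-05", "2024-01-05"), ("2023-12-31", "2023-12-31")]

# Hypothetical production sample: the same task, messier inputs.
production = [
    ("2024-01-05", "2024-01-05"),
    ("01/05/2024", "2024-01-05"),
    (" 2024-02-10 ", "2024-02-10"),
    ("Jan 5, 2024", "2024-01-05"),
]

gap = pass_rate(gold, parse_date) - pass_rate(production, parse_date)
print(f"gold: {pass_rate(gold, parse_date):.0%}  "
      f"production: {pass_rate(production, parse_date):.0%}  gap: {gap:.0%}")
```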

The question isn’t whether your gold dataset is good enough. It’s whether you can afford to keep pretending that a frozen snapshot tells you anything useful about a system that never stops changing.

Observability Won’t Save Your AI System

Jonathan Lebensold — Fri, 13 Feb 2026 15:16:22 GMT

72% of API calls were redundant. The data was sitting right there in Langfuse — every trace, every duplicate input, every wasted dollar — for months. Nobody noticed.

This wasn’t a startup running on duct tape. It was a production extraction service with proper instrumentation, dashboards, alerts. They had observability. What they didn’t have was anyone systematically analyzing what the observability was showing them.

I see teams invest real effort getting traces flowing into Langfuse or Arize or whatever platform they’ve chosen. They build dashboards. They set up alerts on latency and error rates. Then they treat the problem as solved.

It isn’t.

The dashboard paradox

Dashboards are great at confirming what you already suspect:

Latency spiking? Check the dashboard.
Heard about an outage? Check the dashboard.

But dashboards are terrible at surfacing what you don’t know to look for.

The duplicate calls weren’t hiding. They were in plain sight — scattered across thousands of individual traces. Any engineer could have pulled the data, grouped by input hash, and seen the duplication. But why would they? The system was working. Requests went in, responses came out. The dashboard showed green.
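Grouping by input hash takes only a few lines. The sketch below assumes each trace input can be serialized as JSON; the helper names and the sample payloads are hypothetical, and a real analysis would read inputs from an export of the tracing platform.

```python
import hashlib
import json
from collections import Counter

def input_hash(payload: dict) -> str:
    """Hash a canonical JSON form so key order does not create false uniques."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def redundancy_rate(trace_inputs):
    """Share of calls whose exact input had already been seen."""
    counts = Counter(input_hash(p) for p in trace_inputs)
    total = sum(counts.values())
    return (total - len(counts)) / total if total else 0.0

# Hypothetical inputs pulled from traces.
sample_inputs = [
    {"doc_id": "a1", "task": "extract"},
    {"task": "extract", "doc_id": "a1"},  # same input, different key order
    {"doc_id": "a1", "task": "extract"},
    {"doc_id": "b7", "task": "extract"},
]

print(f"redundant calls: {redundancy_rate(sample_inputs):.0%}")
```

Exact-match hashing is the conservative version of this check. It misses near-duplicates (whitespace changes, reordered list items), so the number it reports is a lower bound on the redundancy.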

When was the last time you looked at your traces and found something you didn’t already know?

That’s the gap. Not data collection — the tools handle that well. The gap is between having data and acting on data. Between seeing traces individually and understanding what they mean in aggregate.

We’ve been here before

The software industry went through this exact evolution over the past two decades.

First came logs. Teams shipped code and hoped for the best. When something broke, you SSH’d into a box and tailed a log file. I spent years convincing engineering leaders that widespread monitoring was worth the investment. Eventually it became table stakes.

Then came the dashboards: Nagios, Munin, then Datadog and New Relic. CPU, memory, request rates, error counts. You could see problems faster. But you still had to know what to look for.

More recently, we have OpenTelemetry and APM (application performance monitoring): tools that don’t just collect metrics but trace requests end-to-end, correlate events across services, and surface anomalies automatically. The shift wasn’t more data; it was smarter analysis of the data you already had.

And finally, automated remediation. Auto-scaling, self-healing infrastructure, chaos engineering. Systems that didn’t just detect problems but responded to them. Each layer built on the one before it. Nobody skipped from logs straight to auto-remediation. But nobody stopped at logs either.

AI observability is stuck at layer one

Most AI teams today are somewhere between logs and monitoring. They’ve got traces flowing. They’ve got dashboards. Some have alerts. This is genuinely important work — platforms like Langfuse have made it dramatically easier to see what’s happening inside LLM-powered systems.

But it’s layer one.

A single trace looks fine. Ten thousand traces reveal that 40% of your spend goes to resending the same system prompt on every conversational turn. Your error rate is 5% overall — sounds acceptable — until you break it apart and find one pipeline step fails 40% of the time, masked by the steps that never fail. You know your monthly LLM bill, but not which workflow drives it, which model version inflates it, or which operations could drop to a cheaper model without quality loss.
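The error-rate example can be reproduced with simple arithmetic. Assume, purely for illustration, a pipeline of eight steps with 1,000 calls each, where one step fails 40% of the time and the rest never fail: 400 failures out of 8,000 calls is the 5% headline number. The step names and record fields below are hypothetical.

```python
from collections import defaultdict

# Hypothetical step-level records: 8 steps x 1,000 calls, only step_3 ever fails.
records = [
    {"step": f"step_{s}", "failed": s == 3 and i < 400}
    for s in range(8)
    for i in range(1000)
]

def error_rate_by_step(records):
    """Break the aggregate error rate apart by pipeline step."""
    totals, failures = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["step"]] += 1
        failures[r["step"]] += r["failed"]
    return {step: failures[step] / totals[step] for step in totals}

overall = sum(r["failed"] for r in records) / len(records)
by_step = error_rate_by_step(records)

print(f"overall: {overall:.0%}")
for step, rate in sorted(by_step.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{step}: {rate:.0%}")
```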

Your system worked great three months ago. Something changed — a model version, a data distribution, a prompt template — and quality degraded slowly enough that no alert fired. Layer one doesn’t watch for gradual shifts. It watches for threshold breaches.
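A sketch of the difference between a threshold alert and a drift check, using invented daily quality scores. The window size, the allowed drop, and the 0.75 alert threshold are assumptions for illustration, not recommended values.

```python
from statistics import mean

def drift(scores, window=7, max_drop=0.05):
    """Compare the most recent window against the earliest one.

    Returns None when there is not enough history for two full windows.
    """
    if len(scores) < 2 * window:
        return None
    baseline = mean(scores[:window])
    recent = mean(scores[-window:])
    return {
        "baseline": baseline,
        "recent": recent,
        "drifted": (baseline - recent) > max_drop,
    }

# Hypothetical daily quality scores: a slow slide of half a point per day.
scores = [0.90 - 0.005 * day for day in range(30)]

ALERT_THRESHOLD = 0.75  # a static alert of the kind dashboards usually carry
threshold_fired = any(s < ALERT_THRESHOLD for s in scores)
result = drift(scores)

print(f"threshold alert fired: {threshold_fired}")
print(f"drift detected: {result['drifted']} "
      f"({result['baseline']:.3f} -> {result['recent']:.3f})")
```

In this example the score never crosses the static threshold, so no alert fires, yet the recent week sits more than eleven points below the first one.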

The action gap

I’ve analyzed traces from many production AI systems, and the pattern is remarkably consistent: every project has observability in place, and every project has obvious optimization opportunities sitting unnoticed in the data.

The median finding is 30–60% cost savings from fixes that take less than an hour. A response cache here. A model downgrade there. A prompt caching config change. A single pinned model version instead of five old ones running simultaneously.

These aren’t exotic fixes. They’re the kind of thing any senior engineer would implement immediately — if someone pointed them out. The problem isn’t capability. It’s attention. Nobody’s job is to stare at every trace and notice that the same input keeps showing up.

Observability tells you what happened. Analysis tells you what it means. The layer most teams are missing is the one that turns analysis into specific, prioritized actions.

What the next layer looks like

The APM analogy isn’t just historical color. It’s predictive. The same evolutionary pressure that pushed software monitoring toward automated analysis is now pushing AI observability in the same direction.

Layer two is automated analysis. Not more dashboards — systematic examination of traces that surfaces patterns humans miss. Redundancy detection. Cost decomposition. Error clustering. Quality drift measurement. The kind of analysis you’d do if you had infinite time and perfect attention, run continuously against your production data.

Layer three is automated action. Analysis produces recommendations. Recommendations become pull requests. A system that doesn’t just tell you “you’re spending too much on model X” but opens a PR that swaps it for a cheaper alternative and shows you the quality comparison.

We’re not at layer three yet. But layer two is here, and most teams haven’t adopted it.

The uncomfortable question

When was the last time you did a systematic analysis of your traces? Not a dashboard check. Not an incident investigation. A deliberate, comprehensive review of what your system is actually doing.

If the answer is “never” or “I’m not sure,” you’re not alone. That’s almost everyone. And it means there are patterns in your data — waste, errors, drift — that you haven’t found yet.

Observability was the right first step. But the teams that pull ahead won’t be the ones with the best dashboards. They’ll be the ones that close the gap between seeing and acting.