AI Optimization Is a Game of Whack-a-Mole

The measurement problem hiding inside your optimization work

Feb 26, 2026

I watched a team spend three months optimizing their AI pipeline. Good engineers, plenty of budget. They’d fix one thing and something else would break. Latency improved but quality dropped. Quality recovered but costs crept back up. They’d tune the retrieval step and the summarization step would start hallucinating.

After three months, I asked them to show me the net improvement. They couldn’t. Not because it was negative, but because it was unmeasurable. They had no baseline from before the work started, so there was no way to tell whether the system was better than it had been ninety days earlier.

They’d been playing whack-a-mole.

The pattern

Here’s how it usually goes.

A team ships an AI feature. It works. Users like it. Leadership wants it faster, cheaper, more reliable. So an engineer starts optimizing.

They look at costs. Say GPT-5 looks expensive for what it’s doing, so they swap the classification step to GPT-4o-mini. It’s cheaper, it’s faster, and for classification the quality should be comparable. They test against a handful of examples. Everything looks good.

A week later, a support ticket comes in. The classifier is miscategorizing a specific type of input that the old model handled correctly. Nobody connects the ticket to the model swap because it doesn’t look like a regression. It looks like a new bug.

Meanwhile, another engineer shortens the generation prompt to reduce token costs. Average outputs improve. They’re more concise, less repetitive. But a long tail of edge cases that the verbose prompt used to handle now produce garbled responses. The team doesn’t notice because they’re testing against examples they assembled two months ago, not against the inputs the system actually sees.

A month later, someone upgrades the embedding model to improve retrieval quality. Retrieval gets better. But the downstream summarizer was tuned for the old embedding distribution, and now it sometimes misinterprets what the retrieval step returns. The summarizer didn’t change. The retrieval step didn’t break. The interface between them shifted, and nobody was watching.

Each individual change was defensible. Each one made some metric better. And the system, in aggregate, isn’t meaningfully different from where it started.

Why this happens

The whack-a-mole pattern has a few root causes, and they compound.

No baselines. Most teams don’t snapshot their system’s performance before starting optimization work. They have a vague sense that it costs too much or quality isn’t where it should be, but no structured measurement to compare against. Without a baseline, improvement is an impression, not a fact.

Stale evaluation sets. The examples teams test against are usually assembled once, early in development, and rarely updated. They represent what the system used to see, not what it sees now. Real user inputs drift. New edge cases emerge. The evaluation set becomes a shared fiction that everyone treats as ground truth.

Invisible dependencies. AI pipelines aren’t linear. A change to retrieval affects generation. A change to the prompt affects how the model uses retrieved context. A change to the model affects how it interprets the prompt. These coupling effects are hard to predict and easy to miss when you’re only measuring the step you changed.

And then there’s the one that deserves its own section.

The model swap trap

Many teams I’ve worked with have done some version of this: a new model comes out, it’s cheaper or faster or scores higher on a benchmark, so they swap it in. The public leaderboard says it’s better. The provider’s blog post says it’s better. The handful of test cases they run say it’s better.

It isn’t. Not universally.

Models don’t improve uniformly across all capabilities. A model that scores higher on MMLU might handle tool calls differently. A model that’s faster might be more terse, great for chat but terrible for document generation. A model trained on newer data might have different failure modes than the one you spent weeks tuning prompts for.

The problem isn’t that the new model is worse. “Better” is a distribution, not a scalar. The new model can be better on average and still worse on a long tail of edge cases your evaluation set doesn’t cover, because that set was built before those edge cases existed.
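To make the "distribution, not a scalar" point concrete, here is a minimal sketch that compares two models per input slice instead of by a single average. The slice names, the scores, and the helper functions are all hypothetical; a real version would read scored results from your own evaluation runs.

```python
from collections import defaultdict
from statistics import mean

def mean_by_slice(results):
    """Group (slice_name, score) pairs by slice and average each group."""
    grouped = defaultdict(list)
    for slice_name, score in results:
        grouped[slice_name].append(score)
    return {name: mean(scores) for name, scores in grouped.items()}

def regressed_slices(old, new, tolerance=0.0):
    """Return the slices where the new model scores lower than the old one."""
    old_means, new_means = mean_by_slice(old), mean_by_slice(new)
    return sorted(
        name for name in old_means
        if name in new_means and new_means[name] < old_means[name] - tolerance
    )

# Hypothetical scores (1.0 = correct, 0.0 = wrong), tagged by input slice.
old_model = [("common", 1.0), ("common", 0.0), ("common", 1.0), ("common", 0.0),
             ("rare_format", 1.0), ("rare_format", 1.0)]
new_model = [("common", 1.0), ("common", 1.0), ("common", 1.0), ("common", 1.0),
             ("rare_format", 1.0), ("rare_format", 0.0)]

old_overall = mean(score for _, score in old_model)
new_overall = mean(score for _, score in new_model)
print(f"overall: {old_overall:.2f} -> {new_overall:.2f}")
print("regressed slices:", regressed_slices(old_model, new_model))
```

In this made-up data the overall average goes up while the rare slice gets worse, which is exactly the case a single headline number hides.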

This is why teams end up in whack-a-mole. They make a change that’s net positive on the metrics they track and net negative on metrics they don’t. Then they fix the newly visible problem, which introduces a new invisible one.

We’ve been here before

If you were writing software in the early 2000s, this pattern has a familiar shape.

Before continuous integration, teams developed in isolation for weeks, then merged everything together and prayed. “Works on my machine” was the official status report. Integration day was a reckoning. Bugs appeared at the boundaries between components, and nobody could tell whose change caused which failure.

CI solved this by making integration continuous. Every change was tested against the whole system immediately. You didn’t discover boundary failures at the end. You discovered them when they were introduced, when the context was fresh and the blast radius was small.

AI teams are in the pre-CI era right now. They develop changes in isolation, test against static evaluation sets, and deploy into a system where everything else has also changed. The boundary failures show up days or weeks later, disconnected from the change that caused them.

The fix is the same one. A continuous loop where every change is evaluated against the current state of the whole system, using data that reflects what the system actually encounters in production.

What the exit looks like

The exit from whack-a-mole isn’t working harder or being smarter about which changes to make. It’s building the measurement infrastructure that tells you whether your changes are helping.

Start with a real baseline. Before you optimize anything, measure your system end-to-end against production-representative data. Cost per trace. Quality per step. Error rates by input type. Latency distributions. This is your starting line.
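A baseline can be as simple as a small summary computed from a batch of end-to-end traces. The sketch below assumes each trace is a dict with hypothetical fields (input_type, cost_usd, latency_ms, error, step_quality); your tracing format will differ, and the quality scores would come from whatever grader you trust.

```python
import json
from collections import defaultdict
from statistics import mean, quantiles

def build_baseline(traces):
    """Summarize end-to-end traces into a snapshot to compare against later."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = quantiles([t["latency_ms"] for t in traces], n=100, method="inclusive")
    errors_by_type = defaultdict(list)
    quality_by_step = defaultdict(list)
    for t in traces:
        errors_by_type[t["input_type"]].append(t["error"])
        for step, score in t["step_quality"].items():
            quality_by_step[step].append(score)
    return {
        "n_traces": len(traces),
        "cost_per_trace_usd": mean(t["cost_usd"] for t in traces),
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "error_rate_by_input_type": {k: sum(v) / len(v) for k, v in errors_by_type.items()},
        "quality_by_step": {k: mean(v) for k, v in quality_by_step.items()},
    }

# Hypothetical traces; in practice these come from production logging.
traces = [
    {"input_type": "question", "cost_usd": 0.01, "latency_ms": 800, "error": False,
     "step_quality": {"retrieve": 0.9, "generate": 0.8}},
    {"input_type": "question", "cost_usd": 0.03, "latency_ms": 1000, "error": False,
     "step_quality": {"retrieve": 0.8, "generate": 0.9}},
    {"input_type": "upload", "cost_usd": 0.02, "latency_ms": 1200, "error": True,
     "step_quality": {"retrieve": 0.4, "generate": 0.3}},
    {"input_type": "upload", "cost_usd": 0.04, "latency_ms": 3000, "error": False,
     "step_quality": {"retrieve": 0.7, "generate": 0.6}},
]

baseline = build_baseline(traces)
print(json.dumps(baseline, indent=2))
```

Saving that JSON next to a date and a commit hash is enough to answer "better than what?" three months later.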

Evaluate the system, not the step. When you swap a model or change a prompt, don’t just test that step in isolation. Run the whole pipeline. The dependencies between steps are where whack-a-mole lives.

Refresh your evaluation data. Your test set should be a living sample of what your system actually sees, not a museum of what it used to see. Pull from production traces. Include the weird inputs, the edge cases, the failures. If your evaluation set hasn’t changed in a month, it’s already lying to you.
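One way to keep the set alive is to resample it from recent production traces, stratified so that rare input types and failures are not drowned out by the common case. This is a sketch under assumptions: the trace fields (id, input_type, failed) and the per-group quota are hypothetical.

```python
import random
from collections import defaultdict

def refresh_eval_set(traces, per_group=2, seed=0):
    """Sample traces stratified by (input_type, failed) so rare and failing inputs survive."""
    rng = random.Random(seed)  # fixed seed keeps a given refresh reproducible
    groups = defaultdict(list)
    for t in traces:
        groups[(t["input_type"], t["failed"])].append(t)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        # Take up to per_group examples from every group, however small it is.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Hypothetical recent traces: mostly ordinary questions, one failure, one odd input type.
recent = [{"id": i, "input_type": "question", "failed": False} for i in range(10)]
recent.append({"id": 10, "input_type": "question", "failed": True})
recent.append({"id": 11, "input_type": "table", "failed": False})

eval_set = refresh_eval_set(recent)
print([t["id"] for t in eval_set])
```

A uniform random sample of these twelve traces would usually miss both the failure and the table input; the stratified version always keeps them.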

Measure before and after every change. Almost nobody does this. Run your evaluation suite, make the change, run it again. If you can’t show a net improvement across the metrics that matter, the change isn’t an improvement. It’s a lateral move.
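The before/after check can be mechanical. The sketch below compares two metric snapshots and labels each metric as improved, flat, or regressed. The metric names, the 2% noise threshold, and the rule that "net improvement" means at least one improvement and no regressions are assumptions chosen for illustration, not a standard.

```python
# +1 means higher is better, -1 means lower is better (hypothetical metric names).
METRICS = {"quality": +1, "cost_per_trace_usd": -1, "latency_p95_ms": -1, "error_rate": -1}

def compare(before, after, min_rel_change=0.02):
    """Label each metric by its relative change; assumes nonzero 'before' values."""
    verdicts = {}
    for name, direction in METRICS.items():
        signed = direction * (after[name] - before[name]) / before[name]
        if signed > min_rel_change:
            verdicts[name] = "improved"
        elif signed < -min_rel_change:
            verdicts[name] = "regressed"
        else:
            verdicts[name] = "flat"
    return verdicts

def is_net_improvement(verdicts):
    """Assumed rule: something got better and nothing got worse."""
    values = list(verdicts.values())
    return "improved" in values and "regressed" not in values

# Hypothetical snapshots from running the same evaluation suite twice.
before = {"quality": 0.80, "cost_per_trace_usd": 0.040, "latency_p95_ms": 2000, "error_rate": 0.05}
after = {"quality": 0.79, "cost_per_trace_usd": 0.030, "latency_p95_ms": 2400, "error_rate": 0.05}

verdicts = compare(before, after)
print(verdicts)
print("net improvement" if is_net_improvement(verdicts) else "lateral move")
```

Here cost improves, latency regresses, and the other two metrics stay within noise, so the change is reported as a lateral move rather than a win.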

Track the aggregate. Individual step metrics are useful but insufficient. You need a system-level view: cost per successful outcome, end-to-end quality, cumulative error rates. If step metrics go up but system metrics stay flat, you’re playing whack-a-mole with extra steps.
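Two of those system-level numbers are easy to sketch. The example below computes cost per successful outcome from traces and an end-to-end success rate from per-step rates. The field names and figures are hypothetical, and multiplying step rates assumes step failures are independent, which is a simplification.

```python
from math import prod

def cost_per_successful_outcome(traces):
    """Total spend divided by the number of traces that actually succeeded."""
    total_cost = sum(t["cost_usd"] for t in traces)
    successes = sum(1 for t in traces if t["success"])
    return total_cost / successes if successes else float("inf")

def end_to_end_success_rate(step_success_rates):
    """Chance that every step succeeds, assuming independent step failures."""
    return prod(step_success_rates)

# Hypothetical runs: the "optimized" pipeline is cheaper per trace but fails more often.
before = [{"cost_usd": 0.04, "success": i < 9} for i in range(10)]
after = [{"cost_usd": 0.03, "success": i < 6} for i in range(10)]

before_cps = cost_per_successful_outcome(before)
after_cps = cost_per_successful_outcome(after)
print(f"cost per successful outcome: {before_cps:.4f} -> {after_cps:.4f}")

# Three steps that each look healthy in isolation still compound.
pipeline_rate = end_to_end_success_rate([0.98, 0.95, 0.97])
print(f"end-to-end success rate: {pipeline_rate:.3f}")
```

In this made-up case the per-trace cost drops by a quarter while the cost of each successful outcome goes up, which is the step-metric-up, system-metric-flat pattern in miniature.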

The uncomfortable truth

The teams I’ve seen break out of the whack-a-mole cycle aren’t the ones with the best engineers or the most sophisticated models. They’re the ones that invested in measurement before they invested in optimization.

That’s a hard sell. When leadership wants costs down by next quarter, the natural response is to start cutting. Swap a model, shorten a prompt, add a cache. Each change feels productive. The dashboard numbers move. But “productive” and “effective” are different things when you have no infrastructure to tell them apart.

The question worth asking isn’t “what should we optimize next?” It’s “do we have any way to know if our last optimization actually worked?”

If the answer is no, that’s the first thing to fix.

Pawel Jozefiak

Feb 27

Lived this exact loop. Switched my AI agent to Opus for everything because "it's the best model." Costs hit $200/week. Moved to Haiku for simple tasks - costs dropped 59% but quality tanked on complex reasoning. The whack-a-mole.

What finally worked: a tiered approach. Haiku for log parsing and lookups, Sonnet for most work, Opus only for synthesis. Not one model to rule them all, but the right model for each job.

Documented the math and what broke along the way: https://thoughts.jock.pl/p/claude-model-optimization-opus-haiku-ai-agent-costs-2026

Your baseline point is spot on - I wish I'd measured before I started swapping.


The Jetty Blog: Ground Truth
