The Jagged Frontier Is an Evaluation Problem
Why your AI system breaks in ways your evals won't catch
I was listening to The Economist’s Boss Class podcast a few months ago when I heard Ethan Mollick describe something I’d seen a dozen times but never had a name for. He called it the jagged frontier: the idea that AI capabilities don’t advance in a smooth, predictable line. A system that can write a compelling 2,000-word analysis of a balance sheet might fail to add up a column of five numbers. Not because it’s generally capable or generally incapable. Because its competence boundary is jagged. Full of unexpected peaks and cliffs.
I immediately thought of a team I’d worked with that built a financial analysis tool. Impressive thing — it could compare revenue multiples across a peer group, flag accounting anomalies, summarize MD&A sections in plain English. Genuinely useful. Then someone asked it to add up the quarterly earnings figures it had just pulled. It got the wrong answer. Not by a little. By a lot. The model had no idea why this was embarrassing.
Smooth vs. jagged
When humans have expertise, that expertise tends to be correlated. An accountant who’s excellent at financial analysis is probably decent at arithmetic. Accounting does require arithmetic, but the deeper reason is that the skills develop together, from the same training, in the same career. Competence in one area is a reliable signal of competence in adjacent areas.
AI systems don’t work this way. They have no such correlation. A model trained on vast amounts of financial text learns comparative analysis from millions of examples of comparative analysis. Whether it can add numbers depends entirely on whether the training process developed that particular capability — which is a separate question, running on a separate axis.
This is counterintuitive in a specific, dangerous way. When you hire a financial analyst and they produce excellent comparative analysis, you trust their arithmetic. That trust is almost always warranted. When you deploy an AI system that produces excellent comparative analysis, you might make the same inference. That inference is not warranted. And you won’t know it isn’t warranted until something goes wrong.
The BCG study
In 2023, a team of researchers from Harvard Business School, Wharton, MIT, and Warwick, Ethan Mollick among them, ran an experiment with 758 BCG consultants. Pre-registered, large sample: the kind of study that HN usually dismisses and couldn’t in this case. Consultants were assigned tasks either inside or outside AI’s frontier.
Inside the frontier: AI-assisted consultants produced work that was about 40% higher quality. They finished 25% faster. The productivity gains were real and substantial.
Outside the frontier, on a task that looked superficially similar but fell outside what the AI could actually do well, a different story. AI-assisted consultants were 19 percentage points less likely to reach a correct answer than their unassisted counterparts. Despite working faster. The tool was confidently wrong in ways the consultants didn’t catch, because the outputs looked like outputs that should be right.
That’s the jagged frontier in a controlled experiment. The productivity gain and the quality loss exist simultaneously, for different tasks, in the same system.
Three zones, not two
The intuitive mental model for AI deployment is binary: tasks where AI helps, tasks where it doesn’t. Deploy where it helps, don’t deploy where it doesn’t. But the BCG study points to a third zone that’s more dangerous than either. Not “AI outperforms humans” and not “humans outperform AI,” but tasks nobody thought to measure, where AI fails in ways nobody anticipated. Tasks that feel similar to the ones AI is good at. Tasks the team didn’t benchmark because they assumed they were covered. Tasks that only surface as failures once the system is in production.
This is the zone that catches teams off guard. Not the obvious failures. The non-obvious ones — the tasks that sit adjacent to your evals, just outside the mapped territory, invisible until a user finds them.
I think about the car wash story that circulated a while back. Someone asked an AI assistant to navigate them to the nearest car wash, and the model gave walking directions. Fifty meters. The model couldn’t reason about the context — that someone asking about a car wash is almost certainly in a car, and nobody walks their car to a car wash. To the model, it was a navigation question. It answered the navigation question. It was completely wrong about what the user actually needed.
It’s funny as an anecdote. It’s less funny when your customer support agent, your document processor, or your financial analysis tool does the same thing to real users.
Why this is an evaluation problem
Here’s the structural issue. The third zone, the unmapped dangerous one, is invisible to standard evaluation because you can’t write a test for a failure you haven’t thought of.
Most teams build their eval sets before deployment. You think about what the system should do, you write test cases, you run them against the system, you measure quality. What you’re measuring is performance on the tasks you thought of. The jagged frontier bites you on the tasks you didn’t think of.
This is another reason why static gold datasets fail — I’ve written about this before. Gold datasets can only cover the frontier as it existed when you built them. But the frontier shifts with every model update. A task that was inside your capability boundary with GPT-4 might be outside it with GPT-4o. Or vice versa. Your static benchmarks tell you nothing about the new cliffs.
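To make the "new cliffs" point concrete, here is a minimal sketch of a per-task regression check between two model versions. Everything in it is hypothetical: the task names, the pass rates, the `find_regressions` helper, and the 0.05 tolerance are illustrative assumptions, not numbers from any real system. The data is contrived so that the aggregate score doesn't move at all while one task falls off a cliff.

```python
from typing import Dict, List

# Hypothetical per-task pass rates from running the same eval suite
# against two model versions. Names and numbers are made up.
baseline = {
    "peer_comparison": 0.90,
    "anomaly_flagging": 0.88,
    "mdna_summary": 0.85,
    "ratio_explanation": 0.87,
    "column_sum": 0.90,
}
candidate = {
    "peer_comparison": 0.97,
    "anomaly_flagging": 0.95,
    "mdna_summary": 0.93,
    "ratio_explanation": 0.94,
    "column_sum": 0.61,
}


def mean_rate(rates: Dict[str, float]) -> float:
    """Unweighted average pass rate across tasks (the 'aggregate' number)."""
    return sum(rates.values()) / len(rates)


def find_regressions(
    before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.05
) -> List[str]:
    """Return tasks whose pass rate dropped by more than `tolerance`.

    A task missing from `after` is also flagged: unmeasured is not passing.
    """
    flagged = []
    for task, old_rate in before.items():
        new_rate = after.get(task)
        if new_rate is None or old_rate - new_rate > tolerance:
            flagged.append(task)
    return sorted(flagged)


print(f"aggregate: {mean_rate(baseline):.2f} -> {mean_rate(candidate):.2f}")
print("regressed tasks:", find_regressions(baseline, candidate))
```

The design choice that matters is comparing task by task rather than comparing averages. Four tasks improving can fully mask one task collapsing.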
And it gets worse with long-running workflows. Atomic tasks are easier to evaluate — you get an output, you check it. Multi-step workflows can look fine at every intermediate stage and fail at the aggregate. The jagged frontier can bite you at step three of a six-step process, and you won’t see it in your step-level metrics. You’ll see it in your support tickets.
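A quick back-of-the-envelope sketch shows why step-level metrics can look healthy while the workflow as a whole doesn't. The pass rates below are hypothetical, and the calculation assumes step failures are independent and that a run only succeeds if every step succeeds. Real workflows may violate both assumptions, so treat this as intuition rather than a model of any particular system.

```python
import math
from typing import Sequence

# Hypothetical per-step pass rates for a six-step workflow.
step_pass_rates = [0.97, 0.96, 0.95, 0.97, 0.96, 0.95]


def end_to_end_rate(step_rates: Sequence[float]) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_rates)


worst_step = min(step_pass_rates)
overall = end_to_end_rate(step_pass_rates)

print(f"worst single step: {worst_step:.0%}")
print(f"end-to-end:        {overall:.0%}")
```

With these made-up numbers, no single step is below 95%, yet only about 78% of runs would complete correctly. A dashboard of step-level metrics would show six green lights.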
Example: call centers
Call centers have been dealing with a version of this problem for decades. Not AI, but agents — human ones — whose competence is inconsistent and whose performance needs to be measured task by task, not just on average.
Good call center operations don’t measure “are our agents generally good.” They measure resolution rate per call type, handle time per call type, customer satisfaction per call type, escalation rate per call type. Per agent. Per shift. Per product line. They know, with specificity, which agents handle billing disputes well and which ones need support for technical escalations. The evaluation infrastructure is granular enough to find the edges.
This is what good AI evaluation looks like. Not “our system is generally performing at 87% quality” — that number hides the jagged frontier. You need quality broken down by task type, by input category, by workflow stage. You need to know where the peaks are and where the cliffs are. And you need to update that map every time the system changes, because the frontier shifts.
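Here is a minimal sketch of that breakdown, assuming you already have eval records tagged with a task type. The task names and counts are invented for illustration, chosen so the overall number lands on 87% while one slice is mostly failing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical eval records as (task_type, passed) pairs.
records = (
    [("summarize_mdna", True)] * 48
    + [("summarize_mdna", False)] * 2
    + [("compare_multiples", True)] * 37
    + [("compare_multiples", False)] * 3
    + [("sum_quarterly_figures", True)] * 2
    + [("sum_quarterly_figures", False)] * 8
)


def pass_rate_by_task(rows: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group eval results by task type and compute each group's pass rate."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for task_type, passed in rows:
        totals[task_type] += 1
        passes[task_type] += int(passed)
    return {task: passes[task] / totals[task] for task in totals}


overall = sum(passed for _, passed in records) / len(records)
by_task = pass_rate_by_task(records)

print(f"overall: {overall:.0%}")
# Sort worst-first so the cliffs are at the top of the report.
for task, rate in sorted(by_task.items(), key=lambda kv: kv[1]):
    print(f"  {task:<24} {rate:.0%}")
```

The same grouping extends to input category or workflow stage by adding fields to the record and to the grouping key.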
The challenge is that most teams don’t have this infrastructure. They have aggregate metrics. Maybe a few spot checks. An eval set that hasn’t been refreshed in three months. When a failure surfaces, they discover it the way call centers discovered problems before they had analytics — through a supervisor listening in on a bad call, or through a customer who complained loudly enough.
What grounded evaluation requires
Mollick uses the phrase “grounded quality definitions” in his work on management as an AI superpower, and it’s worth unpacking. A grounded definition ties quality to real outcomes, not to the output’s surface characteristics. Not “did the response sound confident” or “was the response coherent” but “did the user accomplish what they were trying to accomplish” or “was the calculation correct.”
This turns out to be harder than it sounds, and it’s where most eval infrastructure falls down. It’s easy to build evals that measure proxy metrics — coherence, length, similarity to a reference answer. It’s harder to build evals that measure whether the system is actually doing the right thing.
Two requirements that I keep coming back to. Quality has to be grounded: tied to real outcomes, not proxy signals. And task completion has to be verifiable: you can actually check whether it happened, independent of the system’s confidence. A financial analysis is verifiable if you can check the arithmetic. A navigation response is verifiable if you can test whether the route makes sense for someone in a car. A customer support response is verifiable if you can check whether the underlying issue got resolved.
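For the arithmetic case, a verifiable check can be very small. The sketch below recomputes a total independently and compares it to whatever the system reported, ignoring how confident the response sounded. The function name `verify_reported_total`, the sample figures, and the one-cent tolerance are all assumptions made for illustration.

```python
from decimal import Decimal
from typing import Sequence


def verify_reported_total(
    figures: Sequence[str], reported_total: str, tolerance: str = "0.01"
) -> bool:
    """Recompute the sum independently and compare it to the system's claim.

    Values are passed as strings and parsed as Decimal to avoid
    binary floating-point rounding in the check itself.
    """
    expected = sum((Decimal(f) for f in figures), Decimal("0"))
    return abs(expected - Decimal(reported_total)) <= Decimal(tolerance)


# Hypothetical quarterly figures the system extracted, and two possible answers.
quarterly_eps = ["1.12", "0.98", "1.31", "1.07"]

print(verify_reported_total(quarterly_eps, "4.48"))  # the correct total
print(verify_reported_total(quarterly_eps, "5.20"))  # a confident wrong answer
```

The check knows nothing about the model. It only needs the inputs and the claimed output, which is what makes the task verifiable.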
Where these two conditions hold, you can build real evaluation infrastructure. Where they don’t — where quality is subjective or completion is hard to verify — you’re flying blind, and the jagged frontier will find the edges of your understanding before your evals do.
The frontier shifts
One more thing that makes this hard: the frontier isn’t static.
Every model update moves it. Some capabilities improve. Some degrade. Some new failure modes appear that didn’t exist before. The BCG study was run at a specific point in time with specific models. The specific numbers (40% higher quality inside the frontier, a 19-point drop in correct answers outside it) are already dated. The structural insight, that AI performance is jagged, with benefits and risks that aren’t correlated, will remain true for the foreseeable future, even as the specific contours of the frontier change.
This is why evaluation can’t be a one-time activity. You can’t benchmark your system in January, deploy it, and assume the frontier stays put. It doesn’t. Model updates are the obvious trigger, but user behavior drifts too — the distribution of inputs your system sees in month six is different from what it saw in month one. The task mix shifts. New use cases get discovered. What was inside the frontier may now be outside it, and vice versa.
With Jetty, we’re tackling this with agent runbooks, though we’re not the only ones trying to build agentic hill-climbing as a service.
The “fix everything and measure once” assumption is exactly the wrong model. The right model is continuous: define quality, measure it, improve it, measure again. A living process, not a checkpoint.
The question worth sitting with: if the jagged frontier is real, and measurable, and shifts with every model update — what would it actually take to map it before your users discover the cliffs for you?
I don’t think the answer is more dashboards. Dashboards show you aggregate metrics on the tasks you’re already measuring. The dangerous zone is the unmapped territory. Finding it requires deliberately probing the edges — constructing evals for tasks you think are covered but haven’t tested, running them when models update, building the granular per-task quality infrastructure that call centers take for granted.
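One cheap way to start probing is to compare what production traffic actually contains against what your eval set covers. The sketch below assumes each production request has already been labeled with a task type, by manual tagging or a classifier, which is itself a nontrivial assumption. The labels, the sample data, and the `uncovered_task_types` helper are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical task-type labels for recent production requests.
production_requests = [
    "summarize_mdna",
    "compare_multiples",
    "sum_quarterly_figures",
    "summarize_mdna",
    "sum_quarterly_figures",
    "currency_conversion",
    "compare_multiples",
    "sum_quarterly_figures",
]

# Hypothetical set of task types that the current eval suite exercises.
eval_task_types = {"summarize_mdna", "compare_multiples"}


def uncovered_task_types(
    production: Iterable[str], covered: Set[str]
) -> List[Tuple[str, int]]:
    """Return (task_type, request_count) for tasks seen in production
    but absent from the eval suite, most frequent first."""
    counts = Counter(production)
    gaps = [(task, n) for task, n in counts.items() if task not in covered]
    return sorted(gaps, key=lambda item: (-item[1], item[0]))


for task, n in uncovered_task_types(production_requests, eval_task_types):
    print(f"no eval coverage: {task} ({n} requests)")
```

This only surfaces gaps you can label, so it won't find every cliff. It does turn "tasks we assumed were covered" into a list you can rerun whenever the model or the traffic changes.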
It’s boring work. It’s not a new feature. Nobody’s going to celebrate the eval suite you built. But it’s the difference between knowing your system’s frontier and hoping you’ve been lucky about where the cliffs are.


