AI Breaks Assumptions Behind Software Testing
We need new tools to fix agentic systems.
Automated software testing let teams deliver complex systems incrementally and at scale. That achievement rests on two assumptions that no longer hold with AI agents:
Keep code and data separate. Manage them independently.
Fix everything and measure it once. Test at a point in time and trust the results.
AI systems break both. We need a new playbook for reliably delivering the next generation of applications.
We steer the output of AI systems by commingling examples and instructions with engineered context: RAG pipelines, prompt templates, post-trained models. An AI agent can perform very differently depending on what information it sees and in what format. Measuring it divorced from production data gives you a proxy, not a result.
And unit tests were built for a deterministic world where you controlled the entire environment. Today, most AI systems call out to third-party models that can change behavior without notice. Failure is no longer binary. You need to evaluate a distribution of outcomes: confusion matrices, ROC curves, calibration checks. Not just pass/fail.
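To make "a distribution of outcomes" concrete, here is a minimal Python sketch of two of the measurements named above: a confusion matrix with precision and recall, and a simple binned calibration check. The function names, the binary-label setup, and the five-bin default are assumptions made for illustration, not a prescribed method.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels."""
    counts = Counter(zip(labels, predictions))  # keys are (actual, predicted)
    return {
        "tp": counts[(True, True)],
        "fp": counts[(False, True)],
        "fn": counts[(True, False)],
        "tn": counts[(False, False)],
    }


def precision_recall(c):
    """Precision and recall from confusion counts (0.0 when undefined)."""
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy.

    Confidences are assumed to lie in [0, 1] and are grouped into
    equal-width bins; empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Made-up labels and predictions, purely to show the call shape.
    counts = confusion_counts(
        [True, True, False, False, True],
        [True, False, False, True, True],
    )
    print(counts, precision_recall(counts))
    print(expected_calibration_error([0.9, 0.9, 0.1, 0.1], [True, False, False, False]))
```

A single pass/fail assertion would hide the difference between a system that is wrong and unsure and one that is wrong and confident; the calibration gap is one way to surface it.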
The evaluation gap
A friend’s startup lost 5% of its customers after a single bad AI deployment. They switched LLM providers – lured by a cheaper, faster model and a strong public leaderboard ranking – only to discover that tool-calling APIs behave differently between providers. What worked on one model broke on another. A colleague lost three months of work because a foundation model had been trained on an upstream dataset whose license had expired. I watched data science teams speak with conviction about the elegance of their algorithms, only to see their work dismantled by a single edge case from a domain expert.
I founded Jetty because I kept watching this gap between lab measurement and production reality destroy real value.
The agile parallel
I started my career when the agile movement, unit testing, and continuous delivery were upending “big bang” releases and waterfall development. Watching AI mature, I saw the same pattern repeating: AI was entering its own continuous delivery moment, and it had none of the infrastructure to support it.
SaaS took off when thousands of engineers could ship code to production every day. The key unlock was automated testing. Fix the code, fix the inputs, verify the outputs. It worked because the system was deterministic and the environment was under your control.
But what does automated testing look like when your data and your code are the same thing? What does it mean to test a system when the foundation model powering it is also changing underneath you?
Everything is moving at once
Classic software testing fixes code and inputs, then checks outputs. Data science teams do something similar: fix whole datasets and run them against fixed models. Both approaches assume you can hold things still long enough to measure them.
In production AI systems, nothing holds still. The data keeps changing. The models keep changing. The retrieval systems keep changing. The code is also incrementally changing. You can’t fix everything and measure it once.
Bounding risk
This might feel uncomfortable. If the system isn’t deterministic, how can you trust it?
Consider a speech-to-text scribe deployed for a healthcare provider. It works great on Parisian French but fails on Quebecois French. A patient says “la fin de semaine” instead of “le weekend,” or “un courriel” instead of “un mail,” and the system misinterprets critical details. A one-time evaluation built on Parisian French data would never catch this. An iterative evaluation system, one that continuously collects data from the populations actually being served, would catch these failures and fold them into benchmarks for future improvement.
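The scribe example comes down to reporting results per population rather than as one aggregate number. A minimal Python sketch, assuming each evaluated transcript has already been tagged with a slice name and a correct/incorrect judgment (the record shape, the locale tags, and the 0.9 threshold are all hypothetical, and the data is made up for illustration):

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return accuracy per slice from records like {"slice": str, "correct": bool}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["slice"]] += 1
        if record["correct"]:
            hits[record["slice"]] += 1
    return {name: hits[name] / totals[name] for name in totals}


def failing_slices(records, min_accuracy):
    """Names of slices whose accuracy falls below min_accuracy, sorted."""
    scores = accuracy_by_slice(records)
    return sorted(name for name, acc in scores.items() if acc < min_accuracy)


if __name__ == "__main__":
    # Illustrative, invented results: not real measurements.
    records = (
        [{"slice": "fr-FR", "correct": True}] * 4
        + [{"slice": "fr-CA", "correct": True}] * 1
        + [{"slice": "fr-CA", "correct": False}] * 3
    )
    overall = sum(r["correct"] for r in records) / len(records)
    print("overall:", overall)
    print("by slice:", accuracy_by_slice(records))
    print("below 0.9:", failing_slices(records, min_accuracy=0.9))
```

The aggregate score blends the two populations together; only the per-slice view shows which group the system is failing.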
All risk can be bounded. We trust risk-laden systems every day. We drive cars knowing catastrophic outcomes are real, but we’ve built layered infrastructure (speed limits, seatbelts, crumple zones, insurance) that makes the risk manageable.
AI systems need equivalent layers. Some live in production: observability, telemetry, evaluation datasets wired into CI/CD pipelines. Others run safely outside it: benchmarks, trace labeling, synthetic dataset generation. The key is closing the loop: safely and privately feeding development teams the signals they need to understand how their system actually performs in the field.
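As one illustration of an evaluation dataset wired into a CI/CD pipeline, the Python sketch below compares a candidate's metrics against a stored baseline and blocks the release when any metric drops by more than a tolerance. The metric names, the 0.02 tolerance, and the assumption that higher is better for every metric are hypothetical choices for the example.

```python
def regression_report(baseline, candidate, max_drop=0.02):
    """Return {metric: (baseline_value, candidate_value)} for regressions.

    A metric regresses if it is missing from the candidate or falls more
    than max_drop below the baseline. Assumes higher values are better.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None or base_value - new_value > max_drop:
            regressions[metric] = (base_value, new_value)
    return regressions


def gate_passes(baseline, candidate, max_drop=0.02):
    """True when the candidate shows no regression beyond the tolerance."""
    return not regression_report(baseline, candidate, max_drop)


if __name__ == "__main__":
    # Invented numbers standing in for results from an evaluation run.
    baseline = {"accuracy": 0.90, "tool_call_success": 0.80}
    candidate = {"accuracy": 0.91, "tool_call_success": 0.70}
    report = regression_report(baseline, candidate)
    print(report)
    if report:
        raise SystemExit("evaluation gate failed")  # non-zero exit fails the CI job
```

Rerunning the same gate on a schedule, not only on code changes, is one way to notice when a third-party model has shifted underneath an unchanged codebase.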
The agile movement took a decade to go from manifesto to mainstream practice. AI systems are being deployed to millions of people right now. I don’t think we have ten years to close the AI evaluation gap.

