Foundation Models Ship Like Windows 98
AI brought Big Bang releases back to production
Early in my career, I watched coworkers dread Oracle upgrades the way you’d dread a root canal. Weeks before a release, the mood would shift. Development froze. Test plans ran to a hundred pages. Someone would pick a Friday night, and the whole team would brace for impact. If the upgrade failed, Saturday was a rollback. If it succeeded, Monday was triage.
We called these Big Bang releases. Months of change compressed into a single explosive moment, with everyone hoping the blast didn’t take production down with it.
The pressure was the part that stayed with me. As release dates approached, you could feel it building across the team. Not excitement. Dread. The kind of tension that comes from knowing you’ve accumulated so much change that nobody can predict what will happen when you flip the switch.
The software industry spent the next fifteen years dismantling this approach. Continuous integration. Automated testing. Feature flags. Deploy-on-merge.
The early adopters made the rest of us look like we were standing still. In 2009, Flickr’s John Allspaw and Paul Hammond got on stage at Velocity and announced they were deploying to production ten or more times a day. That talk kicked off the DevOps movement. Etsy took it further — new engineers shipped code to production on their first day, and by 2014 they were pushing 80+ deploys daily with 150 engineers. Amazon hit a deploy every 11.6 seconds by 2011. Netflix open-sourced Spinnaker and built a deployment culture so confident they’d randomly kill production services just to prove they could recover.
By 2015, a team of two thousand engineers could push hundreds of commits a day, each one tested, deployed, and monitored independently. The Big Bang was dead.
Or so I thought.
Shrink-wrap AI
Watch how foundation model providers ship. OpenAI releases GPT-4, then GPT-4 Turbo, then GPT-4o. Anthropic releases Claude 3, then 3.5, then 4. Each version is a major event with a blog post, a benchmark table, and a wave of developers scrambling to figure out what changed.
This is shrink-wrap software. It’s the same mentality that gave us Oracle 8i and Windows ME: build it behind closed doors, stamp a version number on it, and push it out the door. The release itself is the artifact. Everything that happened between versions is invisible to the people who depend on the product.
For the model providers, this might be unavoidable. Training runs are expensive, and you can’t exactly deploy a half-trained model to production. But for every team building on top of these models, inheriting the shrink-wrap mentality is a choice. And it’s the wrong one.
The second Big Bang
Here’s what I see in practice. A team builds an AI feature against GPT-4. They test it, tune the prompts, get the outputs to an acceptable quality. Ship it. Months pass. OpenAI deprecates the model version. Or a new model comes out that’s cheaper and supposedly better.
Now the team faces a familiar decision: migrate everything at once or fall behind. They pick a weekend. They swap the model. They rerun their evaluation suite, which is a gold dataset they assembled six months ago, and the numbers look fine.
Monday morning, support tickets start rolling in. The new model handles tool calls differently. It’s more verbose in some cases, terser in others. Edge cases that the old model handled gracefully now produce hallucinations. The evaluation suite didn’t catch any of this because it was testing against a frozen snapshot of reality, not reality itself.
This is the Big Bang release, reborn. Different technology, same failure mode.
What CI actually solved
The insight behind continuous integration wasn’t just “deploy more often.” It was that small, incremental changes are fundamentally easier to reason about than large ones.
When you deploy a single commit, you know exactly what changed. If something breaks, you know where to look. When you deploy six months of accumulated changes in one shot, you’re debugging a combinatorial explosion. Every change interacts with every other change, and the failure could be anywhere in the stack.
CI worked because it made each change small enough to understand, test, and roll back independently. It didn’t eliminate risk. It made risk manageable.
AI needs the same loop
The parallel to AI systems is direct, but the loop looks different.
In traditional CI, the cycle is: write code, run tests, deploy, monitor, repeat. The code changes. The tests and infrastructure stay relatively stable.
In AI systems, everything moves. The model changes when providers ship updates. The data changes as users interact with your system. The prompts change as your team iterates. The retrieval corpus changes as new documents get indexed. You can’t hold any of it still long enough to test it the way you’d test a REST endpoint.
So what does CI for AI actually look like?
Grab production data. Not a curated test set from six months ago. Actual inputs and outputs from your running system. This is the raw signal that tells you how your system behaves in the real world.
Sanitize and transform. Strip PII, anonymize where needed, handle compliance requirements. This is non-trivial but it’s infrastructure, not a blocker. Teams that treat privacy as a reason to avoid production data entirely end up flying blind.
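As a rough illustration of the sanitize step, here is a minimal sketch that masks emails and phone numbers and swaps user ids for stable tokens. The record shape (`user`, `input`, `output`), the two regexes, and the salt handling are all assumptions made for this example; real PII detection needs far broader coverage than two patterns.

```python
import hashlib
import re

# Illustrative patterns only (assumed for this sketch); they will miss
# plenty of real-world PII such as names, addresses, and account numbers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def pseudonymize(user_id: str, salt: str) -> str:
    """Map a user id to a stable token that does not expose the raw id."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def scrub(text: str) -> str:
    """Mask emails first, then phone-like digit runs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def sanitize_record(record: dict, salt: str) -> dict:
    """Return a sanitized copy of one logged interaction (hypothetical shape)."""
    return {
        "user": pseudonymize(record["user"], salt),
        "input": scrub(record["input"]),
        "output": scrub(record["output"]),
    }
```

Because the same user always maps to the same token, you can still group traffic per user downstream without carrying the raw identifier around.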
Label and categorize. Not everything needs human labeling. Cluster similar inputs. Flag anomalies. Use LLM-as-judge for initial quality assessments. Build a living picture of what your system handles well and where it struggles.
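One way to picture the labeling step: run every record through an automated judge, group the results by category, and queue only the low-scoring ones for a human. In this sketch `judge` is a hypothetical callable you would supply (for example, a wrapper around an LLM-as-judge prompt), and the verdict shape and the 0.5 threshold are assumptions, not a standard.

```python
from collections import defaultdict
from typing import Callable


def triage(records: list, judge: Callable[[dict], dict], review_below: float = 0.5):
    """Group judged records by category and collect the ones needing human review.

    `judge` is assumed to return {"category": str, "score": float in [0, 1]}.
    """
    by_category = defaultdict(list)
    needs_review = []
    for rec in records:
        verdict = judge(rec)
        labeled = {**rec, **verdict}  # keep the original fields next to the verdict
        by_category[verdict["category"]].append(labeled)
        if verdict["score"] < review_below:
            needs_review.append(labeled)
    return dict(by_category), needs_review
```

The category counts give you the “living picture” of what the system handles, and the review queue keeps human labeling focused on the cases the judge is least happy with.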
Benchmark against the next iteration. Before you swap a model, change a prompt, or update your retrieval pipeline, run the proposed change against your current production-derived evaluation set. Not a public leaderboard. Not an academic benchmark. Your data, your users, your edge cases.
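A benchmark run of this kind can be sketched as a side-by-side comparison over the same cases. Here `baseline` and `candidate` stand in for your current and proposed configurations (model plus prompt plus retrieval), and `score` is whatever quality metric you trust; all three are hypothetical callables, and the case shape (`id`, `input`) is assumed for illustration.

```python
from typing import Callable


def compare_configs(
    eval_set: list,
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    score: Callable[[dict, str], float],
) -> dict:
    """Score both configurations on every case and list per-case regressions."""
    if not eval_set:
        raise ValueError("eval_set is empty; nothing to compare")
    base_total = 0.0
    cand_total = 0.0
    regressions = []
    for case in eval_set:
        base_score = score(case, baseline(case["input"]))
        cand_score = score(case, candidate(case["input"]))
        base_total += base_score
        cand_total += cand_score
        if cand_score < base_score:
            regressions.append(case["id"])  # candidate did worse on this case
    n = len(eval_set)
    return {
        "baseline_mean": base_total / n,
        "candidate_mean": cand_total / n,
        "regressions": regressions,
    }
```

The regression list matters as much as the averages: a candidate can match the baseline on mean score while breaking exactly the edge cases your users care about.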
Deploy incrementally. Don’t swap everything at once. Canary the change. Route 5% of traffic to the new configuration. Compare quality metrics side by side. Expand or roll back based on evidence, not hope.
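A simple way to get a stable 5% slice is to hash a request or user key into a bucket, so the same key always lands on the same side of the split. This is a minimal sketch under that assumption; the function names and the `"baseline"`/`"candidate"` labels are made up for the example, and a real rollout would usually sit behind a feature-flag or traffic-management system.

```python
import hashlib


def in_canary(key: str, percent: float) -> bool:
    """Deterministically decide whether `key` falls in the canary slice."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto buckets 0..9999 (0.01% steps).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def pick_config(key: str, percent: float = 5.0) -> str:
    """Return which configuration should serve this key."""
    return "candidate" if in_canary(key, percent) else "baseline"
```

Hashing instead of calling a random number generator per request keeps each user on one configuration, which makes the side-by-side comparison cleaner, and expanding the rollout is just a matter of raising `percent`.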
Feed the results back. The outputs from this cycle become the inputs to the next one. New failure modes get added to the evaluation set. Successful optimizations become the new baseline. The loop tightens with every iteration.
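The feedback step can be as small as folding newly observed failures into the evaluation set without duplicating cases you already cover. The sketch below assumes each case is a dict with `input` and `expected` fields and treats inputs as duplicates when they match after trimming and lowercasing; both choices are illustrative, not prescriptive.

```python
def fold_in_failures(eval_set: list, failures: list) -> list:
    """Return a new eval set with previously unseen failures appended."""

    def normalize(text: str) -> str:
        return text.strip().lower()

    seen = {normalize(case["input"]) for case in eval_set}
    merged = list(eval_set)  # leave the caller's list untouched
    for failure in failures:
        key = normalize(failure["input"])
        if key in seen:
            continue  # already covered by an existing case
        seen.add(key)
        merged.append(
            {
                "input": failure["input"],
                "expected": failure["expected"],
                "source": "production",  # marks where the case came from
            }
        )
    return merged
```

Run after every rollout, this is what turns a one-off gold dataset into an evaluation set that keeps pace with what users actually send.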
This is the CI/CD loop applied to AI. It’s not a metaphor. It’s the same engineering discipline, adapted for a system where both the code and the data are moving targets.
Why teams resist this
The most common objection is that it’s too much infrastructure for the payoff. “We’re a small team. We just need to ship features.”
I heard the exact same argument against automated testing in 2008. Teams that skipped CI shipped faster for a quarter, then spent the next year debugging integration failures and botched releases. The teams that invested early moved slower at first but compounded their advantage over time.
The second objection is that AI is different. Models are black boxes. You can’t unit test a neural network the way you test a function.
This is true and also beside the point. You don’t need to unit test the model. You need to continuously evaluate the system: the model plus the prompts plus the retrieval plus the post-processing plus the guardrails. The system is what your users interact with. The system is what you can measure and improve.
The compounding advantage
Teams that close this loop gain something that’s hard to replicate: a continuously improving evaluation set derived from their actual production traffic. Every week, their benchmarks get more representative. Every iteration catches more edge cases. Every deployment carries less risk because the safety net is woven from real-world data, not synthetic test cases.
Teams that don’t close the loop are stuck in the ’90s. They test against stale datasets. They deploy model swaps as Big Bang releases. They discover problems from support tickets instead of automated analysis. Each deployment is a gamble, and the odds don’t improve with time.
The gap is closing
The tooling for AI CI/CD is maturing fast. Platforms like Langfuse give you the trace data. LLM-as-judge frameworks handle automated quality assessment. Prompt management tools support versioning and A/B testing. The pieces exist.
What’s missing for most teams isn’t tooling. It’s the mindset shift. The recognition that AI systems aren’t something you build, test, and ship. They’re something you continuously operate, measure, and improve. That’s what CI meant for software. It’s what CI needs to mean for AI.
The software industry took a decade to move from Big Bang releases to continuous delivery. AI systems are being deployed to millions of people right now. We don’t have a decade. But we do have the playbook. We’ve done this before.
The question is whether your team is running the 2025 version of the loop, or the 1998 one.



