Lights-Out Manufacturing Had a Brake Pedal
The lights are going off in software the same way they went off in manufacturing
An agent I was watching ran a six-step extraction pipeline. It made it to step three, hit a schema mismatch, and “recovered” by skipping to step five. Then it wrote a polished summary citing results from a step that never ran.
$54 in API spend. Three days for someone to figure out why the downstream dataset was corrupted. The trace was sitting in Langfuse the whole time.
I spent this week at Web Summit in Vancouver talking to engineering leaders driving AI adoption inside their companies. Every one of them had a version of this story — and the common thread was that the agent habits forming inside their dev environments are landing in production unchanged. There’s no staging environment for agent behavior. A skill that silently skips a step on a developer’s laptop silently skips it in production — at customer scale, with customer data.
This is the failure mode missing from the conversation about autonomous AI. Not “the model can’t do the task.” The model can do the task. The system around it doesn’t know when the task wasn’t done — and the same gap that costs $54 in development costs orders of magnitude more in production.
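To make that gap concrete, here is a minimal sketch of the check that was missing in the story above: a ledger of which steps actually produced a result, consulted before any summary is accepted. The names `StepLedger` and `verify_summary` are hypothetical, and the step names are placeholders; this is one way to express the idea under those assumptions, not a description of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class StepLedger:
    """Records which pipeline steps produced a result (hypothetical helper)."""
    required: list[str]
    completed: dict[str, object] = field(default_factory=dict)

    def record(self, step: str, result: object) -> None:
        if step not in self.required:
            raise ValueError(f"unknown step: {step}")
        self.completed[step] = result

    def missing(self) -> list[str]:
        return [s for s in self.required if s not in self.completed]


def verify_summary(ledger: StepLedger, cited_steps: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the run may be called done."""
    problems = [f"no recorded result for {s}" for s in ledger.missing()]
    problems += [
        f"summary cites {s}, which has no recorded result"
        for s in cited_steps
        if s not in ledger.completed
    ]
    return problems


# Replay of the failure: step3 hit a mismatch, step4 was skipped entirely.
ledger = StepLedger(required=[f"step{i}" for i in range(1, 7)])
for name in ("step1", "step2", "step5", "step6"):
    ledger.record(name, result="ok")

problems = verify_summary(ledger, cited_steps=["step4", "step6"])
for line in problems:
    print(line)
```

The point of the design is that the summary is never trusted on its own. It is compared against a record the agent cannot write to by accident.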
The lights are going off anyway
There’s a name for what the industry is building toward. The dark factory.
At the foot of Mount Fuji, FANUC’s Oshino plant produces fifty industrial robots per twenty-four-hour shift and runs unmanned for up to thirty days at a stretch. Gary Zywiol, the company’s vice president, has a quote about it that’s now part of the canon: “Not only is it lights-out, we turn off the air conditioning and heat too.” Siemens runs lights-out plants. Xiaomi opened an 81,000-square-meter unmanned smartphone facility in 2024 that produces ten million units a year with no humans on shift. The International Federation of Robotics put China at over two million factory robots in 2024 — fifty-four percent of global demand. A peer-reviewed survey in *Engineering Manufacture* this year called lights-out factories “the pinnacle of manufacturing advancement.”
The software industry is racing toward the same idea from the opposite direction. BCG Platinion has been writing about *The Dark Software Factory* — AI agents handling planning, coding, testing, deployment, with no human on the PR. Stripe is already operating at 1,300-plus AI-authored pull requests per week. The destination isn’t theoretical. The lights are going off.
Here’s what I keep coming back to about Oshino, Xiaomi, and every other plant where the lights are already off: they work because the operational half of the stack existed first.
What made lights-out actually work
Specified tolerances and statistical process control. QA loops you instrument before you turn off the lights. Shutdown criteria — what halts the line, what alarms, what gets quarantined. Calibration schedules nobody can override.
None of these are sexy. All of them existed before the humans left the floor. Olivier and Craig’s 2017 IEEE AFRICON paper *Lights-out process control — analysis and framework* puts it bluntly: every unmanned production line is wrapped in a deterministic decision layer that knows when to halt, alarm, retry, or quarantine. The robots aren’t the achievement. The discipline around the robots is.
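A deterministic decision layer of that kind can be very small. The sketch below is my own illustration, not the framework from the paper: the `Limits` fields, the thresholds, and the ordering of the rules are all assumptions chosen to show the shape of a gate that maps a measurement to pass, alarm, retry, quarantine, or halt.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PASS = "pass"
    ALARM = "alarm"            # in tolerance but drifting: accept and flag
    RETRY = "retry"
    QUARANTINE = "quarantine"  # set the output aside, keep the line running
    HALT = "halt"              # stop the line


@dataclass(frozen=True)
class Limits:
    """Hypothetical tolerances for one measured property."""
    target: float
    warn: float        # deviation that triggers an alarm
    tolerance: float   # deviation beyond which the output is rejected
    max_attempts: int  # tries allowed before quarantine
    halt_after: int    # consecutive rejects that stop the line


def decide(limits: Limits, measurement: float, attempt: int, prior_rejects: int) -> Action:
    """Same inputs always give the same action; no judgment calls at runtime."""
    deviation = abs(measurement - limits.target)
    if deviation <= limits.warn:
        return Action.PASS
    if deviation <= limits.tolerance:
        return Action.ALARM
    if prior_rejects + 1 >= limits.halt_after:
        return Action.HALT
    if attempt < limits.max_attempts:
        return Action.RETRY
    return Action.QUARANTINE


limits = Limits(target=10.0, warn=0.1, tolerance=0.3, max_attempts=2, halt_after=3)
print(decide(limits, measurement=10.5, attempt=1, prior_rejects=0))
```

For an agent, the "measurement" might be an eval score or a schema-validation result rather than a dimension in millimeters. The structure stays the same.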
The manufacturing industry spent forty years building that layer. There’s another IEEE paper from 2017, *Industrial robotics in factory automation: from the early stage to the Internet of Things*, that traces the arc from the first numerically controlled machines through the SCADA era and on into the modern smart factory. The throughline is the same in every chapter. The operational layer always shows up before the lights go off. Not after.
AI is doing it backwards. The agents are loose in production. The operational primitives — observability that surfaces patterns instead of just collecting data, durable orchestration that survives a crash on step seventeen, end-to-end evals that catch regressions before customers do, output verification that knows what “done” looks like — are still being hand-built by every team independently.
Dark factories without observability and evaluation are motion without progress.
What’s actually missing
In May, Tobi Coker at Felicis published *The AI Stack Is Half-Built*, a survey of twenty-three AI-native engineering leaders that puts numbers on this. It’s worth reading in full.
The headline that’s stayed with me: 69.6% of teams doubled their inference spend in six months. 34.8% now report inference running at five times their training cost. And 56.5% of those teams are managing the fastest-growing line item in modern software with home-built spreadsheets and custom dashboards. The infrastructure category that should exist doesn’t — yet.
47.8% of the surveyed teams are running autonomous agents in production. One respondent named the missing primitive plainly: “retry this 20-step agent run from step 10.” Nobody ships that. So teams duct-tape frameworks together. “Stitching it,” another engineer wrote, “still takes too much custom glue.”
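The primitive that respondent asked for is easy to describe. Here is a rough sketch of it, assuming each step is a plain function from state to state and that the state is JSON-serializable. `run_with_checkpoint` and the checkpoint file layout are hypothetical, and a production version would need atomic writes and idempotent steps, which this one does not attempt.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Step = Callable[[dict], dict]


def run_with_checkpoint(steps: list[Step], checkpoint: Path, initial: dict) -> dict:
    """Run steps in order, saving state after each one so that a rerun
    resumes at the first step that has not completed."""
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
        next_index, state = saved["next_index"], saved["state"]
    else:
        next_index, state = 0, initial
    for index in range(next_index, len(steps)):
        state = steps[index](state)  # if this raises, the checkpoint still points here
        checkpoint.write_text(json.dumps({"next_index": index + 1, "state": state}))
    return state


# Demo: a 20-step run whose step 10 fails once, then succeeds on the rerun.
executed: list[int] = []
flaky = {"failed_once": False}


def make_step(number: int) -> Step:
    def step(state: dict) -> dict:
        if number == 10 and not flaky["failed_once"]:
            flaky["failed_once"] = True
            raise RuntimeError("transient failure at step 10")
        executed.append(number)
        return {**state, "last_step": number}
    return step


steps = [make_step(n) for n in range(1, 21)]
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "run.json"
    try:
        run_with_checkpoint(steps, checkpoint, {})
    except RuntimeError:
        pass  # first attempt dies at step 10; steps 1 through 9 are checkpointed
    final_state = run_with_checkpoint(steps, checkpoint, {})
```

In the demo, the second call picks up at step 10 without re-running the first nine steps. That is the whole request.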
45% named evaluation as their single biggest unsolved problem. The hard part isn’t testing individual LLM calls — it’s running evals across the entire system, end-to-end, with state and tool calls and branching. 57.9% would rather build their own eval framework than buy an existing one. Not because they want to. Because nothing they’ve found does what they need.
What’s striking isn’t any individual number. It’s that the three gaps describe the same hole from three angles. Observability that doesn’t surface patterns. Orchestration that can’t recover from a partial failure. Evaluation that can’t measure the system end-to-end. These aren’t three problems. They’re one problem — the missing operational layer — measured by three groups of people who don’t yet have a shared name for what they’re missing.
I’ve spent the last year auditing production Langfuse projects. We’ve crossed 100,000 traces across voice AI, agentic tooling, healthcare pipelines, and extraction services. The patterns the Felicis survey describes aren’t visible only from the top down. They show up in every audit. Redundant API calls because nobody had a cache. System prompts re-sent on every turn because nobody had configured prompt caching. Five GPT-4o versions running simultaneously because pinning was easier than maintaining. Agentic workflows accumulating $50 in a single trace because the operational layer that should have summarized intermediate context wasn’t there.
The leaders in the survey and the production systems I see in audits are looking at the same thing. They just don’t have the vocabulary to coordinate.
We’ve done this before
The agile movement spent fifteen years building automated testing, feature flags, blue-green deploys, canaries, and continuous delivery before deploy-on-merge became safe enough to be the default. I lived through the front half of that arc. The pattern was always the same: a new way of shipping arrived, the old operational discipline didn’t fit, and the industry spent a decade reverse-engineering what “done” and “broken” should mean in the new model.
Big Bang Oracle releases. Friday-night dread. The rollback Saturday. The post-mortem Monday. The connection between that older pattern and the modern foundation-model-version-swap dread is direct, and I’ve written about it elsewhere. The shape of the failure is the same. So is the shape of the fix.
The fix was never a new framework. It was operational discipline written down where everyone could read it. Test suites. Runbooks. Smoke tests. Verification scripts. The lift wasn’t in inventing new tools. It was in agreeing on what done meant and refusing to call something done until it passed.
The contrarian bet
Here’s what I think is the easy bet to miss right now.
Coker’s three gaps will eventually become VC-funded categories. Inference observability will get its Datadog. Agent orchestration will get its Spinnaker. End-to-end evaluation will get something — maybe two or three things — that finally does for AI what unit tests did for code. The market will sort it out. Some bets will pay off. Most won’t. The category-defining product for each of these gaps doesn’t exist yet, which is exactly why VCs are paying attention.
But the teams that pull ahead aren’t waiting for the category-defining product. They’re building the operational layer with what they already have.
And the operational primitive we keep returning to — across hundreds of customer trajectories and a year of audits — is a markdown file with a rubric and a bash verification script at the bottom.
That’s not a slogan. It’s what survives load. The sophistication isn’t in the format. It’s in what the document encodes: what “done” means, how to evaluate against it, what to do when it falls short, when to stop trying. We call this a runbook because the older shape it most resembles is the operations runbook on the wall behind a NOC. But it’s also a test suite. And a spec. And a quality gate. And a refusal to call something done until it passes.
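For readers who want to see the shape, here is an illustrative sketch. The runbook text, its rubric items, and the path `out/step6.json` are invented for the example, and the helper names are hypothetical. It assumes the checker is the last fenced bash block in the file and that `bash` is on the PATH.

```python
import re
import subprocess
from typing import Callable

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block

# A made-up runbook: rubric on top, verification script at the bottom.
RUNBOOK = f"""# Runbook: extract invoices (hypothetical)

## Rubric
- [ ] All six steps produced an output file
- [ ] Row count matches the source manifest

## Verify
{FENCE}bash
test -s out/step6.json
{FENCE}
"""


def extract_checker(markdown: str) -> str:
    """Return the body of the last fenced bash block in the runbook."""
    pattern = re.compile(
        rf"^{FENCE}bash\n(.*?)^{FENCE}\s*$", re.DOTALL | re.MULTILINE
    )
    blocks = pattern.findall(markdown)
    if not blocks:
        raise ValueError("runbook has no bash verification block")
    return blocks[-1]


def run_bash(script: str) -> int:
    """Run the checker with bash and return its exit code."""
    return subprocess.run(["bash", "-c", script]).returncode


def is_done(markdown: str, runner: Callable[[str], int] = run_bash) -> bool:
    """The task counts as done only when the runbook's own checker exits 0."""
    return runner(extract_checker(markdown)) == 0


print(extract_checker(RUNBOOK))
```

Passing the runner in as a parameter keeps the gate testable without a shell, and it makes the rule explicit: exit code zero or it is not done.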
The contrast is funny. The industry is racing to build increasingly elaborate agent frameworks — graph planners, multi-agent orchestrators, memory layers, MCP everything — while the most reliable primitive in our toolbox is plain text with a checker. Markdown works with every agent. It’s version-controlled. It’s diffable. It’s reviewable in a pull request. It doesn’t need a runtime. It’s the lightest possible structure that still encodes the operational discipline a dark factory actually requires.
The dark factory is coming. The lights are going off in software the way they went off in manufacturing. But Oshino doesn’t run because the robots are smart. It runs because the deterministic decision layer around them never lets a bad part through.
If you’re building the AI version of that — and if you’re shipping agents into production, you are, whether you’re using that vocabulary or not — the operational half of the stack is your problem. Not Coker’s category-defining vendor. Not next quarter’s better orchestration framework. Yours. This week.
You need four pieces. The dashboard, which most teams already have. The runbook, which does the operational lift underneath. The PR, which is the artifact the team reviews and merges. And — bluntly — the eval. That’s your brake pedal.
Build those four. The rest of the stack can be half-built indefinitely.
What I keep wondering is where the ceiling is. If a markdown file with a checker is good enough for everything we’ve thrown at it so far, where does it break — and what does an agent task look like when plain text and a bash script stop being enough?


