Valley-Dodging: The Half of Agent Reliability You're Not Optimizing
Why your best runbook lines were never about the score
You can ship an agent that passes every eval you wrote and still get paged at 2am, because a runbook scoring 88% with a green test suite tells you nothing about the one run in twenty where it does something you’d never have signed off on: it wipes a directory it had no business touching, or drops a customer record into a log where anyone can read it. Almost everything we do to make an agent better is aimed at lifting those scores and improving our workflow. We chase the number up: we hill climb. But chasing the number is only half the story.
The people who train models for a living have learned the other half the hard way, and I spent a few years as one of them before I built developer tools. The signal that kept causing trouble was never the reward for good behavior; it was the penalty for bad behavior. What you do with the runs that went wrong, it turns out, matters more and behaves worse than what you do with the runs that went right.
When the downside can’t be undone
Why should one run in twenty outweigh the other nineteen? Because some failures don’t average out. Nassim Taleb calls the kind that matters an absorbing barrier: a state that, in his words, “prevents people with skin in the game from emerging from it,” a loss that swallows every gain that came before it and forecloses every gain that might have come after. Bankruptcy is one. So is a deleted production database, or a customer record that has already been scraped out of a log. Once an agent can reach a state like that, the average stops being a number you can trust, because Taleb’s rule for ruin is blunt: when the downside is unrecoverable, cost-benefit analysis no longer applies. Sequence is what does the damage, since ninety clean runs followed by the one that wipes the database never settle into a comfortable expected value: after the wipe, there is no ninety-first run.
Why the negative signal is worth the danger
None of this means you should simply suppress the bad runs, because the negative signal is also where a surprising amount of the improvement comes from: penalizing failure, done well, moves a model toward good behavior faster than reward alone can. The trouble is that it is powerful in both directions. When you reward good trajectories at full strength the worst case is dull, because nothing in that direction runs away and the model just learns slowly; when you penalize bad trajectories the same careless way the math comes apart, because the penalty term has no floor and will drive the model into garbage if you let it. My own brush with this was a paper, Tapered Off-Policy REINFORCE, with Marc Bellemare and Nicolas Le Roux, whose whole result was a way to keep the penalty’s value while bounding its blast radius. But TOPR is one data point in a broad body of work that keeps reaching the same conclusion: the positive and negative signals need different machinery, and the negative one is the half that demands care.
Valley-dodging is the other half
So the number is not enough, because chasing a better average is hill-climbing, and reliability needs the opposite motion, which barely has a name. Call it valley-dodging: the work of keeping the agent away from the runs you can’t take back. You are already doing it by hand. Open any runbook that has survived real traffic and you’ll find it full of lines like “do not skip this step,” “never write outside the results directory,” “do not report success if a check failed.” Nobody wrote those to raise a score; each one is a fence around a specific cliff something fell off of once, in a way that hurt. (I keep a “never run git push” line in one runbook because an agent, told to commit its work, once helpfully pushed a half-finished branch to main.) Given that the downside is an absorbing barrier, treating a catastrophic failure as far heavier than an ordinary miss isn’t a bias to correct for; it’s just correct accounting.
The tooling, though, still only knows how to climb. When `/optimize-runbook` reads your trajectories, it finds what’s dragging the pass rate down and proposes edits that pull it back up, which works right up until it has to tell a slightly worse answer apart from an unrecoverable one, because both reach it as the same lower number. And the answer is not to pile on prohibitions, because if you stack enough hard “never do this” rules the agent goes brittle: it starts refusing reasonable asks, or it honors the letter of every rule while walking past the point of the task. That is the same failure as the unbounded penalty in training, so the craft is the same: keep the negative signal, but bound it.
What I keep turning over is whether a runbook should declare its failure modes the way it declares its steps, as a short typed list with severities, so the optimize loop could weigh one catastrophe against ten ordinary misses instead of folding everything into a single pass/fail number.




