Research Closes the Loop. Production Keeps Us In It.
Why we kept the outer loop open by design
An ICLR Oral just showed that reflective prompt evolution beats reinforcement learning.
The paper, GEPA, describes a search procedure where a language model reads its own rollouts and proposes prompt edits in plain English. The search uses those edits as its mutation operator. The result beats GRPO — the RL baseline from DeepSeekMath — across three benchmarks: HotpotQA, AIME, IFBench. Fewer rollouts. No reward model.
The paper is right. I want to say that first, because most of what follows could read like a disagreement, and it isn’t. The inner generate-judge-refine loop GEPA describes is the same loop a Jetty runbook runs on every execution. The mechanism is real and it generalizes.
What’s interesting isn’t whether GEPA validates the runbook approach. It does. What’s interesting is the part of the loop GEPA closes that we deliberately keep open.
The inner loop is generator-verifier
GEPA’s contribution is the reflection step. A language model reads its rollouts and identifies the failure pattern in natural language. That diagnosis becomes the mutation operator the search uses to propose the next prompt.
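To make the reflection step concrete, here is a minimal sketch of reflection-as-mutation. It is a greedy hill climb, not the paper's full search procedure, and every name in it (`reflective_search`, `run`, `score`, `reflect`) is a hypothetical stand-in: `run` and `reflect` would be LLM calls, and `score` is whatever independent metric the task provides.

```python
from typing import Callable, List, Tuple

# (example, output, score) for one execution of a prompt on one example.
Rollout = Tuple[str, str, float]


def reflective_search(
    seed_prompt: str,
    examples: List[str],
    run: Callable[[str, str], str],                 # hypothetical: (prompt, example) -> output
    score: Callable[[str, str], float],             # hypothetical: (example, output) -> score
    reflect: Callable[[str, List[Rollout]], str],   # hypothetical: (prompt, rollouts) -> edited prompt
    budget: int = 5,
) -> Tuple[str, float]:
    """Greedy search where the mutation operator is a model-written prompt edit."""
    if not examples:
        raise ValueError("examples must be non-empty")

    def evaluate(prompt: str) -> Tuple[float, List[Rollout]]:
        rollouts = []
        for example in examples:
            output = run(prompt, example)
            rollouts.append((example, output, score(example, output)))
        mean = sum(s for _, _, s in rollouts) / len(rollouts)
        return mean, rollouts

    best_prompt = seed_prompt
    best_score, best_rollouts = evaluate(best_prompt)
    for _ in range(budget):
        # Reflection: the model reads the rollouts and returns an edited prompt.
        candidate = reflect(best_prompt, best_rollouts)
        cand_score, cand_rollouts = evaluate(candidate)
        # Keep the edit only if the independent metric says it helped.
        if cand_score > best_score:
            best_prompt, best_score, best_rollouts = candidate, cand_score, cand_rollouts
    return best_prompt, best_score
```

The part worth noticing is that `reflect` never gets to decide whether its own edit was good. That call belongs to `score`.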
This isn’t a one-off result. It’s the third optimizer the DSPy team has shipped, after bootstrap-fewshot and MIPROv2. The closest methodological cousin is TextGrad — backpropagating textual feedback through compound LLM systems. The evolutionary ancestor is Promptbreeder. Three years of lineage, not a single ICLR moment.
The thing that makes any of it work is generator-verifier asymmetry. I’ve written about this before — producing a good answer is hard; recognizing one is cheap; the gap between the two is the only reason iterative loops ever converge. The verifier — a rubric, a benchmark, a unit test — has to stay independent of the thing generating candidates. Without that independence, you’re not climbing a gradient. You’re staring into a mirror.
Our runbooks run this exact loop on every execution. The rubric step (“what does done look like, how do we check”) is the verifier. The bounded retry budget is the search. The agent reads its rubric failure and tries again. It stops when the rubric clears or the budget runs out. That’s GEPA’s inner loop, applied to the artifact instead of the prompt.
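The shape of that loop fits in a few lines. This is an illustrative sketch, not Jetty's implementation: `RubricResult`, `run_with_rubric`, `generate`, and `check` are hypothetical names, with `generate` standing in for the agent and `check` for the rubric.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RubricResult:
    passed: bool
    failures: List[str] = field(default_factory=list)


def run_with_rubric(
    generate: Callable[[List[str]], str],   # hypothetical agent: feedback -> artifact
    check: Callable[[str], RubricResult],   # hypothetical rubric: artifact -> verdict
    max_attempts: int = 3,
) -> Tuple[str, int, bool]:
    """Retry until the rubric clears or the budget runs out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: List[str] = []
    artifact = ""
    for attempt in range(1, max_attempts + 1):
        artifact = generate(feedback)
        result = check(artifact)
        if result.passed:
            return artifact, attempt, True
        # The next attempt sees exactly which rubric items failed.
        feedback = result.failures
    return artifact, max_attempts, False
```

Note that `check` is passed in and never modified inside the loop. The generator can read the verdict but cannot rewrite the rubric.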
Two lenses on the outer loop
GEPA and runbooks split on what happens across runs, not within one. Research optimizes for “did the score go up.” Production optimizes for “did the right people see the change before it shipped.”
In research you have a benchmark. The benchmark is ground truth. Closing the loop by auto-evolving the prompt against rollout data is unambiguously good. The eval signal is trustworthy and the cost of a regression is “the number went down on a graph.”
In production you have a runbook running brand-compliance review on every marketing draft. The rubric is your team’s judgment compressed into prose. The eval signal is a proxy. It encodes what you decided “good” meant three months ago.
Auto-evolving against that proxy means the runbook can drift in a direction nobody on the team noticed. Worse: if the system can rewrite the rubric and optimize against it, the verifier stops being independent of the generator. The asymmetry that made the inner loop work breaks down. You haven’t built a self-improving system. You’ve built a self-justifying one.
We could automate this. We chose not to.
This is the load-bearing point.
Jetty has all the parts. `/optimize-runbook` already reads trajectories and proposes runbook edits. Routines fire on a cron. Wiring up nightly auto-evolution against last week’s trajectories is a few config lines. The trajectory storage GEPA’s approach needs as a substrate is already first-class.
We didn’t ship that as the default. What ships is a runbook in git. Someone runs `/optimize-runbook` when they suspect drift. The diff lands in a PR like any other change. Closed loop available; open loop default.
This isn’t theoretical. Decagon ran GEPA on real customer-service prompts and wrote up what happened. Naive runs produced ~5,000-character prompts that overfit to small reflection sets. Smaller reflection models broke outright. The optimal sample range turned out to be 20–100, not “more is better.”
Their fix amounted to code review for prompts. They added a holdout set so the optimizer couldn’t lie about its own progress. Length regularization stopped the prompts from sprawling. They started treating the whole optimization like the test-driven engineering loops everyone already trusts for shipping software.
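Those two guards compose into a single acceptance gate. The sketch below is my reading of the idea, not Decagon's code: the function names, the 2,000-character soft limit, and the penalty rate are all assumptions chosen for illustration, and `holdout_score` is a hypothetical callable that scores a prompt on examples the reflection step never saw.

```python
from typing import Callable


def regularized_score(
    raw_score: float,
    prompt: str,
    soft_limit: int = 2000,          # assumed value, tune per task
    penalty_per_char: float = 1e-4,  # assumed value, tune per task
) -> float:
    """Subtract a penalty for every character past the soft limit."""
    overflow = max(0, len(prompt) - soft_limit)
    return raw_score - penalty_per_char * overflow


def accept_candidate(
    current_prompt: str,
    candidate_prompt: str,
    holdout_score: Callable[[str], float],  # hypothetical: prompt -> score on held-out examples
) -> bool:
    """Accept only if the candidate wins on held-out data after the length penalty."""
    current = regularized_score(holdout_score(current_prompt), current_prompt)
    candidate = regularized_score(holdout_score(candidate_prompt), candidate_prompt)
    return candidate > current
```

With these numbers, a 5,000-character candidate has to beat the current prompt by 0.3 on the holdout set before it is accepted, which is the point: sprawl has to pay for itself.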
That’s the production reality the academic version doesn’t have to think about.
The merge step is the feature
Production runbooks aren’t AIME problems. They’re operational artifacts that govern real outputs to real people, and the reasons to keep a human in the merge step are the same reasons engineering teams keep humans in the merge step for code.
Someone has to know the rubric tightened. Otherwise the team is debugging “why did our outputs change last Tuesday” with no change to point at. A runbook that silently rewrites itself can’t be pinned to a sprint or to an experiment, much less an audit window — you lose the ability to say “this output came from runbook v1.3.” When the runbook ships a bad output, “who changed this and why” needs an answer. A diff in git with a reviewer attached has that answer. A self-evolving prompt does not.
This is the PR-is-the-product thesis applied to the runbook itself. The runbook diff is the artifact your team reviews. The fact that an agent proposed the diff doesn’t change who owns the merge.
Engineers already trust this interface. Auto-formatters write code. Linters fix style. Your test runner can fail a build with no human signoff at all. All closed loops. Auto-merge to main? Not yet. Branch protection, CODEOWNERS, required reviewers — that’s the merge-policy interface that makes every other closed loop safe to ship. Without it, every commit is an unsupervised optimization against an untrusted verifier.
What’s actually open
GEPA’s claim isn’t wrong for its domain. If you’re tuning a prompt against a benchmark with trustworthy ground truth, closing the loop is the right move. The literature is also more contested than the ICLR headline suggests: Benjamin Anderson’s structural critique argues that agents don’t have the locality property that modular optimization assumes.
The thread to pull on is narrower. Where in the runbook lifecycle does the loop close, and where does the human stay? The honest answer depends on the cost of a silent regression. A runbook that drafts internal Slack messages can probably auto-merge proposed changes. A runbook that touches customer comms cannot. The interesting design question isn’t “should we automate.” It’s: what’s the merge-policy interface for runbooks, and what does it look like to declare it explicitly?
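One way to picture an explicit declaration is the sketch below. It describes nothing that exists in Jetty today. `RunbookPolicy`, `MergeAction`, and `merge_action` are hypothetical names, and the two rules are assumptions drawn from the argument above: customer-facing runbooks get a reviewer, and so does any change that edits the rubric, because that change rewrites the verifier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class MergeAction(Enum):
    AUTO_MERGE = "auto_merge"
    REQUIRE_REVIEW = "require_review"


@dataclass(frozen=True)
class RunbookPolicy:
    runbook: str
    customer_facing: bool
    reviewers: Tuple[str, ...] = ()


def merge_action(policy: RunbookPolicy, touches_rubric: bool) -> MergeAction:
    """Decide what happens to an agent-proposed runbook diff."""
    # Rubric edits change the verifier itself, so they always get a human.
    if policy.customer_facing or touches_rubric:
        return MergeAction.REQUIRE_REVIEW
    return MergeAction.AUTO_MERGE


slack = RunbookPolicy("internal-slack-digest", customer_facing=False)
comms = RunbookPolicy("customer-email", customer_facing=True, reviewers=("brand-team",))
```

Here `merge_action(slack, touches_rubric=False)` returns `MergeAction.AUTO_MERGE`, while the `comms` runbook returns `MergeAction.REQUIRE_REVIEW` regardless of what the diff touches.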


