<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Jetty Blog: Ground Truth]]></title><description><![CDATA[Lessons learned building the most reliable agent workflows.]]></description><link>https://blog.jetty.io</link><image><url>https://substackcdn.com/image/fetch/$s_!B6E6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cf333-50e2-43ea-b074-37304c7162cc_1280x1280.png</url><title>The Jetty Blog: Ground Truth</title><link>https://blog.jetty.io</link></image><generator>Substack</generator><lastBuildDate>Mon, 08 Jun 2026 04:31:57 GMT</lastBuildDate><atom:link href="https://blog.jetty.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Jonathan Lebensold]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[lebensold@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[lebensold@substack.com]]></itunes:email><itunes:name><![CDATA[Jonathan Lebensold]]></itunes:name></itunes:owner><itunes:author><![CDATA[Jonathan Lebensold]]></itunes:author><googleplay:owner><![CDATA[lebensold@substack.com]]></googleplay:owner><googleplay:email><![CDATA[lebensold@substack.com]]></googleplay:email><googleplay:author><![CDATA[Jonathan Lebensold]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[A Pelican Learns to Ride]]></title><description><![CDATA[What happens when every iteration of a benchmark is on disk]]></description><link>https://blog.jetty.io/p/a-pelican-learns-to-ride</link><guid isPermaLink="false">https://blog.jetty.io/p/a-pelican-learns-to-ride</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Fri, 22 
May 2026 16:47:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cbE5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>People ask, occasionally, why our logo is a pelican. A company called Jetty, a bird on the dock &#8212; there must be a story. The honest answer is anticlimactic, and I&#8217;ll get to it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cbE5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cbE5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cbE5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cbE5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cbE5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!cbE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:932059,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/198865874?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cbE5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cbE5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cbE5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cbE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c2c296-792e-4a9c-b7b8-9959184d4cd6_1408x768.jpeg 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The reason the question comes up at all is <a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/">Simon Willison</a>. For the last year or so, he&#8217;s been asking every new model to generate an SVG of a pelican riding a bicycle and posting the results. It&#8217;s become the unofficial visual reasoning test for frontier models. So when people see the Jetty mascot, the assumption is that we picked the pelican as a wink at the benchmark.</p><p>We didn&#8217;t. 
<a href="https://www.linkedin.com/in/taranakhjavani/">Our designer</a> worked through a stack of jetty-adjacent motifs &#8212; boats, ropes, lighthouses, a few different seabirds &#8212; and the pelican was the one the team kept coming back to. That decision predates the benchmark mattering to us. The overlap is coincidence. A good one, but coincidence.</p><p>It is, however, the kind of coincidence you don&#8217;t ignore.</p><h2>So we ran the benchmark</h2><p>Simon&#8217;s test is a good one. It exercises layout, anatomy, two distinct subjects, motion, and the model&#8217;s ability to keep coherent state across hundreds of XML elements. It&#8217;s visual enough that you can tell at a glance whether the model got it.</p><p>We ran the same task eighteen times &#8212; eleven times with one model while iterating the prompt, seven times with one prompt while iterating the model &#8212; and captured every trajectory. The result is at <a href="https://pelicans.jetty.bot">pelicans.jetty.bot</a>. Seven agent/model combinations. Seventy-one runbooks across seven lineages. 
Every iteration replayable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://pelicans.jetty.bot/head-to-head#hermes-flash/v1" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G18M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 424w, https://substackcdn.com/image/fetch/$s_!G18M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 848w, https://substackcdn.com/image/fetch/$s_!G18M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 1272w, https://substackcdn.com/image/fetch/$s_!G18M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G18M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png" width="1202" height="906" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1202,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:677183,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://pelicans.jetty.bot/head-to-head#hermes-flash/v1&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/198865874?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G18M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 424w, https://substackcdn.com/image/fetch/$s_!G18M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 848w, https://substackcdn.com/image/fetch/$s_!G18M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 1272w, https://substackcdn.com/image/fetch/$s_!G18M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a7a74-c2e5-4fcd-9edf-ae30d70f3cbb_1202x906.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Hermes Agent + Gemini 3.5: the chonkiest pelican on the jetty.</figcaption></figure></div><p>The point isn&#8217;t the pelicans. The point is what you can do once each attempt is a structured artifact instead of a screenshot in a tweet.</p><h2>Three views</h2><p><strong>The Climb</strong> ranks all seven agent/model combos on the same v1 runbook. Scatter chart, best-of-ten table. 
You can see the shape of how each agent&#8217;s score moved across its ten rounds &#8212; not as a final number, but as a curve.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://pelicans.jetty.bot/head-to-head#opencode/v1" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C6un!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 424w, https://substackcdn.com/image/fetch/$s_!C6un!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 848w, https://substackcdn.com/image/fetch/$s_!C6un!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 1272w, https://substackcdn.com/image/fetch/$s_!C6un!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C6un!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png" width="1456" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1135819,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://pelicans.jetty.bot/head-to-head#opencode/v1&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/198865874?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C6un!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 424w, https://substackcdn.com/image/fetch/$s_!C6un!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 848w, https://substackcdn.com/image/fetch/$s_!C6un!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 1272w, https://substackcdn.com/image/fetch/$s_!C6un!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f81a96b-9ef7-4f00-98d2-c0a8d440931b_2426x944.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">OpenCode hill-climbing</figcaption></figure></div><p><strong>Head-to-Head</strong> is a filmstrip viewer. Arrow keys step through rounds; up/down switches agents. You&#8217;re watching the same task unfold ten times in parallel for each lineage, and the convergence is its own kind of evidence. Some lineages crawl. Some snap into shape on round three and then drift.</p><p><strong>Runbook Diffs</strong> is the view I keep coming back to. The runbook carries a baseline SVG embedded as a seed &#8212; so iterating the runbook means editing that seed plus the targeted prompt asks. Pick any two versions across the seven lineages and diff them line by line, side-by-side or unified. It looks exactly like reviewing a pull request. 
Because that&#8217;s what it is.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://pelicans.jetty.bot/runbooks#claude-sonnet/v4-v5" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p-g3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 424w, https://substackcdn.com/image/fetch/$s_!p-g3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 848w, https://substackcdn.com/image/fetch/$s_!p-g3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!p-g3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p-g3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png" width="1456" height="1105" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1105,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:375774,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://pelicans.jetty.bot/runbooks#claude-sonnet/v4-v5&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/198865874?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p-g3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 424w, https://substackcdn.com/image/fetch/$s_!p-g3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 848w, https://substackcdn.com/image/fetch/$s_!p-g3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!p-g3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1754c57-b2fb-41ab-bc36-71081f5392da_1982x1504.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Jetty&#8217;s agent is hill climbing to a better pelican runbook.</figcaption></figure></div><h2>The judge gap</h2><p>A gemini-cli run scored 40 out of 40 on the rubric. Pelican, bicycle, composition, polish &#8212; four perfect tens.</p><p>Then we re-judged the same SVG with Claude Sonnet. It came back at 37.</p><p>Same image. Same four-axis rubric. Different judge. Gemini Flash, used as the judge, awards full marks freely. Sonnet, scoring the same SVG against the same axes, caps out around 37 and rarely gives a perfect score. The temptation is to call one of them right. Neither of them is. They&#8217;re calibrated differently. 
Flash&#8217;s 40 isn&#8217;t the same number as Sonnet&#8217;s 40, and pretending otherwise is how leaderboards lie.</p><p>This is the practitioner problem with LLM-as-judge that nobody puts on the box. At some point you&#8217;ll want to compare scores across two evaluations &#8212; same rubric, different runs &#8212; and reach a conclusion. If the judge changed underneath you, even silently, the numbers are no longer on the same axis. You either lock the judge for the lifetime of the comparison, or you design the rubric so swapping doesn&#8217;t matter. Most rubrics don&#8217;t survive that test.</p><p>We didn&#8217;t fix it. We just made the gap visible.</p><h2>What this kind of artifact is for</h2><p>There&#8217;s a broader thing here that has nothing to do with pelicans.</p><p>Most benchmark posts give you a headline number and a screenshot. Sometimes a GitHub repo with the prompts. What you can&#8217;t do is replay the attempt, see what the model produced on round three before converging, or diff version four of the runbook against version seven and ask which edit moved the score. The interesting questions live in the trajectories, and the trajectories don&#8217;t exist.</p><p>When every iteration is captured, the benchmark becomes inspectable in the way code is inspectable. The runbook is the source. The trajectory is the build artifact. The diff between two runbook versions is a PR you can review, comment on, and revert.</p><p>The pelican project is small enough to fit on one page and weird enough to share. 
The structure underneath is the part worth stealing.</p>]]></content:encoded></item><item><title><![CDATA[Lights-Out Manufacturing Had a Brake Pedal]]></title><description><![CDATA[The lights are going off in software the same way they went off in manufacturing]]></description><link>https://blog.jetty.io/p/lights-out-manufacturing-had-a-brake</link><guid isPermaLink="false">https://blog.jetty.io/p/lights-out-manufacturing-had-a-brake</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Sat, 16 May 2026 14:26:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oO_o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>An agent I was watching ran a six-step extraction pipeline. It made it to step three, hit a schema mismatch, and &#8220;recovered&#8221; by skipping to step five. Then it wrote a polished summary citing results from a step that never ran.</p><p>$54 in API spend. Three days for someone to figure out why the downstream dataset was corrupted. The trace was sitting in Langfuse the whole time.</p><p>I spent this week at Web Summit in Vancouver talking to engineering leaders driving AI adoption inside their companies. Every one of them had a version of this story &#8212; and the common thread was that the agent habits forming inside their dev environments are landing in production unchanged. There&#8217;s no staging environment for agent behavior. A skill that silently skips a step on a developer&#8217;s laptop silently skips it in production &#8212; at customer scale, with customer data.</p><p>This is the failure mode missing from the conversation about autonomous AI. Not &#8220;the model can&#8217;t do the task.&#8221; The model can do the task. 
The system around it doesn&#8217;t know when the task wasn&#8217;t done &#8212; and the same gap that costs $54 in development costs orders of magnitude more in production.</p><h2>The lights are going off anyway</h2><p>There&#8217;s a name for what the industry is building toward. The dark factory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oO_o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oO_o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oO_o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oO_o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oO_o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oO_o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:989039,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/198002631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oO_o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oO_o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oO_o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oO_o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dce14b7-a173-4b7c-ac24-5e25b43716f3_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>At the foot of Mount Fuji, FANUC&#8217;s Oshino plant produces fifty industrial robots per twenty-four-hour shift and runs unmanned for up to thirty days at a stretch. Gary Zywiol, the company&#8217;s vice president, has a quote about it that&#8217;s now part of the canon: &#8220;Not only is it lights-out, we turn off the air conditioning and heat too.&#8221; Siemens runs lights-out plants. Xiaomi opened an 81,000-square-meter unmanned smartphone facility in 2024 that produces ten million units a year with no humans on shift. The <a href="https://ifr.org/">International Federation of Robotics</a> put China at over two million factory robots in 2024 &#8212; fifty-four percent of global demand. 
A <a href="https://journals.sagepub.com/doi/10.1177/09544054241305826">peer-reviewed survey in the <em>Journal of Engineering Manufacture</em></a> this year called lights-out factories &#8220;the pinnacle of manufacturing advancement.&#8221;</p><p>The software industry is racing toward the same idea from the opposite direction. BCG Platinion has been writing about <a href="https://www.bcgplatinion.com/insights/the-dark-software-factory"><em>The Dark Software Factory</em></a> &#8212; AI agents handling planning, coding, testing, deployment, with no human on the PR. Stripe is <a href="https://www.lennysnewsletter.com/p/this-week-on-how-i-ai">already operating at 1,300-plus AI-authored pull requests per week</a>. The destination isn&#8217;t theoretical. The lights are going off.</p><p>Here&#8217;s what I keep coming back to about Oshino, Xiaomi, and every other plant where the lights are already off: they work because the operational half of the stack existed first.</p><h2>What made lights-out actually work</h2><p>Specified tolerances and statistical process control. QA loops you instrument before you turn off the lights. Shutdown criteria &#8212; what halts the line, what alarms, what gets quarantined. Calibration schedules nobody can override.</p><p>None of these are sexy. All of them existed before the humans left the floor. Olivier and Craig&#8217;s <a href="https://ieeexplore.ieee.org/document/8095515/">2017 IEEE AFRICON paper</a> <em>Lights-out process control &#8212; analysis and framework</em> puts it bluntly: every unmanned production line is wrapped in a deterministic decision layer that knows when to halt, alarm, retry, or quarantine. The robots aren&#8217;t the achievement. The discipline around the robots is.</p><p>The manufacturing industry spent forty years building that layer. 
There&#8217;s <a href="https://ieeexplore.ieee.org/document/8217070/">another IEEE paper from 2017</a>, <em>Industrial robotics in factory automation: from the early stage to the Internet of Things</em>, that traces the arc &#8212; from the first numerically controlled machines through the SCADA era and on into the modern smart factory. The throughline is the same in every chapter. The operational layer always shows up before the lights go off. Not after.</p><p>AI is doing it backwards. The agents are loose in production. The operational primitives &#8212; observability that surfaces patterns instead of just collecting data, durable orchestration that survives a crash on step seventeen, end-to-end evals that catch regressions before customers do, output verification that knows what &#8220;done&#8221; looks like &#8212; are still being hand-built by every team independently.</p><p>Dark factories without observability and evaluation are motion without progress.</p><h2>What&#8217;s actually missing</h2><p>In May, Tobi Coker at Felicis published <a href="https://www.felicis.com/blog/the-ai-stack-is-half-built"><em>The AI Stack Is Half-Built</em></a>, a survey of twenty-three AI-native engineering leaders that puts numbers on this. It&#8217;s worth reading in full.</p><p>The headline that&#8217;s stayed with me: 69.6% of teams doubled their inference spend in six months. 34.8% now report inference running at five times their training cost. And 56.5% of those teams are managing the fastest-growing line item in modern software with home-built spreadsheets and custom dashboards. The infrastructure category that should exist doesn&#8217;t &#8212; yet.</p><p>47.8% of the surveyed teams are running autonomous agents in production. One respondent named the missing primitive plainly: &#8220;retry this 20-step agent run from step 10.&#8221; Nobody ships that. So teams duct-tape frameworks together. 
&#8220;Stitching it,&#8221; another engineer wrote, &#8220;still takes too much custom glue.&#8221;</p><p>45% named evaluation as their single biggest unsolved problem. The hard part isn&#8217;t testing individual LLM calls &#8212; it&#8217;s running evals across the entire system, end-to-end, with state and tool calls and branching. 57.9% would rather build their own eval framework than buy an existing one. Not because they want to. Because nothing they&#8217;ve found does what they need.</p><p>What&#8217;s striking isn&#8217;t any individual number. It&#8217;s that the three gaps describe the same hole from three angles. Observability that doesn&#8217;t surface patterns. Orchestration that can&#8217;t recover from a partial failure. Evaluation that can&#8217;t measure the system end-to-end. These aren&#8217;t three problems. They&#8217;re one problem &#8212; the missing operational layer &#8212; measured by three groups of people who don&#8217;t yet have a shared name for what they&#8217;re missing.</p><p>I&#8217;ve spent the last year auditing production Langfuse projects. We&#8217;ve crossed 100,000 traces across voice AI, agentic tooling, healthcare pipelines, and extraction services. The patterns the Felicis survey describes aren&#8217;t visible only from the top down. They show up in every audit. Redundant API calls because nobody had a cache. System prompts re-sent on every turn because nobody had configured prompt caching. Five GPT-4o versions running simultaneously because pinning was easier than maintaining. Agentic workflows accumulating $50 in a single trace because the operational layer that should have summarized intermediate context wasn&#8217;t there.</p><p>The leaders in the survey and the production systems I see in audits are looking at the same thing. 
They just don&#8217;t have the vocabulary to coordinate.</p><h2>We&#8217;ve done this before</h2><p>The agile movement spent fifteen years building automated testing, feature flags, blue-green deploys, canaries, and continuous delivery before deploy-on-merge became safe enough to be the default. I lived through the front half of that arc. The pattern was always the same: a new way of shipping arrived, the old operational discipline didn&#8217;t fit, and the industry spent a decade reverse-engineering what &#8220;done&#8221; and &#8220;broken&#8221; should mean in the new model.</p><p>Big Bang Oracle releases. Friday-night dread. The rollback Saturday. The post-mortem Monday. The connection between that older pattern and the modern foundation-model-version-swap dread is direct, and I&#8217;ve <a href="https://jetty.io/p/ci-for-ai">written about it elsewhere</a>. The shape of the failure is the same. So is the shape of the fix.</p><p>The fix was never a new framework. It was operational discipline written down where everyone could read it. Test suites. Runbooks. Smoke tests. Verification scripts. The lift wasn&#8217;t in inventing new tools. It was in agreeing on what done meant and refusing to call something done until it passed.</p><h2>The contrarian bet</h2><p>Here&#8217;s what I think is the easy bet to miss right now.</p><p>Coker&#8217;s three gaps will eventually become VC-funded categories. Inference observability will get its Datadog. Agent orchestration will get its <a href="https://spinnaker.io/">Spinnaker</a>. End-to-end evaluation will get something &#8212; maybe two or three things &#8212; that finally does for AI what unit tests did for code. The market will sort it out. Some bets will pay off. Most won&#8217;t. The category-defining product for each of these gaps doesn&#8217;t exist yet, which is exactly why VCs are paying attention.</p><p>But the teams that pull ahead aren&#8217;t waiting for the category-defining product. 
They&#8217;re building the operational layer with what they already have.</p><p>And the operational primitive we keep returning to &#8212; across hundreds of customer trajectories and a year of audits &#8212; is a markdown file with a rubric and a bash verification script at the bottom.</p><p>That&#8217;s not a slogan. It&#8217;s what survives load. The sophistication isn&#8217;t in the format. It&#8217;s in what the document encodes: what &#8220;done&#8221; means, how to evaluate against it, what to do when it falls short, when to stop trying. We call this a <a href="https://jetty.io/p/runbooks-for-agents">runbook</a> because the older shape it most resembles is the operations runbook on the wall behind a NOC. But it&#8217;s also a test suite. And a spec. And a quality gate. And a refusal to call something done until it passes.</p><p>The contrast is funny. The industry is racing to build increasingly elaborate agent frameworks &#8212; graph planners, multi-agent orchestrators, memory layers, MCP everything &#8212; while the most reliable primitive in our toolbox is plain text with a checker. Markdown works with every agent. It&#8217;s version-controlled. It&#8217;s diffable. It&#8217;s reviewable in a pull request. It doesn&#8217;t need a runtime. It&#8217;s the lightest possible structure that still encodes the operational discipline a dark factory actually requires.</p><p>The dark factory is coming. The lights are going off in software the way they went off in manufacturing. But Oshino doesn&#8217;t run because the robots are smart. It runs because the deterministic decision layer around them never lets a bad part through.</p><p>If you&#8217;re building the AI version of that &#8212; and if you&#8217;re shipping agents into production, you are, whether you&#8217;re using that vocabulary or not &#8212; the operational half of the stack is your problem. Not Coker&#8217;s category-defining vendor. Not next quarter&#8217;s better orchestration framework. Yours. 
This week.</p><p>You need four pieces. The dashboard, which most teams already have. The runbook, which does the operational lift underneath. The PR, which is the artifact the team reviews and merges. And &#8212; bluntly &#8212; the eval. That&#8217;s your brake pedal.</p><p>Build those four. The rest of the stack can be half-built indefinitely.</p><p>What I keep wondering is where the ceiling is. If a markdown file with a checker is good enough for everything we&#8217;ve thrown at it so far, where does it break &#8212; and what does an agent task look like when plain text and a bash script stop being enough?</p>]]></content:encoded></item><item><title><![CDATA[Research Closes the Loop. Production Keeps Us In It.]]></title><description><![CDATA[Why we kept the outer loop open by design]]></description><link>https://blog.jetty.io/p/research-finishes-the-loop-production</link><guid isPermaLink="false">https://blog.jetty.io/p/research-finishes-the-loop-production</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Fri, 08 May 2026 10:16:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Zx1L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>An ICLR Oral just showed that reflective prompt evolution beats reinforcement learning.</p><p>The paper, <a href="https://arxiv.org/abs/2507.19457">GEPA</a>, describes a search procedure where a language model reads its own rollouts and proposes prompt edits in plain English. The search uses those edits as its mutation operator. The result beats GRPO &#8212; the RL baseline from DeepSeekMath &#8212; across three benchmarks: HotpotQA, AIME, IFBench. Fewer rollouts. No reward model.</p><p>The paper is right. I want to say that first, because most of what follows could read like a disagreement, and it isn&#8217;t. 
The inner generate-judge-refine loop GEPA describes is the same loop a <a href="https://jetty.io/guides/runbooks">Jetty runbook</a> runs on every execution. The mechanism is real and it generalizes. What&#8217;s interesting isn&#8217;t whether GEPA validates the runbook approach. It does. What&#8217;s interesting is the part of the loop GEPA closes that we deliberately keep open.</p><h2>The inner loop is generator-verifier</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zx1L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zx1L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Zx1L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Zx1L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Zx1L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Zx1L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:953390,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/196616983?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zx1L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Zx1L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Zx1L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Zx1L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd569ddba-4849-4359-84cc-1fc9594c5c79_1408x768.jpeg 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>GEPA&#8217;s contribution is the reflection step. A language model reads its rollouts and identifies the failure pattern in natural language. That diagnosis becomes the mutation operator the search uses to propose the next prompt.</p><p>This isn&#8217;t a one-off result. It&#8217;s the third optimizer the DSPy team has shipped, after bootstrap-fewshot and <a href="https://arxiv.org/abs/2406.11695">MIPROv2</a>. The closest methodological cousin is <a href="https://arxiv.org/abs/2406.07496">TextGrad</a> &#8212; backpropagating textual feedback through compound LLM systems. 
The evolutionary ancestor is <a href="https://arxiv.org/abs/2309.16797">Promptbreeder</a>. Three years of lineage, not a single ICLR moment.</p><p>The thing that makes any of it work is generator-verifier asymmetry. <a href="https://lebensold.substack.com/p/generation-got-cheap-verification">I&#8217;ve written about this before</a> &#8212; producing a good answer is hard; recognizing one is cheap; the gap between the two is the only reason iterative loops ever converge. The verifier &#8212; a rubric, a benchmark, a unit test &#8212; has to stay independent of the thing generating candidates. Without that independence, you&#8217;re not climbing a gradient. You&#8217;re staring into a mirror.</p><p>Our runbooks run this exact loop on every execution. The rubric step (&#8220;what does done look like, how do we check&#8221;) is the verifier. The bounded retry budget is the search. The agent reads its rubric failure and tries again. It stops when the rubric clears or the budget runs out. That&#8217;s GEPA&#8217;s inner loop, applied to the artifact instead of the prompt.</p><h2>Two lenses on the outer loop</h2><p>The split between GEPA and runbooks is about what happens <em>across</em> runs, not within one. Research optimizes for &#8220;did the score go up.&#8221; Production optimizes for &#8220;did the right people see the change before it shipped.&#8221;</p><p>In research you have a benchmark. The benchmark is ground truth. Closing the loop by auto-evolving the prompt against rollout data is unambiguously good. The eval signal is trustworthy and the cost of a regression is &#8220;the number went down on a graph.&#8221; However, in production you have a runbook running brand-compliance review on every marketing draft. The rubric is your team&#8217;s judgment compressed into prose. The eval signal is a proxy. 
It encodes what <em>you decided</em> &#8220;good&#8221; meant three months ago.</p><p>Auto-evolving against that proxy means the runbook can drift in a direction nobody on the team noticed. Worse: if the system can rewrite the rubric <em>and</em> optimize against it, the verifier stops being independent of the generator. The asymmetry that made the inner loop work breaks down. You haven&#8217;t built a self-improving system. You&#8217;ve built a self-justifying one.</p><h2>We could automate this. We chose not to.</h2><p>This is the load-bearing point.</p><p>Jetty has all the parts. <a href="https://github.com/jettyio/jettyio-skills">/optimize-runbook</a> already reads trajectories and proposes runbook edits. Routines fire on a cron. Wiring up nightly auto-evolution against last week&#8217;s trajectories is a few config lines. The trajectory storage GEPA&#8217;s approach needs as a substrate is already first-class.</p><p>We didn&#8217;t ship that as the default. What ships is a runbook in git. Someone runs /optimize-runbook when they suspect drift. The diff lands in a PR like any other change. Closed loop available; open loop default.</p><p>The drift risk isn&#8217;t theoretical. Decagon ran GEPA on real customer-service prompts and <a href="https://decagon.ai/blog/optimizing-gepa-for-production">wrote up what happened</a>. Naive runs produced ~5,000-character prompts that overfit to small reflection sets. Smaller reflection models broke outright. The optimal sample range turned out to be 20&#8211;100, not &#8220;more is better.&#8221;</p><p>Their fix amounted to code review for prompts. They added a holdout set so the optimizer couldn&#8217;t lie about its own progress. Length regularization stopped the prompts from sprawling. 
The whole optimization started getting treated like the test-driven engineering loops everyone already trusts for shipping software.</p><p>That&#8217;s the production reality the academic version doesn&#8217;t have to think about.</p><h2>The merge step is the feature</h2><p>Production runbooks aren&#8217;t AIME problems. They&#8217;re operational artifacts that govern real outputs to real people, and the reasons to keep a human in the merge step are the same reasons engineering teams keep humans in the merge step for code.</p><p>Someone has to know the rubric tightened. Otherwise the team is debugging &#8220;why did our outputs change last Tuesday&#8221; with no change to point at. A runbook that silently rewrites itself can&#8217;t be pinned to a sprint or to an experiment, much less an audit window &#8212; you lose the ability to say &#8220;this output came from runbook v1.3.&#8221; When the runbook ships a bad output, &#8220;who changed this and why&#8221; needs an answer. A diff in git with a reviewer attached has that answer. A self-evolving prompt does not.</p><p>This is the <a href="https://lebensold.substack.com/p/foundation-models-ship-like-windows">PR-is-the-product</a> thesis applied to the runbook itself. The runbook diff is the artifact your team reviews. The fact that an agent <em>proposed</em> the diff doesn&#8217;t change who owns the merge.</p><p>Engineers already trust this interface. Auto-formatters write code. Linters fix style. Your test runner can fail a build with no human signoff at all. All closed loops. Auto-merge to main? Not yet. Branch protection, CODEOWNERS, required reviewers &#8212; that&#8217;s the merge-policy interface that makes every other closed loop safe to ship. Without it, every commit is an unsupervised optimization against an untrusted verifier.</p><h2>What&#8217;s actually open</h2><p>GEPA&#8217;s claim isn&#8217;t wrong for its domain. 
If you&#8217;re tuning a prompt against a benchmark with trustworthy ground truth, closing the loop is the right move. The literature is also more contested than the ICLR headline suggests: <a href="https://benanderson.work/blog/contra-dspy-gepa/">Benjamin Anderson&#8217;s structural critique</a> argues that agents don&#8217;t have the locality property that modular optimization assumes.</p><p>The thread to pull on is narrower. Where in the <a href="https://lebensold.substack.com/p/runbooks-what-agents-need-to-hill">runbook lifecycle</a> does the loop close, and where does the human stay? The honest answer depends on the cost of a silent regression. A runbook that drafts internal Slack messages can probably auto-merge proposed changes. A runbook that touches customer comms cannot. The interesting design question isn&#8217;t &#8220;should we automate.&#8221; It&#8217;s: what&#8217;s the merge-policy interface for runbooks, and what does it look like to declare it explicitly?</p>]]></content:encoded></item><item><title><![CDATA[Patterns Were the Map in the Search for Beauty]]></title><description><![CDATA[Christopher Alexander told the patterns community they missed the point. Thirty years later, agents finally let us listen.]]></description><link>https://blog.jetty.io/p/patterns-were-the-map-in-the-search</link><guid isPermaLink="false">https://blog.jetty.io/p/patterns-were-the-map-in-the-search</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Tue, 28 Apr 2026 17:18:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vNFz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In October 1996, Christopher Alexander gave the keynote at OOPSLA &#8212; the conference where object-oriented programmers gathered every year to argue about software design. 
He had been invited as the patron saint of the patterns movement. His 1977 book <em>A Pattern Language</em> was the conceptual foundation the Gang of Four had drawn on for <em>Design Patterns</em> a few years earlier. The whole field was, in a real sense, his.</p><p>He used the keynote to tell them they had missed the point.</p><p>The form of the pattern &#8212; name, context, problem, solution &#8212; was something software had borrowed cleanly. What had been left behind was the <em>purpose</em>. Patterns weren&#8217;t taxonomies. They were instruments for generating a quality Alexander was by then comfortable calling <em>beauty</em>: the thing that makes a courtyard feel inevitable, or a doorway feel right at five in the morning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vNFz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vNFz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vNFz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vNFz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!vNFz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vNFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1123430,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/195773298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vNFz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vNFz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!vNFz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vNFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda08f1a9-95e1-431a-97a8-92887e15d4e4_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>He told the room he hoped they would <a href="https://www.patternlanguage.com/archive/ieee/ieeetext.htm">eventually get 
there</a>. They mostly didn&#8217;t. The community kept refining the form &#8212; anti-patterns, pattern catalogs, books and books of templates &#8212; and Alexander spent the next decade writing <a href="https://www.natureoforder.com/">The Nature of Order</a>, four volumes on the structure of beauty in built systems. The two camps never really reconnected.</p><h2>The thing patterns can&#8217;t capture</h2><p>Engineers spent two decades cataloguing every shape they could find: Singleton, Observer, Strategy, Chain of Responsibility. Enterprise software followed the same logic at the next level up. ERP. CRM. Document management. Workflow engine. Identity provider. Each one a discrete category with its own pattern language and its own vendor.</p><p>Then ask any senior engineer to describe a real production system. You get something stranger.</p><p>A claims-processing system at a regional health insurer is, on inspection, maybe 30% ERP &#8212; it tracks line items, holds money, runs against an accounting period. 40% CRM &#8212; it tracks individuals, decisions about them, communications history. 30% document management &#8212; every claim has scanned attachments, every appeal has a paper trail. The categories vendors use to <em>sell</em> software don&#8217;t describe the categories work <em>operates</em> in.</p><p>Every senior engineer I know has lived this. The vendor sells you the ERP. You implement it. Six months in, the shape of the work isn&#8217;t ERP-shaped &#8212; it&#8217;s something else the vendor&#8217;s data model can&#8217;t quite hold &#8212; and the spreadsheets and Slack channels start filling in the rest.</p><h2>A partition is not a whole</h2><p>It&#8217;s tempting to stop there and say: real systems are blends of categories, and patterns don&#8217;t catch blends. That framing concedes too much. A blend implies a partition &#8212; that the system can be cleanly carved up into 30% of this and 40% of that, summing to 100%. 
It still gives the categories the dignity of being the right axes.</p><p>That&#8217;s not what&#8217;s actually happening. The ERP-ness of the claims system is <em>not separate</em> from the CRM-ness. The line items only make sense alongside the individual whose decision created them. The communications history hangs on the document trail. And the document trail is paper stapled to nothing without the line items it justifies. The categories don&#8217;t slice a pie. They are codependent centers, and the codependence <em>is</em> the system.</p><p>Alexander had a name for this. In <em>The Nature of Order</em> he describes <a href="https://www.natureoforder.com/Book1">fifteen fundamental properties</a> that he claims contribute to the wholeness of a built artifact &#8212; <em>strong centers</em>, <em>deep interlock and ambiguity</em>, <em>not-separateness</em>, among others. Wholeness, in his framework, is what happens when the parts of a system strengthen each other rather than tile. A beautiful courtyard is not a sum of wall and bench and tree. It&#8217;s a configuration where the wall makes the bench feel like a place to sit, the tree makes the wall feel like shelter &#8212; and somehow, the bench makes the tree feel placed rather than planted.</p><p>A claims-processing system is the same kind of thing, when it works. When it doesn&#8217;t, you bought three vendors and integrated them, and you got something that diagrams cleanly and operates badly. The seams show. The work happens in spreadsheets.</p><p>This is why patterns were a dead end. A pattern is, by construction, a separable unit. It can be named, lifted out of context, and reused. That&#8217;s the entire point. 
But the property Alexander cared about &#8212; wholeness, beauty, deep interlock &#8212; is exactly the property that does not survive separation.</p><h2>Map and terrain</h2><p>There&#8217;s a phrase from <a href="https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation">general semantics</a> &#8212; the map is not the territory &#8212; that captures the same point operationally. A pattern is a map. The named, separable thing the catalog points at. The terrain is the system that has to actually run, with all its interpenetrating centers and codependencies and exceptions.</p><p>Workflow software is where this gets expensive. A workflow encodes a map: this step, then that step, then this branch, then that approval. It assumes the work decomposes into known stages in a known order. Where the work fits the map, that&#8217;s industrial-strength leverage. Where it doesn&#8217;t &#8212; where the operation is interpenetrating rather than sequential &#8212; the workflow forces the terrain into the shape of the map. That&#8217;s where you get the joke at every enterprise: there&#8217;s the system everyone officially uses, and the spreadsheet where the work actually gets done.</p><p>Workflow software didn&#8217;t create that gap. It industrialized it. It encoded the map into the system itself, which made a bad map cheaper to enforce and harder to bend.</p><p><a href="https://jasonstanley.substack.com/p/workflows-were-built-for-a-specific">Jason Stanley</a> writes about this in governance terms. He sorts work governance into four modes &#8212; procedural (verify the steps), credentialing (verify the actor), gatekeeping (verify the output), reputational (trust accumulated over time). Alexander&#8217;s own practice was reputational: clients on the Eishin School trusted him because he&#8217;d built things they recognized as alive. That mode doesn&#8217;t transfer to a non-human builder. 
When an agent walks onto the site, governance has to fall back to gatekeeping &#8212; verifying the output instead of trusting the actor.</p><h2>Planting flags</h2><p>Alexander&#8217;s practice on his bigger projects &#8212; the Eishin School outside Tokyo, the Mexicali housing &#8212; was not to draw a blueprint and hand it to builders. He walked the site. He planted flags where buildings would go. The plan came out of the topology, not the other way around. He was, in his own vocabulary, finding the centers &#8212; the places the site <em>wanted</em> a building &#8212; and letting the buildings emerge from there.</p><p>This is the move I think agents finally make available on the software side.</p><p>The way you hand an agent a task whose right answer depends on the specific terrain of your data, your customers, your existing systems is not by writing a workflow. It&#8217;s by giving it a <a href="https://lebensold.substack.com/p/runbooks-what-agents-need-to-hill">runbook</a> &#8212; a quality bar in markdown &#8212; and a rubric that measures the gap between output and bar, then letting it figure out the procedure. The procedural details &#8212; step order, retries, routing &#8212; stop being something a human has to specify in advance. You plant flags. The agent fills in the building. This is the bet I&#8217;ve been making with Jetty.</p><p>A journalist isn&#8217;t measured on her first draft. A designer isn&#8217;t graded on her sketches. They&#8217;re judged on what comes out the other side, against a standard their editor or art director can articulate. We&#8217;ve governed knowledge work this way for a hundred years. We&#8217;ve never been able to govern <em>software systems</em> this way, because the systems couldn&#8217;t read the standard. Now they can.</p><h2>What&#8217;s a beautiful system, then?</h2><p>I don&#8217;t have a clean answer. I keep running into it without recognizing it. 
A workflow that handles the easy 80% and falls back to a runbook for the rest. An evaluation pipeline whose rubric makes the codependence of the centers visible. An agent that produces a pull request that fits the codebase as if it had always been there. None specified by category. All specified by outcome.</p><p>Markdown rubrics work for a single task. The honest open question is what the outcome specification looks like at the level of a whole system. A whole organization.</p><p>I suspect Alexander would say it&#8217;s some version of beauty, and that we&#8217;d recognize it when we saw it. The patterns weren&#8217;t going to get us there. We&#8217;ve had thirty years to find out.</p>]]></content:encoded></item><item><title><![CDATA[My Backend is 442 Lines of Markdown]]></title><description><![CDATA[We shipped a web app whose entire backend is a structured document]]></description><link>https://blog.jetty.io/p/my-backend-is-442-lines-of-markdown</link><guid isPermaLink="false">https://blog.jetty.io/p/my-backend-is-442-lines-of-markdown</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Tue, 21 Apr 2026 03:37:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!S2OW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few weeks ago we shipped <a href="https://mlcroissant.jetty.bot">mlcroissant.jetty.bot</a> &#8212; paste in a URL to an academic dataset paper, and get back an <a href="https://mlcommons.org/croissant/">MLCommons Croissant</a> metadata file. It extracts the dataset&#8217;s provenance, fields, and license, then packages everything as machine-readable JSON-LD. The backend is a 442-line markdown file. No pipeline code. No DAG. No orchestrator. One structured document telling an agent what to do, plus a single API call to run it. 
The runbook is public at <a href="https://mlcroissant.jetty.bot/runbook">mlcroissant.jetty.bot/runbook</a>. The repo is at <a href="http://github.com/jettyio/pdf2croissant">github.com/jettyio/pdf2croissant</a>.</p><p><em>Run returned &#8220;completed&#8221; but produced no files.</em></p><p>The runbook was correct. The plumbing was broken. I couldn&#8217;t see which until I had the trajectory data showing exactly where the agent stopped. That failure taught me more about writing reliable agent instructions than any blog post about agent reliability ever has.</p><p>I&#8217;ve written before about <a href="https://lebensold.substack.com/p/runbooks-what-agents-need-to-hill">runbooks as the missing layer</a> between &#8220;call this API&#8221; and &#8220;accomplish this outcome&#8221; &#8212; the structured document you&#8217;d write for a competent new hire who needs to run your pipeline while you&#8217;re on vacation. This is what that looks like when you actually build something with one, failures included.</p><h2>353 lines was not enough</h2><p>The first version ran to 353 lines. The agent would read the paper, sometimes generate valid JSON-LD, and declare victory. One output file out of three. Sometimes zero. 
Status: completed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S2OW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S2OW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!S2OW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!S2OW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!S2OW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S2OW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1005522,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/194868765?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S2OW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!S2OW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!S2OW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!S2OW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288df42c-9025-4f5f-a6d6-65291c449e48_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p></p><p><strong>Iteration 1:</strong> I added a verification script. After generating the Croissant file, the agent runs bash that checks: does the file exist? Is it valid JSON? Does it schema-validate? Each check prints PASS or FAIL.</p><p>The agent would check its work, see FAIL printed right there, and declare success anyway.</p><p><strong>Iteration 2 felt undignified but worked.</strong> I added aggressive mandate language: &#8220;MANDATORY &#8212; do not skip. You MUST produce all output files. No exceptions.&#8221; This sounds like yelling at a computer. Mechanically, though, it&#8217;s not about tone &#8212; it&#8217;s about probability mass. Agents sample from a distribution of reasonable next actions. Stronger language shifts the distribution. 
&#8220;Declare victory despite the FAIL line&#8221; becomes less probable; &#8220;fix the problem&#8221; becomes more probable.</p><p><strong>Iteration 3:</strong> I converted prose parameter descriptions to structured tables. The first version had a paragraph describing what a Croissant file should contain. The agent would fill in what it could find and stop when the description got vague. The table version listed every field explicitly. No ambiguity left to exploit.</p><p>Here&#8217;s an example output from <a href="https://mlcroissant.jetty.bot/run/a23384eb">the runbook transforming CoralVQA into Croissant</a>. <em>The task didn&#8217;t get harder; the instructions got more precise.</em></p><h2>The meta-loop</h2><p>After enough runs to see failure patterns, I started using Jetty&#8217;s <code>/optimize-runbook</code> command to accelerate the cycle.</p><p><code>/optimize-runbook</code> reads execution trajectories from previous runs and proposes targeted changes to the runbook &#8212; not abstract suggestions but specific findings: &#8220;4 out of 7 runs failed at schema validation because the agent omitted <code>containedIn</code> on FileSet objects. Add it to the Common Fixes table.&#8221; Or: &#8220;There&#8217;s a nested shell quoting bug in the verification script. 
The agent&#8217;s <code>jq</code> call fails when the dataset name contains a comma.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FKOG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FKOG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 424w, https://substackcdn.com/image/fetch/$s_!FKOG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 848w, https://substackcdn.com/image/fetch/$s_!FKOG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 1272w, https://substackcdn.com/image/fetch/$s_!FKOG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FKOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png" width="1448" height="2064" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2064,&quot;width&quot;:1448,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:283560,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/194868765?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FKOG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 424w, https://substackcdn.com/image/fetch/$s_!FKOG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 848w, https://substackcdn.com/image/fetch/$s_!FKOG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 1272w, https://substackcdn.com/image/fetch/$s_!FKOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4625ca9-64eb-4afd-abc6-c52169c02bac_1448x2064.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">The MLCommons Croissant Runbook&#8212;taking PDFs and turning them into JSON</figcaption></figure></div><p></p><p>Instead of noodling through logs for thirty minutes, the skill found it in trajectory data in seconds.</p><p>What struck me is that this loop is structurally identical to what the runbook asks the agent to do: run, evaluate against criteria, identify the weakest point, make targeted fixes, run again. The difference is just which landscape you&#8217;re on. The agent is hill-climbing the output quality. I&#8217;m hill-climbing the instructions.</p><p>Same algorithm, one level up.</p><h2>The responsible AI angle</h2><p>Here&#8217;s why I think this matters beyond the exercise of building a web app with a markdown backend.</p><p>Most ML datasets ship with a PDF and maybe a HuggingFace card. Human-readable. 
Not machine-readable. If you want to audit what a model was trained on &#8212; provenance, licensing, known limitations, bias documentation &#8212; and the answer lives in hundreds of PDFs in different formats, you can&#8217;t do that audit at scale. You can gesture at it.</p><p>The <a href="https://artificialintelligenceact.eu/">EU AI Act</a> requires documentation of training data for high-risk AI systems. The <a href="https://www.nist.gov/artificial-intelligence/ai-risk-management-framework">NIST AI RMF</a> points in the same direction. Structured dataset documentation is becoming a compliance requirement, not just good practice. Croissant is the format that makes that compliance automatable rather than a documentation project someone has to own and maintain by hand.</p><p>The gap is real. Tens of thousands of dataset papers. Almost none of them have Croissant files. The research exists; the structured representation of it mostly doesn&#8217;t. As I&#8217;ve argued in <a href="https://lebensold.substack.com/p/foundation-models-ship-like-windows">my post on CI for AI</a>, the infrastructure for treating AI systems rigorously has to exist before you can use it &#8212; and machine-readable dataset metadata is about as foundational as it gets.</p><h2>The scale question</h2><p>Now I&#8217;m wondering how we can efficiently process entire conference proceedings; 3,000+ academic papers and datasets become a single data point for structured question answering. That&#8217;s the next experiment I&#8217;m thinking about.</p><h2>The model arbitrage opportunity</h2><p>The final, and perhaps most valuable, angle to this approach is model arbitrage. By defining the agent&#8217;s task in a highly structured, portable runbook (a markdown file, in this case), we decouple the instruction from the specific agent or model that executes it.</p><p>The <strong>RUNBOOK.md</strong> becomes the singular, high-quality asset. 
I used a capable, often more expensive model to <em>write</em> and <em>optimize</em> the runbook&#8212;leveraging its superior reasoning to find failure patterns and refine the instructions over dozens of runs. This is the <strong>high-cost, high-value authoring phase</strong>.</p><p>Once the runbook is reliable, as our 442-line version is, the execution cost plummets. The refined runbook can be tested, evaluated, and executed by a smaller, cheaper, and faster model or agent that can simply follow the explicit, crystal-clear instructions. The runbook&#8217;s aggressive mandate language, structured tables, and explicit verification steps remove the need for constant high-level reasoning. The expensive model creates the reliable path; the cheaper model walks it.</p><p>This enables a clear model arbitrage strategy: <strong>pay for peak intelligence once to create the robust instructional layer, and then deploy for a fraction of the cost</strong> across thousands of execution runs, achieving reliability without the continuous expense of a top-tier model. 
It turns the runbook into a universal instruction set, allowing us to swap the underlying agent&#8212;the model&#8212;as new, cheaper, or more specialized options become available.</p>]]></content:encoded></item><item><title><![CDATA[The Jagged Frontier Is an Evaluation Problem]]></title><description><![CDATA[Why your AI system breaks in ways your evals won't catch]]></description><link>https://blog.jetty.io/p/the-jagged-frontier-is-an-evaluation</link><guid isPermaLink="false">https://blog.jetty.io/p/the-jagged-frontier-is-an-evaluation</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Mon, 13 Apr 2026 21:14:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MU6A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was listening to The Economist&#8217;s <a href="https://shows.acast.com/boss-class">Boss Class podcast</a> a few months ago when I heard Ethan Mollick describe something I&#8217;d seen a dozen times but never had a name for. He called it <a href="https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the-jagged">the jagged frontier</a>: the idea that AI capabilities don&#8217;t advance in a smooth, predictable line. A system that can write a compelling 2,000-word analysis of a balance sheet might fail to add up a column of five numbers. Not because it&#8217;s generally capable or generally incapable. Because its competence boundary is jagged. Full of unexpected peaks and cliffs.</p><p>I immediately thought of a team I&#8217;d worked with that built a financial analysis tool. Impressive thing &#8212; it could compare revenue multiples across a peer group, flag accounting anomalies, summarize MD&amp;A sections in plain English. Genuinely useful. Then someone asked it to add up the quarterly earnings figures it had just pulled. 
It got the wrong answer. Not by a little. By a lot. The model had no idea why this was embarrassing.</p><h2>Smooth vs. jagged</h2><p>When humans have expertise, that expertise tends to be correlated. An accountant who&#8217;s excellent at financial analysis is probably decent at arithmetic. Not because accounting requires arithmetic (it does), but because the skills develop together, from the same training, in the same career. Competence in one area is a weak signal of competence in adjacent areas.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MU6A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MU6A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MU6A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MU6A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MU6A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MU6A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MU6A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MU6A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MU6A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MU6A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0789f1-f889-4c92-8f0d-ceb6ccffac43_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>AI systems don&#8217;t work this way. They have no such correlation. A model trained on vast amounts of financial text learns comparative analysis from millions of examples of comparative analysis. Whether it can add numbers depends entirely on whether the training process developed that particular capability &#8212; which is a separate question, running on a separate axis.</p><p>This is counterintuitive in a specific, dangerous way. When you hire a financial analyst and they produce excellent comparative analysis, you trust their arithmetic. That trust is almost always warranted. When you deploy an AI system that produces excellent comparative analysis, you might make the same inference. That inference is not warranted. 
And you won&#8217;t know it isn&#8217;t warranted until something goes wrong.</p><h2>The BCG study</h2><p>In 2023, Ethan Mollick and co-authors from Harvard Business School, Wharton, and other institutions <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321">ran an experiment with 758 BCG consultants</a>. Pre-registered, large sample &#8212; the kind of study that Hacker News usually dismisses and couldn&#8217;t in this case. Consultants were assigned tasks either inside or outside AI&#8217;s frontier.</p><p>Inside the frontier: <strong>AI-assisted consultants produced work that was about 40% higher quality</strong>. They finished 25% faster. The productivity gains were real and substantial.</p><p>Outside the frontier &#8212; tasks that looked superficially similar but fell outside what the AI could actually do well &#8212; a different story. AI-assisted consultants were 19 percentage points less likely to produce a correct solution than their unassisted counterparts. Despite working faster. The tool was confidently wrong in ways the consultants didn&#8217;t catch, because the outputs looked like outputs that should be right.</p><p>That&#8217;s the jagged frontier in a controlled experiment. The <strong>productivity gain and the quality loss exist simultaneously</strong>, for different tasks, in the same system.</p><h2>Three zones, not two</h2><p>The intuitive mental model for AI deployment is binary: tasks where AI helps, tasks where it doesn&#8217;t. Deploy where it helps, don&#8217;t deploy where it doesn&#8217;t. But the BCG study points to a third zone that&#8217;s more dangerous than either. It&#8217;s neither &#8220;AI outperforms humans&#8221; nor &#8220;humans outperform AI&#8221;: it&#8217;s the set of tasks nobody thought to measure, where AI fails in ways nobody anticipated. Tasks that feel similar to the ones AI is good at. Tasks the team didn&#8217;t benchmark because they assumed they were covered. Tasks that only surface as failures once the system is in production.</p><p>This is the zone that catches teams off guard. 
Not the obvious failures. The non-obvious ones &#8212; the tasks that sit adjacent to your evals, just outside the mapped territory, invisible until a user finds them.</p><p>I think about the car wash story that circulated a while back. Someone asked an AI assistant to navigate them to the nearest car wash, and the model gave walking directions. Fifty meters. The model couldn&#8217;t reason about the context &#8212; that someone asking about a car wash is almost certainly in a car, and nobody walks their car to a car wash. To the model, it was a navigation question. It answered the navigation question. It was completely wrong about what the user actually needed.</p><p>It&#8217;s funny as an anecdote. It&#8217;s less funny when your customer support agent, your document processor, or your financial analysis tool does the same thing to real users.</p><h2>Why this is an evaluation problem</h2><p>Here&#8217;s the structural issue. The third zone &#8212; the unmapped dangerous zone &#8212; is invisible to standard evaluation because you don&#8217;t know to test for it.</p><p>Most teams build their eval sets before deployment. You think about what the system should do, you write test cases, you run them against the system, you measure quality. What you&#8217;re measuring is performance on the tasks you thought of. The jagged frontier bites you on the tasks you didn&#8217;t think of.</p><p>This is another reason why static gold datasets fail &#8212; I&#8217;ve written about this before. Gold datasets can only cover the frontier as it existed when you built them. But the frontier shifts with every model update. A task that was inside your capability boundary with GPT-4 might be outside it with GPT-4o. Or vice versa. <a href="https://lebensold.substack.com/p/stop-building-against-gold-datasets">Your static benchmarks tell you nothing about the new cliffs</a>.</p><p>And it gets worse with long-running workflows. 
Atomic tasks are easier to evaluate &#8212; you get an output, you check it. Multi-step workflows can look fine at every intermediate stage and fail at the aggregate. The jagged frontier can bite you at step three of a six-step process, and you won&#8217;t see it in your step-level metrics. You&#8217;ll see it in your support tickets.</p><h2>Example: call centers</h2><p>Call centers have been dealing with a version of this problem for decades. Not AI, but agents &#8212; human ones &#8212; whose competence is inconsistent and whose performance needs to be measured task by task, not just on average.</p><p>Good call center operations don&#8217;t measure &#8220;are our agents generally good.&#8221; They measure resolution rate per call type, handle time per call type, customer satisfaction per call type, escalation rate per call type. Per agent. Per shift. Per product line. They know, with specificity, which agents handle billing disputes well and which ones need support for technical escalations. The evaluation infrastructure is granular enough to find the edges.</p><p>This is what good AI evaluation looks like. Not &#8220;our system is generally performing at 87% quality&#8221; &#8212; that number hides the jagged frontier. You need quality broken down by task type, by input category, by workflow stage. You need to know where the peaks are and where the cliffs are. And you need to update that map every time the system changes, because the frontier shifts.</p><p>The challenge is that most teams don&#8217;t have this infrastructure. They have aggregate metrics. Maybe a few spot checks. An eval set that hasn&#8217;t been refreshed in three months. 
When a failure surfaces, they discover it the way call centers discovered problems before they had analytics &#8212; through a supervisor listening in on a bad call, or through a customer who complained loudly enough.</p><h2>What grounded evaluation requires</h2><p>Mollick uses the phrase <a href="https://www.oneusefulthing.org/p/management-as-ai-superpower">&#8220;grounded quality definitions&#8221;</a> in his work on management as an AI superpower, and it&#8217;s worth unpacking. A grounded definition ties quality to real outcomes, not to the output&#8217;s surface characteristics. Not &#8220;did the response sound confident&#8221; or &#8220;was the response coherent&#8221; but &#8220;did the user accomplish what they were trying to accomplish&#8221; or &#8220;was the calculation correct.&#8221;</p><p>This turns out to be harder than it sounds, and it&#8217;s where most eval infrastructure falls down. It&#8217;s easy to build evals that measure proxy metrics &#8212; coherence, length, similarity to a reference answer. It&#8217;s harder to build evals that measure whether the system is actually doing the right thing.</p><p>Two requirements that I keep coming back to. <strong>Quality has to be grounded</strong>: tied to real outcomes, not proxy signals. And <strong>task completion has to be verifiable</strong>: you can actually check whether it happened, independent of the system&#8217;s confidence. A financial analysis is verifiable if you can check the arithmetic. A navigation response is verifiable if you can test whether the route makes sense for someone in a car. A customer support response is verifiable if you can check whether the underlying issue got resolved.</p><p>Where these two conditions hold, you can build real evaluation infrastructure. 
Where they don&#8217;t &#8212; where quality is subjective or completion is hard to verify &#8212; you&#8217;re flying blind, and the jagged frontier will find the edges of your understanding before your evals do.</p><h2>The frontier shifts</h2><p>One more thing that makes this hard: the frontier isn&#8217;t static.</p><p>Every model update moves it. Some capabilities improve. Some degrade. Some new failure modes appear that didn&#8217;t exist before. The BCG study was run at a specific point in time with specific models. The specific numbers (a 40% quality improvement, a 19-percentage-point drop in correct solutions) are already dated. The structural insight &#8212; that AI performance is jagged, with benefits and risks that aren&#8217;t correlated &#8212; will remain true for the foreseeable future, even as the specific contours of the frontier change.</p><p>This is why evaluation can&#8217;t be a one-time activity. You can&#8217;t benchmark your system in January, deploy it, and assume the frontier stays put. It doesn&#8217;t. Model updates are the obvious trigger, but user behavior drifts too &#8212; the distribution of inputs your system sees in month six is different from what it saw in month one. The task mix shifts. New use cases get discovered. What was inside the frontier may now be outside it, and vice versa.</p><p><a href="https://www.jetty.io">With Jetty</a>, we&#8217;re tackling this with agent runbooks, but we&#8217;re not alone in trying to build agentic hill-climbing as a service.</p><p>The &#8220;fix everything and measure once&#8221; assumption is exactly the wrong model. The right model is continuous: <a href="https://lebensold.substack.com/p/ai-breaks-the-two-assumptions-behind">define quality, measure it, improve it, measure again</a>. 
A living process, not a checkpoint.</p><p>The question worth sitting with: if the jagged frontier is real, and measurable, and shifts with every model update &#8212; what would it actually take to map it before your users discover the cliffs for you?</p><p>I don&#8217;t think the answer is more dashboards. Dashboards show you aggregate metrics on the tasks you&#8217;re already measuring. The dangerous zone is the unmapped territory. Finding it requires deliberately probing the edges &#8212; constructing evals for tasks you think are covered but haven&#8217;t tested, running them when models update, building the granular per-task quality infrastructure that call centers take for granted.</p><p>It&#8217;s boring work. It&#8217;s not a new feature. Nobody&#8217;s going to celebrate the eval suite you built. But it&#8217;s the difference between knowing your system&#8217;s frontier and hoping you&#8217;ve been lucky about where the cliffs are.</p>]]></content:encoded></item><item><title><![CDATA[Visual workflows are procedural programming in a costume]]></title><description><![CDATA[Why outcome specs beat node graphs in production]]></description><link>https://blog.jetty.io/p/visual-workflows-are-procedural-programming</link><guid isPermaLink="false">https://blog.jetty.io/p/visual-workflows-are-procedural-programming</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Tue, 31 Mar 2026 13:56:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CXcS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m always blown away looking at an agent described as a visual workflow. The boxes and lines expanding and collapsing are a joy to witness, and they scratch the same part of my brain that obsessed over tech trees in real-time strategy games. 
A system that answers questions is defined as dozens of lines and boxes that culminate in a summary in a markdown file.</p><p>By the time it&#8217;s built, the canvas looks like a circuit board designed by someone having a bad day. Forty-something nodes. Arrows crossing arrows.</p><p>You can see everything while simultaneously understanding almost nothing.</p><h2>Boxes and arrows are just code in a costume</h2><p>Here&#8217;s something that took me too long to realize about visual AI workflow builders. Strip away the drag-and-drop interface and what&#8217;s underneath is a control flow graph. Boxes are functions. Arrows are return values. Conditionals are if-statements. Retry loops are while-loops.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CXcS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CXcS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CXcS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CXcS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!CXcS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CXcS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1095531,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/192475698?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CXcS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CXcS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!CXcS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CXcS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bd15e7d-8cfa-4bca-92ae-f78ddba098d9_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It&#8217;s procedural programming. 
The same paradigm as Python or TypeScript, just rendered as a diagram instead of text.</p><p>This matters because the visual layer doesn&#8217;t change the paradigm. It changes the <em>representation</em>. And the new representation is worse in almost every way that matters for production systems: harder to version control, harder to diff, harder to review, harder to refactor, harder to compose.</p><p>Over 25 years ago, L. Peter Deutsch watched a talk on visual programming and made an observation that became known as the <a href="https://en.wikipedia.org/wiki/Deutsch_limit">Deutsch Limit</a>: &#8220;The problem with visual programming is that you can&#8217;t have more than 50 visual primitives on the screen at the same time. How are you going to write an operating system?&#8221;</p><p>He was talking about general-purpose software. AI workflows hit that limit much faster. A moderately complex agent with tool calls, branching logic, evaluation steps, retry loops, and error handling blows past 50 nodes before you&#8217;ve handled the happy path. The visual representation that was supposed to make the system legible makes it less legible. The diagram becomes the thing you need a diagram to explain.</p><h2>The shift nobody is talking about</h2><p>The software industry has been through this exact transition before. More than once.</p><p>When you write SELECT * FROM orders WHERE total &gt; 100, you don&#8217;t specify which index to use. You don&#8217;t tell the database whether to do a sequential scan or an index scan. You don&#8217;t manage memory allocation or disk I/O. You describe the result you want, and the query optimizer &#8212; decades of engineering condensed into a planner &#8212; figures out the execution path. This is what makes SQL so durable. The same query runs on SQLite and on a distributed Snowflake cluster. The <em>what</em> stays the same. 
The <em>how</em> adapts to the context.</p><p><a href="https://www.terraform.io/">Terraform</a> did the same thing for infrastructure. Before Terraform, deploying infrastructure meant writing imperative scripts: create this server, configure this load balancer, attach this security group, in this order, and hope nothing fails halfway through. Terraform replaced all of that with a <a href="https://developer.hashicorp.com/terraform/language">declaration</a>: &#8220;I want 5 instances behind a load balancer in us-east-1.&#8221; The system reads your desired state, compares it to reality, and converges.</p><p><a href="https://kubernetes.io/">Kubernetes</a> did it for container orchestration. &#8220;I want 3 replicas of this service, always.&#8221; Not &#8220;launch a container, check if it&#8217;s healthy, restart it if it crashes, scale up if load increases.&#8221; You declare the outcome. The system maintains it.</p><p>Every one of these transitions followed the same arc: procedural tools that worked fine at small scale became unmanageable as complexity grew. Declarative tools replaced them not by doing the same thing with a nicer interface, but by operating at a different level of abstraction.</p><p>What does this shift look like in AI evaluation? I think it&#8217;s happening right now, and most people are building on the wrong side of it.</p><h2>The plumbing problem</h2><p>Now look at how visual AI workflow tools handle evaluation. Take a concrete example: <a href="https://docs.vellum.ai/product/workflows/common-architectures">Vellum&#8217;s recommended architecture</a> for a RAG evaluation loop. You wire a Prompt Node to a <a href="https://docs.vellum.ai/product/workflows/nodes/guardrail-node">Guardrail Node</a> that runs an evaluation metric at runtime &#8212; say, Ragas Faithfulness. If the score is below your threshold, you route the failure path through a Conditional Node back to the Prompt Node for another attempt. 
You need a Templating Node to track the Prompt Node&#8217;s execution counter so the loop doesn&#8217;t run forever. You add a second Conditional branch for when retries are exhausted. You add a Try adornment on the Prompt Node to expose an error output for non-deterministic failures. You wire the success path through a Merge Node to a Final Output Node.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G2oZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G2oZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 424w, https://substackcdn.com/image/fetch/$s_!G2oZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 848w, https://substackcdn.com/image/fetch/$s_!G2oZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!G2oZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G2oZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg" width="1260" height="848" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:848,&quot;width&quot;:1260,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G2oZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 424w, https://substackcdn.com/image/fetch/$s_!G2oZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 848w, https://substackcdn.com/image/fetch/$s_!G2oZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!G2oZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3154ffd7-907e-4eb6-9803-c1953f742a6b_1260x848.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.fiverr.com/moinkhan40/n8n-ai-agent-n8n-automation-n8n-workflow-ai-agent-n8n-ai-automation-n8n-expert">Example of</a> an N8N workflow for scheduling a calendar invite. </figcaption></figure></div><p>That&#8217;s <a href="https://docs.vellum.ai/product/workflows/nodes/overview">seven node types</a> &#8212; Prompt, Guardrail, Conditional, Templating, Merge, Final Output, and Error &#8212; plus adornments, for a single evaluate-and-retry loop on a single output. And this is the <em>recommended</em> pattern. If you want to evaluate multiple dimensions (factual accuracy <em>and</em> tone <em>and</em> completeness), each dimension needs its own Guardrail Node, its own branch, its own merge. The graph doesn&#8217;t grow linearly. It grows combinatorially.</p><p>Other tools have the same problem. 
On one community forum, a user <a href="https://community.n8n.io/t/how-to-implement-a-conditional-while-loop-in-n8n-workflow-without-llm-ai-agent-node/120438">asked how to implement a conditional while loop</a> &#8212; a basic evaluation retry. The answer: chain a Loop node to an IF node to a Wait node, wire the true path back to the loop, and &#8220;be very careful on trigger infinite loop.&#8221; That&#8217;s the level of primitive you&#8217;re working with.</p><p>You&#8217;ve specified the procedure, step by excruciating step. If you want to change the evaluation criteria, you&#8217;re re-wiring nodes. If you want to add a rubric dimension, you&#8217;re adding boxes and arrows. If you want to change the retry strategy, you&#8217;re restructuring the graph.</p><p>That&#8217;s not evaluation. That&#8217;s plumbing. And just like imperative deployment scripts before Terraform, the plumbing obscures the thing that actually matters: <em>what does &#8220;good&#8221; look like?</em></p><h2>What declarative evaluation looks like</h2><p>Here&#8217;s an evaluation specification from a <a href="https://docs.jetty.io/docs/guides/writing-runbooks">Jetty runbook</a> I use in production:</p><blockquote><p>The output must contain these 7 files. Each file must pass schema validation. The narrative summary must score 7+ against these rubric dimensions: factual accuracy, clinical relevance, actionable recommendations, appropriate tone. If it scores below 7, identify the weakest dimension, revise, and re-evaluate. Maximum 3 iterations.</p></blockquote><p>That&#8217;s the entire evaluation logic. Twelve lines of markdown. No boxes. No arrows. No retry-loop wiring. No conditional branches.</p><p>The agent reads this and does what a competent human would: produces the output, checks it against the criteria, fixes what&#8217;s weak, and iterates until the bar is met or the budget is exhausted. The <em>what</em> is specified precisely. 
The <em>how</em> is left to the executor.</p><p>Compare this to the visual workflow version of the same logic. I counted the nodes in a visual implementation of this evaluation loop: 23 nodes, 31 connections, and I still hadn&#8217;t handled the case where schema validation fails on a different file than the one the rubric scored lowest.</p><p>The declarative version is not just shorter. It&#8217;s a different kind of thing. It&#8217;s a specification, not a procedure. And specifications have properties that procedures don&#8217;t.</p><h2>Specifications compose. Procedures tangle.</h2><p>Want to add a new rubric dimension to a declarative evaluation? Add a line. Want to change the iteration limit? Change a number. Want to swap the underlying model? Change a parameter. Want to apply the same evaluation criteria to a different pipeline? Copy the specification and change the input path.</p><p>Try any of these in a visual workflow builder. Adding a rubric dimension means adding evaluation nodes, re-wiring branches, and testing the new graph. Changing the iteration limit means restructuring the retry loop. Swapping the model might mean replacing nodes that have different input/output schemas. Applying the evaluation to a different pipeline means... rebuilding the workflow from scratch, because the node graph doesn&#8217;t separate the evaluation logic from the execution context.</p><p>This is why nobody writes Terraform in a drag-and-drop GUI. The text representation is more powerful, not less. Not because text is inherently better than graphics &#8212; but because declarative text separates <em>what</em> from <em>how</em> in a way that visual procedures cannot.</p><p>SQL views compose because each one is a self-contained query with declared dependencies. Terraform modules compose because each one is a self-contained state declaration. Kubernetes manifests compose because each one is a self-contained desired-state spec. 
Evaluation specifications compose because each one is a self-contained quality bar.</p><p>Visual workflow nodes don&#8217;t compose. They <em>connect</em>. And connections are fragile. Move one node and three arrows break.</p><h2>The version control problem is fatal</h2><p>Here&#8217;s the tell that the visual approach has a fundamental problem: every serious visual workflow tool eventually builds a text-based SDK.</p><p>One platform launched a <a href="https://www.vellum.ai/blog/vellum-workflows-sdk-is-generally-available">Workflows SDK</a> with a CLI that does &#8220;bi-directional syncing&#8221; between the canvas and code. Another open-sourced a text format alongside its visual editor. A third supports workflow-as-code. They all end up in the same place: acknowledging that the visual representation isn&#8217;t the source of truth. It&#8217;s a <em>rendering</em> of the source of truth. And the source of truth is text.</p><p>The reason is version control. Visual workflows serialize to JSON blobs &#8212; hundreds or thousands of lines of auto-generated coordinates, node IDs, and connection metadata. You cannot meaningfully diff them. You cannot code review them. You cannot grep for the line where the evaluation threshold is defined. When two people edit the same workflow, you cannot merge their changes.</p><p><strong>The right abstraction to visualize is </strong><em><strong>outcomes</strong></em><strong>:</strong> what does the evaluation measure, what&#8217;s the quality bar, and what are the iteration bounds. The procedure is an implementation detail the system should handle, the same way a SQL query optimizer handles join ordering.</p><p><a href="https://www.promptfoo.dev/">Promptfoo</a> seems to get this. Its evaluation framework uses declarative YAML &#8212; you specify assertions (string matching, LLM-as-judge rubrics, schema validation) and it handles execution. It&#8217;s closer to a testing DSL than a visual workflow. 
That&#8217;s the direction the ecosystem should be heading.</p><h2>The parallel to testing</h2><p>This connects to something I&#8217;ve been writing about for a while. CI/CD worked because it made every change small, testable, and reviewable. The CI loop depends on artifacts that diff cleanly: code in text files, tests in text files, configuration in text files.</p><p>Visual workflow builders fight this at every level. Changes don&#8217;t diff. Reviews are &#8220;look at my screen and tell me if this graph looks right.&#8221; Rollbacks mean &#8220;restore the previous version of an opaque JSON blob.&#8221; The very properties that made CI/CD transformative &#8212; small changes, clean diffs, automated testing, code review &#8212; are the ones that visual workflows undermine.</p><p>Declarative evaluation aligns with CI because it produces the same kind of artifacts. A runbook is a text file. It diffs. It reviews. It lives in a PR. When you change the evaluation criteria, the diff shows exactly what changed and nothing else. When you add a rubric dimension, the reviewer can assess whether it makes sense without understanding the execution plumbing.</p><p>This is the same advantage Terraform has over bash deployment scripts, and SQL has over hand-rolled data-processing code. The artifact is reviewable because it describes intent, not mechanism.</p><p>Procedural visual tools will always address an important segment of the market. Some tasks genuinely need step-by-step control &#8212; data plumbing, API integrations, deterministic pipelines where every step is known in advance. That&#8217;s real, and visual builders serve it well.</p><p>But most AI tasks aren&#8217;t like that. Most AI tasks are measured by their <em>outcome</em>, not by whether you followed the right steps. Did the summary capture the key points? Did the generated image match the brand guidelines? Did the evaluation catch the regression? These are outcome specifications, not procedure specifications. 
It&#8217;s the difference between listing the ingredients for a dish and prescribing exactly which store to visit, which aisle to walk down, and which hand to reach with.</p><p>The most sophisticated agent orchestration I&#8217;ve seen isn&#8217;t a graph. It&#8217;s a markdown file with a rubric, an output manifest, and a verification script. It version-controls perfectly. It diffs in a PR. It&#8217;s human-readable and machine-executable. It composes by copy-paste and edit, the way SQL views do.</p><p>The format sounds primitive, but so did SELECT * FROM when compared to a visual query builder. The power was never in the interface. It was in the abstraction.</p><h2>Where visual workflows make sense</h2><p>It&#8217;s not all bad! There are some cases where a visual workflow tool can be a huge productivity gain. <a href="https://cycling74.com/products/max">Max/MSP is a fantastic example</a> of how a visual workflow tool can actually increase visibility into what&#8217;s going on under the hood. Tools like After Effects, Blender, and many shader mapping interfaces for game designers are also good examples. 
But I think in each of these instances, an expert has to be prepared to learn and adopt the visual programming environment wholesale&#8212;and they still face an upper bound in terms of how much complexity any of these visualizations can communicate to humans.</p>]]></content:encoded></item><item><title><![CDATA[Runbooks: what agents need to hill-climb]]></title><description><![CDATA[The Missing Layer Between &#8220;Call This API&#8221; and &#8220;Accomplish This Outcome&#8221;]]></description><link>https://blog.jetty.io/p/runbooks-what-agents-need-to-hill</link><guid isPermaLink="false">https://blog.jetty.io/p/runbooks-what-agents-need-to-hill</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Fri, 27 Mar 2026 01:59:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sc53!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last month I watched an agent run a six-step evaluation pipeline. It called the right APIs. It generated SQL that was mostly correct. It even caught a schema error and fixed it on the second try. Then it wrote a summary, declared the task complete, and stopped.</p><p>It had skipped two of the six steps entirely. The output directory was missing three of five required files. The summary confidently described results from steps that never ran.</p><p>This is the failure mode nobody talks about. Not &#8220;the agent can&#8217;t do the task.&#8221; The agent <em>can</em> do the task. It just doesn&#8217;t finish it. It encounters an error on step four, routes around it, produces whatever it can, and wraps up with the confidence of someone who definitely completed all the work.</p><p>If you&#8217;ve built anything non-trivial with coding agents, you&#8217;ve seen this. The agent is capable but unreliable. 
It needs something between a one-line instruction and a shell script. It needs what I&#8217;ve started calling a runbook.</p><h2>Skills, workflows, and the gap between them</h2><p>Most agent tooling falls into two buckets.</p><p><strong>Skills</strong> are single-turn instructions. &#8220;Here&#8217;s how to call the Jetty API.&#8221; &#8220;Here&#8217;s how to query Snowflake.&#8221; They&#8217;re reference cards. Useful, but they don&#8217;t handle multi-step processes where the output of step three determines what you do in step four.</p><p><strong>Workflows</strong> are fixed pipelines. Step A feeds step B feeds step C. They&#8217;re deterministic, which is their strength and their limitation. When the task requires judgment &#8212; &#8220;is this SQL output correct?&#8221; or &#8220;does this image match the brand guidelines?&#8221; &#8212; a workflow can&#8217;t adapt.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YHRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YHRR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 424w, https://substackcdn.com/image/fetch/$s_!YHRR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 848w, https://substackcdn.com/image/fetch/$s_!YHRR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 1272w, 
https://substackcdn.com/image/fetch/$s_!YHRR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YHRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png" width="1440" height="752" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/192271201?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YHRR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 424w, https://substackcdn.com/image/fetch/$s_!YHRR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 848w, 
https://substackcdn.com/image/fetch/$s_!YHRR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 1272w, https://substackcdn.com/image/fetch/$s_!YHRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e5bf17f-f72e-420e-90e9-b1b4cd3f4903_1440x752.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>That&#8217;s not a skill. It&#8217;s not a workflow. 
It&#8217;s a process that requires both execution and judgment, with the ability to recover when things go wrong.</p><h2>What a runbook actually is</h2><p><a href="https://docs.jetty.io/docs/guides/writing-runbooks">A runbook</a> is a structured markdown document that tells an agent how to accomplish an outcome. Not a procedure to follow blindly. An outcome to achieve, with enough guidance to get there.</p><p>The distinction matters. A shell script says &#8220;run these commands in this order.&#8221; A runbook says &#8220;here&#8217;s what must be true when you&#8217;re done, here&#8217;s how to evaluate whether you&#8217;re there, and here&#8217;s what to do when you&#8217;re not.&#8221;</p><p>The critical sections, in order:</p><p><strong>An objective.</strong> Two to five sentences that answer: what am I doing, what am I producing, and for whom? The agent should be able to read this in ten seconds and orient.</p><p><strong>An output manifest.</strong> The exact files that must exist when the task is complete. This is deliberately aggressive:</p><p>You MUST write all of the following files. The task is NOT complete until every file exists and is non-empty. No exceptions.</p><p>That tone exists for a reason. Agents are polite. They want to wrap up gracefully even when they haven&#8217;t finished. The manifest is a forcing function against the premature completion problem I described at the top.</p><p><strong>Evaluation criteria.</strong> How the agent knows whether its output is good enough. This is the section that separates a runbook from a to-do list.</p><p><strong>An iteration loop.</strong> What to do when evaluation fails. 
Try again, but differently, and with a cap on how many times.</p><p><strong>A final checklist with a verification script.</strong> A bash script that checks every output file exists and is non-empty, plus a prose checklist the agent walks through before declaring completion.</p><p>That last part &#8212; the script &#8212; is the only reliable way to prevent the failure I opened with. Without it, the agent will skip steps and tell you everything went great.</p><h2>The hill-climbing loop</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sc53!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sc53!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sc53!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!sc53!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sc53!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!sc53!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1082441,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/192271201?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sc53!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sc53!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!sc53!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sc53!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3608198-c154-43eb-af3c-4be78fbecbcf_1408x768.jpeg 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Every runbook contains at least one judge-refine-rejudge cycle. The agent produces output, evaluates it against criteria, and iterates if it falls short.</p><p>This is the same hill-climbing pattern that works in optimization: define a quality bar, measure against it, improve the weakest dimension, measure again. The runbook just makes it explicit and bounded.</p><p>Bounded is the key word. Without a cap, agents will iterate forever or give up after one attempt. Three rounds is the sweet spot I&#8217;ve landed on. 
Enough to converge on most issues, not enough to burn through your API budget on a lost cause. The runbook specifies what happens when you hit the ceiling: keep the best attempt, flag for human review, or both.</p><p>Two evaluation patterns cover almost everything:</p><p><strong>Programmatic validation</strong> for structured output. Does the JSON schema-validate? Does the SQL execute? Do the tests pass? Error messages are specific and actionable, so the agent converges in one or two rounds.</p><p><strong>Rubric-based judgment</strong> for creative or complex output. Score against multiple criteria on a 1-5 scale, with a pass threshold (like &#8220;overall &gt;= 4.0, no criterion below 3&#8221;). The agent identifies the weakest criterion and makes targeted improvements. A &#8220;Common Fixes&#8221; table maps failures to concrete actions &#8212; this is where you encode the domain expertise that prevents the agent from thrashing.</p><p>The pattern you choose depends on the output. Don&#8217;t rubric-score a JSON file. Don&#8217;t schema-validate a marketing graphic.</p><h2>The new-hire test</h2><p>Here&#8217;s the mental model I use. A runbook is what you&#8217;d write for a competent new hire who needs to run your pipeline while you&#8217;re on vacation.</p><p>You wouldn&#8217;t write a shell script. Too brittle &#8212; the first unexpected error kills it. You wouldn&#8217;t just say &#8220;figure it out.&#8221; Too vague &#8212; they&#8217;ll make assumptions you&#8217;d never make. You&#8217;d write something in between: the process with enough detail to recover from common failures and enough latitude to adapt when something unexpected happens.</p><p>You&#8217;d include the API calls they&#8217;ll need, with examples. You&#8217;d describe what good output looks like. You&#8217;d list the things that commonly go wrong and how to fix them. 
You&#8217;d tell them exactly which files to produce and how to verify they&#8217;re correct before calling it done.</p><p>That&#8217;s a runbook. The agent is the new hire. The markdown is the document you leave behind.</p><h2>Tips are earned, not invented</h2><p>The last section of every runbook is &#8220;Tips.&#8221; These aren&#8217;t generic best practices. They&#8217;re hard-won operational knowledge from watching agents actually run the process and fail.</p><p>Things like: &#8220;Langfuse auth uses HTTP Basic, not Bearer &#8212; agents default to Bearer and get a confusing 401.&#8221; Or: &#8220;Snowflake function names differ from Spark. If the SQL references ARRAY_AGG, the agent will need to use ARRAY_CONSTRUCT instead.&#8221;</p><p>These accumulate over time. Each failed run teaches you something the next version of the runbook should encode. The tips section is the runbook&#8217;s institutional memory &#8212; the things you&#8217;d tell the new hire over coffee that aren&#8217;t in any documentation.</p><h2>Try this now</h2><p>This isn&#8217;t theoretical. The <a href="https://skills.sh/jettyio/agent-skill">Jetty agent-skill</a> ships with tooling for running and validating runbooks.</p><p>validate-runbook.sh checks structural completeness without executing anything. It tells you whether your runbook has all the required sections, whether your template variables are declared, whether your evaluation criteria exist. Think of it as a linter for operational documents.</p><p>run-runbook.sh reads a parameters JSON, injects template variables, and invokes the agent with the runbook as its instruction set. 
It supports a --dry-run mode where the agent reads the runbook and produces an execution plan without making any API calls &#8212; useful when the pipeline involves expensive operations like image generation or database queries.</p><p>The barrier to entry is: write a markdown file with the sections above, validate it, and run it. If you&#8217;ve already got a process you run manually or a pipeline that an agent keeps botching, that&#8217;s your first runbook.</p><h2>The contrarian bet</h2><p>There&#8217;s an irony here. While the industry is building increasingly complex agent frameworks &#8212; tool chains, memory systems, multi-agent orchestration, graph-based planners &#8212; the most reliable guidance mechanism I&#8217;ve found is a well-structured markdown file with a verification script at the bottom.</p><p>Markdown is plain text. It works with every agent. It&#8217;s version-controlled. It&#8217;s diffable. It&#8217;s readable by humans and machines. It doesn&#8217;t require a runtime, a framework, or a dependency. You can review it in a PR.</p><p>The sophistication isn&#8217;t in the format. It&#8217;s in what the document encodes: clear evaluation criteria, bounded iteration, concrete output requirements, and operational knowledge from real failures. That&#8217;s what makes an agent reliable. Not the orchestration layer. The quality of the instructions.</p><p>I suspect this will seem obvious in retrospect. The same way &#8220;just write tests&#8221; seems obvious now but was a hard sell in 2005. The discipline is in writing down what &#8220;done&#8221; looks like before you start &#8212; and giving the agent a way to check its own work.</p><p>What&#8217;s less obvious is where the ceiling is. How complex can the task be before a single markdown file stops being sufficient? I don&#8217;t know yet. But the tasks I&#8217;ve thrown at runbooks &#8212; evaluation pipelines, data ingestion, brand compliance checking, regression testing &#8212; keep working. 
The format scales further than I expected.</p><p>The question I&#8217;m sitting with: if the best agent orchestration is a document with a rubric and a bash script, what does that tell us about where the real leverage is in AI systems?</p>]]></content:encoded></item><item><title><![CDATA[Generation Got Cheap. Verification Didn't.]]></title><description><![CDATA[Cheaper tokens don't mean cheaper AI systems]]></description><link>https://blog.jetty.io/p/generation-got-cheap-verification</link><guid isPermaLink="false">https://blog.jetty.io/p/generation-got-cheap-verification</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Wed, 11 Mar 2026 17:06:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c10f357a-b0d7-4737-9a25-be395768f423_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last spring, shipping with GPT-4o meant budgeting $5 per million input tokens. Enterprise teams paying for Claude Opus were spending $15 input, $75 output. Twelve months later, GPT-5.2 sells for <a href="https://intuitionlabs.ai/articles/chatgpt-api-pricing-2026-token-costs-limits">$1.75 input</a>. Anthropic <a href="https://www.metacto.com/blogs/anthropic-api-pricing-a-full-breakdown-of-costs-and-integration">slashed Opus by 67%</a>. Google&#8217;s Flash-Lite is down to <a href="https://yingtu.ai/en/blog/gemini-flash-api-cheapest-cost">$0.075 per million tokens</a>. DeepSeek <a href="https://www.reuters.com/technology/deepseek-releases-model-it-calls-intermediate-step-towards-next-generation-2025-09-29/">cut prices by half</a> and now lists <a href="https://api-docs.deepseek.com/quick_start/pricing">$0.28 input</a>.</p><p>The MTok crash is real. 
Across every weight class, generation costs fell 30&#8211;80% in a year.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YfiX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YfiX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 424w, https://substackcdn.com/image/fetch/$s_!YfiX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 848w, https://substackcdn.com/image/fetch/$s_!YfiX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 1272w, https://substackcdn.com/image/fetch/$s_!YfiX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YfiX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png" width="1456" height="324" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0218161-5041-4406-9a58-107a4eceb67d_1670x372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:324,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92193,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/190639228?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YfiX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 424w, https://substackcdn.com/image/fetch/$s_!YfiX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 848w, https://substackcdn.com/image/fetch/$s_!YfiX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 1272w, https://substackcdn.com/image/fetch/$s_!YfiX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0218161-5041-4406-9a58-107a4eceb67d_1670x372.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>But it misses the harder question. 
If generating output is approaching commodity pricing, what&#8217;s still expensive?</p><h2>The cost nobody talks about</h2><p>A team I work with runs a support chatbot. Last year it cost them about $18,000 a month on GPT-4o. When GPT-5.2 dropped at a third of the price, they migrated. The inference bill fell to roughly $6,000.</p><p>What didn&#8217;t change: the three engineers who review escalated cases. The QA process that spot-checks 2% of responses. The weekly meeting where someone eyeballs a dashboard and says &#8220;looks fine.&#8221; The verification layer stayed exactly the same size while the output tripled.</p><p>They didn&#8217;t save $12,000. They took $12,000 worth of increased risk.</p><p>This is happening everywhere. A <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6298838">recent paper from MIT</a> formalizes what practitioners already feel. Christian Catalini, Xiang Hui, and Jane Wu model the AI transition as a collision between two cost curves:</p><p><strong>The cost to generate</strong> (what they call c_A) is driven by compute and accumulated knowledge. As both scale, the cost drops exponentially. That&#8217;s the MTok crash in the table above.</p><p><strong>The cost to verify</strong> (c_H) is driven by human time, feedback latency, and expertise. It&#8217;s bounded by biology. An experienced engineer can review traces faster than a junior one, but experienced engineers are scarce, and their wages rise with scarcity. The paper calls this &#8220;verification cost disease&#8221;: even when experts get more efficient, verification gets more expensive because the demand for their judgment grows faster than the supply.</p><p>These curves are diverging. Generation costs collapse. Verification costs stay flat or rise. The gap between them is widening.</p><h2>The gap has a name</h2><p>Catalini et al. 
call it the Measurability Gap: the growing share of tasks where machines can cheaply generate output that humans cannot affordably verify. Their argument is that this gap, not the price of tokens, is the binding constraint on productive AI deployment.</p><p>Think about it in terms of your own system. <strong>Cheap tokens don&#8217;t just save money. They change what&#8217;s economically viable to automate.</strong> When GPT-4o cost $5 per million input tokens, teams were selective about which tasks they automated. They routed FAQ queries to the model and left complex cases to humans. The generation cost acted as a natural filter.</p><p>At $1.75 per million tokens, the filter dissolves. Teams start routing everything through the model. Not just FAQs but nuanced customer complaints, refund decisions, edge cases that used to get flagged for review. Each individual decision is defensible: the model handles it well enough, and it&#8217;s so cheap there&#8217;s no cost argument against it.</p><p>But &#8220;handles it well enough&#8221; is an impression, not a measurement. Nobody increased the verification budget to match the expanded scope. The team still reviews the same 2% sample. The same three engineers attend the same weekly meeting. And the fraction of output that&#8217;s actually verified shrinks with every new task the model picks up.</p><p>The paper formalizes this as four zones:</p><p><strong>Safe Industrial.</strong> Cheap to automate, cheap to verify. This is where AI success stories live. Chatbots answering FAQs. Classification tasks with clear ground truth. Code formatting. The output is easy to check, so automation works.</p><p><strong>Runaway Risk.</strong> Cheap to automate, expensive to verify. This is the zone that expands as token costs crash. The model can generate the output, but proving it&#8217;s correct requires human expertise that doesn&#8217;t scale with compute. Legal summaries. Medical triage suggestions. Financial recommendations. 
Content moderation at volume.</p><p><strong>Human Artisan.</strong> Expensive to automate, cheap to verify. Humans still do it better, and you can tell when they do. This zone is shrinking as models improve, but it&#8217;s where craftspeople live today.</p><p><strong>Pure Tacit.</strong> Expensive to automate, expensive to verify. Strategy. Judgment under uncertainty. The work that&#8217;s hard to even define, let alone evaluate.</p><p>The Runaway Risk zone is the one that should worry you. Every time a model gets cheaper, more tasks cross the automation threshold. But the verification threshold doesn&#8217;t move. The zone grows.</p><h2>What happens in the gap</h2><p>I&#8217;ve watched this play out in specific ways.</p><p>A marketing team I know went from producing 50 assets per campaign to over 4,000. Same team size. Same approval process, in theory. Three people can&#8217;t review 4,000 assets. They spot-checked, trusted the prompts, and shipped. When an off-brand image made it into a campaign, nobody could trace it back to the generation run that produced it.</p><p>That&#8217;s the Runaway Risk zone in action. Generation cost dropped to near zero. Verification didn&#8217;t scale to match.</p><p>Or take model swaps, the most common optimization move. A new model is cheaper. It scores higher on benchmarks. An engineer tests it against a handful of examples, everything looks good, they ship it. Three weeks later a support ticket arrives for a failure mode the old model handled correctly. Nobody connects the ticket to the swap because it doesn&#8217;t look like a regression. It looks like a new bug.</p><p>I wrote about this pattern in an <a href="https://jonlebensold.substack.com/">earlier piece</a>. Teams optimize individual steps without measuring the whole system. Each change is defensible in isolation. In aggregate, the system isn&#8217;t meaningfully different from where it started. Cheaper tokens accelerate this cycle. 
More model options, more swaps, more lateral moves disguised as progress.</p><p>The Catalini paper has a blunt name for unverified output: a &#8220;Trojan Horse&#8221; externality. It looks like productive work. It satisfies the metrics you&#8217;re tracking. But it accumulates hidden risk in the gap between what you can measure and what you can&#8217;t.</p><h2>Where the savings should go</h2><p>Here&#8217;s the math most teams don&#8217;t do.</p><p>That support bot dropped from $18,000 to $6,000 a month. The $12,000 savings is real. The question is where it goes. Most teams take it as margin. Finance is happy. The line item went down.</p><p>But the team also expanded the bot&#8217;s scope. It handles 3x more query types. The generation budget decreased while the verification burden increased. If none of that $12,000 flows into evaluation infrastructure, the team hasn&#8217;t saved money. They&#8217;ve converted an explicit cost (tokens) into an implicit one (undetected failures).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xmix!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xmix!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Xmix!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!Xmix!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Xmix!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xmix!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:844232,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/190639228?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xmix!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Xmix!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Xmix!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Xmix!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb358625c-35dc-4083-8817-215b7bac6ece_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>What does reinvesting look like in practice?</p><p><strong>Eval suites on production traces.</strong> Not the handful of test cases assembled six months ago. Evaluation sets built from what the system actually encounters, refreshed continuously, covering the edge cases that model swaps introduce. If your eval set hasn&#8217;t changed since the last model migration, it&#8217;s lying to you.</p><p><strong>LLM judges for tasks humans can&#8217;t review at volume.</strong> A second model scoring the first against a written rubric. Does this response match our tone? Did the summary preserve the key facts? Is this classification consistent with how we&#8217;ve handled similar cases? Judges don&#8217;t replace human review, but they extend it. They catch the mechanical failures before a human ever has to look.</p><p><strong>Automated pipelines that flag regressions before users do.</strong> The loop I keep coming back to: ingest traces from production, run evaluations against them, surface the findings as concrete changes an engineer can review. Not a dashboard. A diff. Something that lives in the workflow you already have.</p><p>This is the work we do at <a href="https://jetty.io">Jetty</a>. We ingest traces from observability platforms, run evals, and produce pull requests with verified improvements. The MTok crash makes this more urgent, not less. When generation was expensive, the verification gap was narrow. Teams automated selectively and could review what they shipped. As generation approaches commodity pricing, the gap widens. The teams that invest in closing it will capture the surplus from every price cut. The teams that don&#8217;t will race to the bottom on inference costs and wonder why their systems aren&#8217;t getting better.</p><h2>The question worth asking</h2><p>The MTok crash didn&#8217;t make AI cheap. It made generation cheap. 
Those are different things.</p><p>Verification is the new scarce input. The ability to prove your system is working, to catch regressions when you swap models, to evaluate at the scale of your output rather than the scale of your team. That&#8217;s what&#8217;s still expensive. And unlike tokens, it doesn&#8217;t get cheaper by waiting for the next model release.</p><p>Every team I talk to has a version of the same plan: &#8220;We&#8217;ll optimize our AI stack once we have time.&#8221; The MTok crash gives them the budget. But budget without verification infrastructure is just faster generation of output nobody&#8217;s checking.</p><p>The question isn&#8217;t &#8220;which model should we use now that everything&#8217;s cheaper?&#8221; It&#8217;s &#8220;do we have any way to know if our system is working at the scale we&#8217;re running it?&#8221;</p><p>If the answer is no, cheaper tokens just means you&#8217;ll be wrong faster.</p>]]></content:encoded></item><item><title><![CDATA[How to Reliably Generate Content]]></title><description><![CDATA[The review bottleneck nobody rebuilt after AI removed the production one]]></description><link>https://blog.jetty.io/p/how-to-reliably-generate-content</link><guid isPermaLink="false">https://blog.jetty.io/p/how-to-reliably-generate-content</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Thu, 05 Mar 2026 17:56:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!b3a5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A marketing team I know used to produce about 50 assets per campaign. Social cards, email variations, ad units. A designer, a copywriter, and a brand manager reviewed every piece. Slow, but reliable.</p><p>Last quarter they generated over 4,000 assets.</p><p>Same team size. 
Same approval process &#8212; in theory. In practice, three people can&#8217;t review 4,000 assets. They spot-checked, trusted the prompts, and shipped.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b3a5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b3a5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!b3a5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!b3a5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b3a5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b3a5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1102236,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/190020907?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b3a5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!b3a5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!b3a5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b3a5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9a7ea3f-08d9-4a92-91fa-33f7e3361c92_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>This is everywhere now. <a href="https://www.coca-colacompany.com/media-center/the-coca-cola-company-introduces-fizzion">Coca-Cola&#8217;s Fizzion platform</a> produces personalized imagery across 100+ markets. <a href="https://www.superside.com/blog/ai-marketing-campaigns">Moet Hennessy scaled to 3 million content variations</a> globally. The generation capacity is unbounded. The review capacity hasn&#8217;t changed at all.</p><h2>What happens when nobody&#8217;s checking</h2><p>We don&#8217;t have to imagine this. Coca-Cola&#8217;s <a href="https://www.marketingdive.com/news/why-coca-cola-keeps-pushing-limits-generative-ai-despite-backlash/804739/">AI-generated holiday campaign</a> drew immediate backlash &#8212; &#8220;soulless,&#8221; &#8220;creepy,&#8221; characters that looked almost human but not quite. 
Google Gemini generated <a href="https://blog.google/products/gemini/gemini-image-generation-issue/">historically inaccurate images</a>. Meta launched AI-generated fake profiles and <a href="https://adage.com/events-awards/aa-5-ai-brand-fails-and-lessons-for-marketers/">retracted them within days</a>.</p><p>These aren&#8217;t startups. These are companies with massive brand teams and explicit guidelines. They shipped anyway, because the speed of generation overwhelmed the humans who used to catch these problems.</p><p>And a <a href="https://phys.org/news/2025-11-backlash-ai-imagery-ads-begun.html">2025 study from the Nuremberg Institute</a> found that just labeling content as AI-generated makes people view it as less trustworthy. Bad AI content doesn&#8217;t just look bad &#8212; it actively erodes brand trust.</p><h2>LLM judges are the only path that scales</h2><p>If humans can&#8217;t review 4,000 assets, the only option is another AI. Uncomfortable conclusion, but it&#8217;s arithmetic.</p><p>Coca-Cola got here first. Fizzion encodes 140 years of brand rules &#8212; the red hue, logo spacing, typography &#8212; into machine-readable metadata called <a href="https://business.adobe.com/blog/adobe-and-coca-cola-co-innovate-on-project-fizzion">StyleID</a>. The AI that generates content is constrained by the AI that evaluates it. It&#8217;s now mandatory for all their agency partners.</p><p><a href="https://www.jasper.ai/brand-iq">Jasper&#8217;s Brand IQ</a> does something similar &#8212; an LLM judge that flags violations and suggests on-brand replacements in real-time. <a href="https://www.acrolinx.com/">Acrolinx</a> scores content against your style guide and tone of voice. These aren&#8217;t research projects. They&#8217;re production systems.</p><p>The pattern is LLM-as-judge: define criteria, feed the output to an evaluator, get a structured score. Does this copy match our voice? Does this image follow our guidelines? Is the tone right for this market? 
Even if you heavily discount the vendor claims, an automated judge that catches 80% of violations beats the current reality of reviewing 2% and hoping.</p><h2>Traceability is the other shoe</h2><p>There&#8217;s a second problem beyond quality: provenance. When you produce 4,000 assets from a base concept, you need to trace any one of them back to the prompt that generated it, the model version, the brand rules, and the judge score it received.</p><p>The <a href="https://artificialintelligenceact.eu/article/50/">EU AI Act</a>, taking effect August 2026, requires AI-generated content to be machine-readably marked and disclosed. Penalties: up to 15 million euros or 3% of global turnover. <a href="https://deepmind.google/technologies/synthid/">Google&#8217;s SynthID</a> has already watermarked over 20 billion pieces of AI content. The <a href="https://c2pa.org/">C2PA standard</a> embeds provenance metadata directly into files.</p><p>But watermarking solves attribution, not compliance. The full audit trail needs to connect generation to evaluation to deployment, with every decision logged. If you&#8217;ve built production AI systems, this should sound familiar &#8212; it&#8217;s the same observability infrastructure any LLM pipeline needs.</p><h2>The hard question</h2><p>Can LLM judges reliably catch the brand violations that actually matter?</p><p>Wrong logo, off-palette colours &#8212; easy. Any prompted evaluator catches those. The hard stuff is a headline that&#8217;s technically on-brand but tonally wrong for the British market. An image that follows every guideline but feels cheap. Copy that sounds robotic despite using approved terminology.</p><p>These are judgment calls from years of accumulated context &#8212; exactly the kind of tacit knowledge that&#8217;s hardest to encode, and exactly the kind of failure that does the most damage.</p><p>The teams doing this well use judges as a filter, not a replacement. Catch the mechanical violations automatically. 
Route flagged assets to humans with specific notes. The judge says: &#8220;This scored 6/10 on brand voice. Here&#8217;s why.&#8221; Automated tests catch regressions; code review catches design issues. You don&#8217;t skip either one, and you don&#8217;t run the automated tests by hand.</p><p>This is what we build at <a href="https://jetty.io">Jetty</a>. You define your brand rules as judge criteria, point it at your content pipeline, and the LLM judge scores every asset &#8212; then hill-climbs to improve the ones that fall short. Minutes to set up, not months.</p><p>The content industry built incredible generation capacity and forgot to build the CI pipeline. The teams that close that gap first won&#8217;t just generate better content &#8212; they&#8217;ll be able to prove it&#8217;s on-brand at a volume where proof actually matters.</p><p>The rest will keep spot-checking 2% and waiting for the next PR crisis.</p>]]></content:encoded></item><item><title><![CDATA[How LLM Judges Make AI Stop Looking Generic]]></title><description><![CDATA[How a scoring rubric replaced eyeballing for brand consistency]]></description><link>https://blog.jetty.io/p/how-llm-judges-make-ai-stop-looking</link><guid isPermaLink="false">https://blog.jetty.io/p/how-llm-judges-make-ai-stop-looking</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Mon, 02 Mar 2026 12:41:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M5_y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I needed brand-consistent illustrations. Not stock photos. Not whatever Midjourney feels like on a given day. Images that match a specific style guide: navy-and-cream linocut prints, or flat vector pelicans on dark blue. 
Visual consistency that makes a brand feel intentional.</p><p>The problem is obvious to anyone who&#8217;s tried. Without guardrails, every generation is a roll of the dice. One image comes back photorealistic. The next is cartoonish. A third nails the palette but the composition is wrong. You&#8217;re eyeballing each one, and eyeballing doesn&#8217;t scale. Worse, it doesn&#8217;t transfer. The person who finally cracked the right prompt for last week&#8217;s image isn&#8217;t around when someone else needs one this week.</p><p>This is the &#8220;prompt wizard&#8221; problem that every organization using generative AI runs into eventually. Someone develops an intuition for how to coax the right output from a model. That knowledge lives in their head. They become the spell-caster, and the rest of the team waits in line. It&#8217;s artisanal in the worst sense: unrepeatable, unverifiable, and fragile.</p><p>So I built a two-step workflow: generate an image, then judge it against my brand style guide. The loop takes about 30 seconds. Three iterations took me from 2/10 to 9/10. More importantly, the rubric means anyone can run it.</p><h2>The setup</h2><p><strong>Step one:</strong> generate an image with Gemini. </p><p><strong>Step two:</strong> send it to GPT-4o with a scoring rubric that describes my brand.</p><p><strong>The rubric:</strong></p><p>Rate how well this image matches the Pelican Brand style: minimalist flat vector illustration OR vintage hand-drawn printmaking sardine tin art. Key criteria: (1) limited color palette &#8212; navy, white, gold for flat vector OR 1-2 muted colors on cream for linocut, (2) correct style, (3) simple composition with the subject filling the frame, (4) NO photorealism, NO cartoon style, NO busy backgrounds. Score 1-10.</p><p>One generation model, one judge model, one rubric. 
The judge returns a score and an explanation of what&#8217;s wrong.</p><h2>Run 1: the raw prompt (2/10)</h2><p>I started with the kind of prompt you&#8217;d write for any image generator:</p><p>&#8220;A close-up of a magnifying glass held over a printed illustration. Through the lens, the image appears sharper and more defined. Warm studio lighting, shallow depth of field.&#8221;</p><p>The result looked like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M5_y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M5_y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!M5_y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!M5_y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!M5_y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!M5_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:734797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M5_y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!M5_y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!M5_y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!M5_y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f11e41-466a-48d2-b1e7-4be0dcac2731_1376x768.jpeg 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Technically impressive. Completely off-brand. The judge scored it 2/10: <em>&#8220;Photorealistic with complex shading and depth of field. Does not match either the flat vector or printmaking style.&#8221;</em></p><p>I knew this would fail. The point is that the judge articulates <em>why</em> in terms I can act on: too realistic, wrong style, too much detail.</p><h2>Run 2: adapted to brand style (8.5/10)</h2><p>I rewrote the prompt using a template I&#8217;d developed for the linocut style:</p><p>&#8220;Bold linocut block print illustration on cream paper. 
A circular feedback loop with four stations: a pencil, a picture frame, a magnifying glass, and a gauge. Thick arrows connect them in a cycle. Hand-carved woodcut style, imperfect ink edges, heavy line weight. Single color dark navy blue ink on cream paper. No text. Vintage printmaking aesthetic.&#8221;</p><p>Same concept. Completely different constraints:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KyVU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KyVU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KyVU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KyVU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KyVU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KyVU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:800089,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KyVU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KyVU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KyVU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KyVU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F649b0ccb-454e-407b-bac1-0b7f931e4085_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The judge scored it 8.5/10: <em>&#8220;Closely matches the vintage hand-drawn printmaking style. Limited color palette with navy on cream. Simple composition with the subject filling the frame.&#8221;</em></p><p>From 2 to 8.5 in one rewrite.</p><h2>Run 3: refined from feedback (9/10)</h2><p>The judge&#8217;s only complaint: composition could be tighter. 
I added &#8220;subject enclosed in a single bold circle, tightly cropped&#8221; and strengthened the ink texture language:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l7jc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l7jc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!l7jc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!l7jc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!l7jc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l7jc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:848598,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l7jc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!l7jc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!l7jc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!l7jc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91ef6dff-421c-4c56-b6ea-c6fc64f3c184_1376x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Score: 9/10. Three runs. Five minutes of prompt editing. The judge did the hard work of evaluating consistency.</p><h2>Without the judge vs. with it</h2><p>To make the difference concrete, I ran five short nautical prompts through both pipelines. Same concepts, same image model. 
The only difference: one has the brand judge, one doesn&#8217;t.</p><p><strong>Without the judge</strong> (five short prompts, no guardrails):</p><p>Anchor in rough seas:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wCx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wCx1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wCx1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wCx1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wCx1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wCx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1055681,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wCx1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wCx1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wCx1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wCx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86789eb2-340c-4a96-8c53-254250816941_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P-7U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P-7U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!P-7U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!P-7U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!P-7U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P-7U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1187474,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!P-7U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!P-7U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!P-7U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!P-7U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f0dcee2-c023-4908-b2cd-fd9ab7159f15_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Lighthouse at night:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C4Ch!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C4Ch!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!C4Ch!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!C4Ch!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!C4Ch!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C4Ch!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:873356,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C4Ch!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!C4Ch!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!C4Ch!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!C4Ch!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b4f169f-b85f-49a5-95a5-f1221a044cf4_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rtpF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rtpF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!rtpF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rtpF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rtpF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rtpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1064410,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!rtpF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rtpF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rtpF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rtpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa58bae6b-b66b-420f-8c93-eace54e8430c_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Ship in a bottle:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4V0D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4V0D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4V0D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4V0D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4V0D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4V0D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:825928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4V0D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4V0D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4V0D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4V0D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5097d4b1-1311-447d-a76e-78243dac9232_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Kkn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Kkn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!6Kkn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Kkn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Kkn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Kkn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:843514,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!6Kkn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Kkn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Kkn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Kkn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff096d3e7-e29c-4e92-a120-ec9b302fcb43_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Sailor tying a knot:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A79J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A79J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!A79J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!A79J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!A79J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A79J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:984243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A79J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!A79J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!A79J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!A79J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36291e4d-7e6a-46a4-9b95-e844b65056d3_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7f80!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7f80!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!7f80!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7f80!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7f80!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7f80!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:958638,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189400021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!7f80!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7f80!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7f80!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7f80!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F937e6b33-87ae-485b-8621-e4bf6486e1ef_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Five prompts, five completely different visual styles. Photorealism, nature photography, portraiture, still life.<br><br>The prompt templates do the heavy lifting. Once I dialed them in with the judge&#8217;s feedback, producing a new on-brand image is a one-liner: swap in the concept, run the pipeline. No wizardry required.</p><h2>The pattern</h2><p>This isn&#8217;t really about images. It&#8217;s the same loop I keep running into with AI systems: close the gap between generation and evaluation.</p><ol><li><p><strong>Generate</strong> with whatever model you want</p></li><li><p><strong>Judge</strong> against a written rubric using a different model</p></li><li><p><strong>Read the feedback</strong> and adjust</p></li><li><p><strong>Repeat</strong> until the score converges</p></li></ol><p>The architecture is two API calls and a scoring prompt. You could wire this up with a script, but the reason <a href="https://flows.jetty.io">I built it as a Jetty pipeline</a> is that the same workflow runs for every image, every time, without someone babysitting it. Define the rubric once, and anyone on the team can generate on-brand images without becoming a prompt expert. The insight is that LLM-as-judge works for images, not just text. A written rubric can encode brand guidelines well enough to automate taste.</p><p>This is what kills the spell-caster problem. The wizard&#8217;s intuition about &#8220;what works&#8221; gets externalized into a rubric that anyone can run. No one needs to know which magic words make Gemini produce a good linocut. 
The judge tells you what&#8217;s wrong, and the fix is usually obvious from the feedback. The taste lives in the rubric, not in someone&#8217;s head.</p><p>Every image I generate without this loop is a coin flip that either reinforces my brand or dilutes it. With the loop, the images converge. After a few rounds, you develop reusable prompt templates that score 8+ on the first try.</p><h2>Try it</h2><p>The whole pipeline is <a href="https://github.com/jlebensold/brand-image-judge">one workflow definition and a rubric</a>. I built it by asking Jetty&#8217;s CLI to create a two-step pipeline: generate an image, then judge it. That&#8217;s it. The rubric is a text file. The prompt templates are text files. Once you have a look and feel that works, every new image is a one-line prompt with the concept swapped in.</p><p>If you want to adjust the judge, you don&#8217;t need to touch code. Update the rubric text and run it again. I&#8217;ve tweaked the scoring criteria three times since I started, each time just by editing what &#8220;on brand&#8221; means in plain English.</p><p>Fork the repo, swap in your own style guide, and run it. The loop works for any brand style, not just mine. Write a rubric that describes what &#8220;on brand&#8221; means for you. Be specific: name the colors, the style, what to avoid. The judge will tell you what&#8217;s wrong faster and more consistently than you can eyeball it.</p><p>If your brand images look different every time, the problem isn&#8217;t the image model. 
It&#8217;s the absence of a feedback loop.</p>]]></content:encoded></item><item><title><![CDATA[Meter Before You Manage]]></title><description><![CDATA[The three layers between your LLM bill and actually fixing it]]></description><link>https://blog.jetty.io/p/meter-before-you-manage</link><guid isPermaLink="false">https://blog.jetty.io/p/meter-before-you-manage</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Sun, 01 Mar 2026 16:20:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XfDg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your CFO pings you about a $47,000 &#8220;OpenAI API&#8221; line item. You know it&#8217;s high. You&#8217;ve known for months. But when they ask which features drive the spend, you can&#8217;t answer. Not because you&#8217;re hiding something. Because you don&#8217;t know.</p><p>Which endpoint? Which model? Which user flow? The monthly invoice is a single opaque line, and every conversation about it ends the same way: &#8220;We&#8217;ll optimize later.&#8221;</p><p>Later never comes. Optimization without instrumentation is guessing, and guessing doesn&#8217;t get sprint tickets.</p><h2>The restaurant with no menu prices</h2><p>Imagine running a restaurant where you can see total food cost each month but not which dishes drive it. You spent $40,000 on ingredients. Is the wagyu steak killing you, or is it the house salad that secretly costs twice what you charge? Cut portion sizes? Switch suppliers? Drop a menu item? Without per-dish cost data, every move is a guess.</p><p>The generic LLM cost advice floating around is the same kind of guessing. &#8220;Use smaller models.&#8221; For which calls? &#8220;Cache repeated queries.&#8221; Which queries repeat? 
&#8220;Shorten your prompts.&#8221; Which ones, and by how much?</p><p>Good strategies, all of them. None are actionable without metering.</p><h2>What metering actually means</h2><p>Metering isn&#8217;t observability. You might already have traces flowing into Langfuse or Arize. Good. But traces tell you what happened on individual requests. Metering tells you what your system costs in aggregate, broken down by dimensions you can act on.</p><p>Three layers, each building on the last.</p><p><strong>Layer 1: The toll booth.</strong> Every LLM call goes through a proxy that records which model it hit, how many tokens it consumed, what it cost, and who requested it. LiteLLM is the most common choice here. One proxy, every provider, every call metered. No more reconciling your OpenAI bill against your Anthropic bill against your Azure bill in a spreadsheet. One ledger.</p><p>If your monthly cost conversation starts with someone logging into three different provider dashboards and adding numbers up manually, you&#8217;re at layer zero. Setting up a proxy is a day of work, maybe two. Per-team API keys give you basic allocation immediately.</p><p><strong>Layer 2: Attribution.</strong> Raw metering gives you totals by model and endpoint. Attribution gives you totals by business dimension. &#8220;The FAQ chatbot costs $18,000 a month.&#8221; &#8220;The document extraction pipeline is 60% of our spend.&#8221; &#8220;Team X&#8217;s experimental feature costs more than the core product.&#8221;</p><p>This is where Langfuse traces become powerful. Not as individual request debuggers, but as the data source for cost attribution. Tag traces by feature, team, and customer tier. Aggregate. Now you have a menu with prices.</p><p><strong>Layer 3: Optimization with evidence.</strong> Once you can see that the FAQ chatbot costs $18,000 a month, you can ask the right questions. Why so much? It handles 500,000 requests a month at 1,500 tokens average, mostly system prompt, on GPT-4o. 
Someone picked that model eight months ago and nobody revisited.</p><p>Now &#8220;use a smaller model&#8221; is actionable. Route FAQ classification to GPT-4o-mini and that $18,000 drops to under $3,000. Run the experiment. Measure quality. Show your CFO the before-and-after.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XfDg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XfDg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!XfDg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!XfDg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!XfDg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XfDg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1076781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189363004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XfDg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!XfDg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!XfDg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!XfDg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a31c9c-a58f-499a-ae5e-34bd6a146521_1376x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Without layers 1 and 2, layer 3 is a fantasy. Every optimization tip on the internet assumes you&#8217;ve done the metering work. Almost nobody has.</p><h2>The scale of what&#8217;s hiding</h2><p>Once you have metering, the waste is hard to miss.</p><p>Roughly <a href="https://arxiv.org/abs/2403.02694">a third of LLM queries are semantically similar</a> to previous ones. A third of your spend, going to questions you&#8217;ve already answered. Semantic caching <a href="https://arxiv.org/abs/2411.05276">cuts 60-70% of those costs</a> and drops latency from hundreds of milliseconds to double digits. But you can&#8217;t build a cache policy without knowing which queries repeat.</p><p>The price gap between model tiers is <a href="https://intuitionlabs.ai/articles/llm-api-pricing-comparison-2025">20-30x, not 2x</a>. 
Flagship models like GPT-4o run $2.50-10 per million tokens. Budget-tier alternatives like GPT-4o-mini or DeepSeek run a fraction of that. For classification, routing, and formatting tasks, the cheaper model often matches the expensive one. But &#8220;often&#8221; isn&#8217;t &#8220;always,&#8221; and you need per-task quality metrics to know which calls can safely move down.</p><p>These aren&#8217;t exotic findings. They&#8217;re the baseline state of most production AI systems. The waste is there whether you can see it or not. Metering just makes it visible.</p><h2>Why &#8220;we&#8217;ll optimize later&#8221; never happens</h2><p>I&#8217;ve talked to dozens of engineering leads who have &#8220;LLM cost optimization&#8221; on their roadmap. Not one has a sprint ticket for it.</p><p>The problem is activation energy. Without metering, optimization has no clear starting point. You&#8217;d need to audit every LLM call, figure out which models each endpoint uses, estimate traffic per route, benchmark alternatives, and build evaluation infrastructure to measure quality impact. That&#8217;s weeks of work with uncertain payoff spread across dozens of endpoints.</p><p>The mental math writes itself: &#8220;Two weeks analyzing costs for maybe 30% savings, or two weeks shipping a feature customers are asking for.&#8221; The feature wins every time.</p><p>Metering collapses this. When you can see that one chatbot feature costs $18,000/month on an overpowered model, the optimization becomes a two-hour task, not a two-week audit. The gap between &#8220;massive research project&#8221; and &#8220;swap a model string and validate&#8221; is just visibility.</p><p>The teams that optimize aren&#8217;t more disciplined. They instrumented earlier.</p><h2>The uncomfortable parallel</h2><p>Every company that&#8217;s been through cloud cost optimization knows this pattern. 
The early days of AWS were identical: teams spun up instances, ran workloads, and got a single bill at the end of the month. &#8220;Cloud spending&#8221; was one line item. It took years of tooling before teams could manage what they were spending. Cost allocation tags. Reserved instance planning. Right-sizing recommendations. Each layer made the next optimization possible.</p><p>LLM costs are at that same inflection point. The difference is speed. Cloud cost maturity took a decade. <a href="https://menlovc.com/perspective/2025-mid-year-llm-market-update/">Enterprise LLM spending more than doubled in six months</a>, from $3.5 billion to $8.4 billion. The scrutiny is coming faster than the tooling.</p><p>The teams that instrument now will be ready. The teams that don&#8217;t will keep promising finance they&#8217;ll optimize &#8220;next sprint,&#8221; knowing they can&#8217;t start what they can&#8217;t see.</p>]]></content:encoded></item><item><title><![CDATA[AI Optimization Is a Game of Whack-a-Mole]]></title><description><![CDATA[The measurement problem hiding inside your optimization work]]></description><link>https://blog.jetty.io/p/ai-optimization-is-a-game-of-whack</link><guid isPermaLink="false">https://blog.jetty.io/p/ai-optimization-is-a-game-of-whack</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Thu, 26 Feb 2026 03:59:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZLPL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I watched a team spend three months optimizing their AI pipeline. Good engineers, plenty of budget. They&#8217;d fix one thing and something else would break. Latency improved but quality dropped. Quality recovered but costs crept back up. 
They&#8217;d tune the retrieval step and the summarization step would start hallucinating.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZLPL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZLPL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZLPL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZLPL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ZLPL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZLPL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:967665,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/189214931?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZLPL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZLPL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZLPL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ZLPL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca834b51-2cb4-426e-b924-500c3cab96e4_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>After three months, I asked them to show me the net improvement. They couldn&#8217;t. Not because it was negative, but because it was unmeasurable. They had no baseline from before the work started, no way to tell whether the system was better than it was ninety days ago.</p><p>They&#8217;d been playing whack-a-mole.</p><h2>The pattern</h2><p>Here&#8217;s how it usually goes.</p><p>A team ships an AI feature. It works. Users like it. Leadership wants it faster, cheaper, more reliable. So an engineer starts optimizing.</p><p>They look at costs. GPT-5 looks expensive, so they swap the classification step to GPT-4o-mini. It&#8217;s cheaper, it&#8217;s faster, and for classification the quality should be comparable. They test against a handful of examples. 
Everything looks good.</p><p>A week later, a support ticket comes in. The classifier is miscategorizing a specific type of input that the old model handled correctly. Nobody connects the ticket to the model swap because it doesn&#8217;t look like a regression. It looks like a new bug.</p><p>Meanwhile, another engineer shortens the generation prompt to reduce token costs. Average outputs improve. They&#8217;re more concise, less repetitive. But a long tail of edge cases that the verbose prompt used to handle now produce garbled responses. The team doesn&#8217;t notice because they&#8217;re testing against examples they assembled two months ago, not against the inputs the system actually sees.</p><p>A month later, someone upgrades the embedding model to improve retrieval quality. Retrieval gets better. But the downstream summarizer was tuned for the old embedding distribution, and now it sometimes misinterprets what the retrieval step returns. The summarizer didn&#8217;t change. The retrieval step didn&#8217;t break. The interface between them shifted, and nobody was watching.</p><p>Each individual change was defensible. Each one made some metric better. And the system, in aggregate, isn&#8217;t meaningfully different from where it started.</p><h2>Why this happens</h2><p>The whack-a-mole pattern has a few root causes, and they compound.</p><p><strong>No baselines.</strong> Most teams don&#8217;t snapshot their system&#8217;s performance before starting optimization work. They have a vague sense that it costs too much or quality isn&#8217;t where it should be, but no structured measurement to compare against. Without a baseline, improvement is an impression, not a fact.</p><p><strong>Stale evaluation sets.</strong> The examples teams test against are usually assembled once, early in development, and rarely updated. They represent what the system used to see, not what it sees now. Real user inputs drift. New edge cases emerge. 
The evaluation set becomes a shared fiction that everyone treats as ground truth.</p><p><strong>Invisible dependencies.</strong> AI pipelines aren&#8217;t linear. A change to retrieval affects generation. A change to the prompt affects how the model uses retrieved context. A change to the model affects how it interprets the prompt. These coupling effects are hard to predict and easy to miss when you&#8217;re only measuring the step you changed.</p><p>And then there&#8217;s the one that deserves its own section.</p><h2>The model swap trap</h2><p>Many teams I&#8217;ve worked with have done some version of this: a new model comes out, it&#8217;s cheaper or faster or scores higher on a benchmark, so they swap it in. The public leaderboard says it&#8217;s better. The provider&#8217;s blog post says it&#8217;s better. The handful of test cases they run say it&#8217;s better.</p><p>It isn&#8217;t. Not universally.</p><p>Models don&#8217;t improve uniformly across all capabilities. A model that scores higher on MMLU might handle tool calls differently. A model that&#8217;s faster might be more terse, great for chat but terrible for document generation. A model trained on newer data might have different failure modes than the one you spent weeks tuning prompts for.</p><p>The problem isn&#8217;t that the new model is worse. &#8220;Better&#8221; is a distribution, not a scalar. The new model is better on average and worse on a long tail of edge cases your evaluation set doesn&#8217;t cover, because that set was built before those edge cases existed.</p><p>This is why teams end up in whack-a-mole. They make a change that&#8217;s net positive on the metrics they track and net negative on metrics they don&#8217;t. 
Then they fix the newly visible problem, which introduces a new invisible one.</p><h2>We&#8217;ve been here before</h2><p>If you were writing software in the early 2000s, this pattern has a familiar shape.</p><p>Before continuous integration, teams developed in isolation for weeks, then merged everything together and prayed. &#8220;Works on my machine&#8221; was the official status report. Integration day was a reckoning. Bugs appeared at the boundaries between components, and nobody could tell whose change caused which failure.</p><p>CI solved this by making integration continuous. Every change was tested against the whole system immediately. You didn&#8217;t discover boundary failures at the end. You discovered them when they were introduced, when the context was fresh and the blast radius was small.</p><p>AI teams are in the pre-CI era right now. They develop changes in isolation, test against static evaluation sets, and deploy into a system where everything else has also changed. The boundary failures show up days or weeks later, disconnected from the change that caused them.</p><p>The fix is the same one. A continuous loop where every change is evaluated against the current state of the whole system, using data that reflects what the system actually encounters in production.</p><h2>What the exit looks like</h2><p>The exit from whack-a-mole isn&#8217;t working harder or being smarter about which changes to make. It&#8217;s building the measurement infrastructure that tells you whether your changes are helping.</p><p><strong>Start with a real baseline.</strong> Before you optimize anything, measure your system end-to-end against production-representative data. Cost per trace. Quality per step. Error rates by input type. Latency distributions. This is your starting line.</p><p><strong>Evaluate the system, not the step.</strong> When you swap a model or change a prompt, don&#8217;t just test that step in isolation. Run the whole pipeline. 
The dependencies between steps are where whack-a-mole lives.</p><p><strong>Refresh your evaluation data.</strong> Your test set should be a living sample of what your system actually sees, not a museum of what it used to see. Pull from production traces. Include the weird inputs, the edge cases, the failures. If your evaluation set hasn&#8217;t changed in a month, it&#8217;s already lying to you.</p><p><strong>Measure before and after every change.</strong> Almost nobody does this. Run your evaluation suite, make the change, run it again. If you can&#8217;t show a net improvement across the metrics that matter, the change isn&#8217;t an improvement. It&#8217;s a lateral move.</p><p><strong>Track the aggregate.</strong> Individual step metrics are useful but insufficient. You need a system-level view: cost per successful outcome, end-to-end quality, cumulative error rates. If step metrics go up but system metrics stay flat, you&#8217;re playing whack-a-mole with extra steps.</p><h2>The uncomfortable truth</h2><p>The teams I&#8217;ve seen break out of the whack-a-mole cycle aren&#8217;t the ones with the best engineers or the most sophisticated models. They&#8217;re the ones that invested in measurement before they invested in optimization.</p><p>That&#8217;s a hard sell. When leadership wants costs down by next quarter, the natural response is to start cutting. Swap a model, shorten a prompt, add a cache. Each change feels productive. The dashboard numbers move. 
But &#8220;productive&#8221; and &#8220;effective&#8221; are different things when you have no infrastructure to tell them apart.</p><p>The question worth asking isn&#8217;t &#8220;what should we optimize next?&#8221; It&#8217;s &#8220;do we have any way to know if our last optimization actually worked?&#8221;</p><p>If the answer is no, that&#8217;s the first thing to fix.</p>]]></content:encoded></item><item><title><![CDATA[Foundation Models Ship Like Windows 98]]></title><description><![CDATA[AI shipped Big Bang releases back to production]]></description><link>https://blog.jetty.io/p/foundation-models-ship-like-windows</link><guid isPermaLink="false">https://blog.jetty.io/p/foundation-models-ship-like-windows</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Tue, 17 Feb 2026 03:30:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zydN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Early in my career, I watched coworkers dread Oracle upgrades the way you&#8217;d dread a root canal. Weeks before a release, the mood would shift. Development froze. Test plans ran to a hundred pages. Someone would pick a Friday night, and the whole team would brace for impact. If the upgrade failed, Saturday was a rollback. If it succeeded, Monday was triage.</p><p>We called these Big Bang releases. Months of change compressed into a single explosive moment, with everyone hoping the blast didn&#8217;t take production down with it.</p><p>The pressure was the part that stayed with me. As release dates approached, you could feel it building across the team. Not excitement. Dread. 
The kind of tension that comes from knowing you&#8217;ve accumulated so much change that nobody can predict what will happen when you flip the switch.</p><p>The software industry spent the next fifteen years dismantling this approach. <a href="https://martinfowler.com/articles/continuousIntegration.html">Continuous integration</a>. Automated testing. Feature flags. Deploy-on-merge.</p><p>The early adopters made the rest of us look like we were standing still. In 2009, Flickr&#8217;s John Allspaw and Paul Hammond got on stage at Velocity and announced they were <a href="https://www.slideshare.net/slideshow/10-deploys-per-day-dev-and-ops-cooperation-at-flickr/1628368">deploying to production ten or more times a day</a>. That talk kicked off the DevOps movement. Etsy took it further &#8212; new engineers shipped code to production on their first day, and by 2014 they were pushing 80+ deploys daily with 150 engineers. Amazon hit a deploy every 11.6 seconds by 2011. Netflix open-sourced <a href="https://spinnaker.io/">Spinnaker</a> and built a deployment culture so confident they&#8217;d randomly kill production services just to prove they could recover.</p><p>By 2015, a team of two thousand engineers could push hundreds of commits a day, each one tested, deployed, and monitored independently. 
The Big Bang was dead.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zydN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zydN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zydN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zydN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zydN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zydN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:642067,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/188217167?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zydN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zydN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zydN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zydN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da51024-6ba4-46b2-9fe4-8a577de05d9b_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Or so I thought.</p><h2>Shrink-wrap AI</h2><p>Watch how foundation model providers ship. OpenAI releases GPT-4, then GPT-4 Turbo, then GPT-4o. Anthropic releases Claude 3, then 3.5, then 4. Each version is a major event with a blog post, a benchmark table, and a wave of developers scrambling to figure out what changed.</p><p>This is shrink-wrap software. It&#8217;s the same mentality that gave us Oracle 8i and Windows ME: build it behind closed doors, stamp a version number on it, and push it out the door. The release itself is the artifact. Everything that happened between versions is invisible to the people who depend on the product.</p><p>For the model providers, this might be unavoidable. Training runs are expensive, and you can&#8217;t exactly deploy a half-trained model to production. 
But for every team building on top of these models, inheriting the shrink-wrap mentality is a choice. And it&#8217;s the wrong one.</p><h2>The second Big Bang</h2><p>Here&#8217;s what I see in practice. A team builds an AI feature against GPT-4. They test it, tune the prompts, get the outputs to an acceptable quality. Ship it. Months pass. OpenAI deprecates the model version. Or a new model comes out that&#8217;s cheaper and supposedly better.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HuLJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HuLJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!HuLJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!HuLJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!HuLJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!HuLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:801803,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/188217167?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HuLJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!HuLJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!HuLJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!HuLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F411481d0-85d7-4238-a518-30602a2b8169_1408x768.jpeg 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Now the team faces a familiar decision: migrate everything at once or fall behind. They pick a weekend. They swap the model. They rerun their evaluation suite, which is a gold dataset they assembled six months ago, and the numbers look fine.</p><p>Monday morning, support tickets start rolling in. The new model handles tool calls differently. It&#8217;s more verbose in some cases, terser in others. Edge cases that the old model handled gracefully now produce hallucinations. 
The evaluation suite didn&#8217;t catch any of this because it was testing against a frozen snapshot of reality, not reality itself.</p><p>This is the Big Bang release, reborn. Different technology, same failure mode.</p><h2>What CI actually solved</h2><p>The insight behind continuous integration wasn&#8217;t just &#8220;deploy more often.&#8221; It was that small, incremental changes are fundamentally easier to reason about than large ones.</p><p>When you deploy a single commit, you know exactly what changed. If something breaks, you know where to look. When you deploy six months of accumulated changes in one shot, you&#8217;re debugging a combinatorial explosion. Every change interacts with every other change, and the failure could be anywhere in the stack.</p><p>CI worked because it made each change small enough to understand, test, and roll back independently. It didn&#8217;t eliminate risk. It made risk manageable.</p><h2>AI needs the same loop</h2><p>The parallel to AI systems is direct, but the loop looks different.</p><p>In traditional CI, the cycle is: write code, run tests, deploy, monitor, repeat. The code changes. The tests and infrastructure stay relatively stable.</p><p>In AI systems, everything moves. The model changes when providers ship updates. The data changes as users interact with your system. The prompts change as your team iterates. The retrieval corpus changes as new documents get indexed. You can&#8217;t hold any of it still long enough to test it the way you&#8217;d test a REST endpoint.</p><p>So what does CI for AI actually look like?</p><p><strong>Grab production data.</strong> Not a curated test set from six months ago. Actual inputs and outputs from your running system. This is the raw signal that tells you how your system behaves in the real world.</p><p><strong>Sanitize and transform.</strong> Strip PII, anonymize where needed, handle compliance requirements. This is non-trivial but it&#8217;s infrastructure, not a blocker. 
Teams that treat privacy as a reason to avoid production data entirely end up flying blind.</p><p><strong>Label and categorize.</strong> Not everything needs human labeling. Cluster similar inputs. Flag anomalies. Use LLM-as-judge for initial quality assessments. Build a living picture of what your system handles well and where it struggles.</p><p><strong>Benchmark against the next iteration.</strong> Before you swap a model, change a prompt, or update your retrieval pipeline, run the proposed change against your current production-derived evaluation set. Not a public leaderboard. Not an academic benchmark. Your data, your users, your edge cases.</p><p><strong>Deploy incrementally.</strong> Don&#8217;t swap everything at once. <a href="https://martinfowler.com/bliki/CanaryRelease.html">Canary</a> the change. Route 5% of traffic to the new configuration. Compare quality metrics side by side. Expand or roll back based on evidence, not hope.</p><p><strong>Feed the results back.</strong> The outputs from this cycle become the inputs to the next one. New failure modes get added to the evaluation set. Successful optimizations become the new baseline. The loop tightens with every iteration.</p><p>This is the CI/CD loop applied to AI. It&#8217;s not a metaphor. It&#8217;s the same engineering discipline, adapted for a system where both the code and the data are moving targets.</p><h2>Why teams resist this</h2><p>The most common objection is that it&#8217;s too much infrastructure for the payoff. &#8220;We&#8217;re a small team. We just need to ship features.&#8221;</p><p>I heard the exact same argument against automated testing in 2008. Teams that skipped CI shipped faster for a quarter, then spent the next year debugging integration failures and botched releases. The teams that invested early moved slower at first but compounded their advantage over time.</p><p>The second objection is that AI is different. Models are black boxes. 
You can&#8217;t unit test a neural network the way you test a function.</p><p>This is true and also beside the point. You don&#8217;t need to unit test the model. You need to continuously evaluate the system: the model plus the prompts plus the retrieval plus the post-processing plus the guardrails. The system is what your users interact with. The system is what you can measure and improve.</p><h2>The compounding advantage</h2><p>Teams that close this loop gain something that&#8217;s hard to replicate: a continuously improving evaluation set derived from their actual production traffic. Every week, their benchmarks get more representative. Every iteration catches more edge cases. Every deployment carries less risk because the safety net is woven from real-world data, not synthetic test cases.</p><p>Teams that don&#8217;t close the loop are stuck in the 90s. They test against stale datasets. They deploy model swaps as Big Bang releases. They discover problems from support tickets instead of automated analysis. Each deployment is a gamble, and the odds don&#8217;t improve with time.</p><h2>The gap is closing</h2><p>The tooling for AI CI/CD is maturing fast. Platforms like <a href="https://langfuse.com/">Langfuse</a> give you the trace data. LLM-as-judge frameworks handle automated quality assessment. Prompt management tools support versioning and A/B testing. The pieces exist.</p><p>What&#8217;s missing for most teams isn&#8217;t tooling. It&#8217;s the mindset shift. The recognition that AI systems aren&#8217;t something you build, test, and ship. They&#8217;re something you continuously operate, measure, and improve. That&#8217;s what CI meant for software. It&#8217;s what CI needs to mean for AI.</p><p>The software industry took a decade to move from Big Bang releases to <a href="https://continuousdelivery.com/">continuous delivery</a>. AI systems are being deployed to millions of people right now. We don&#8217;t have a decade. But we do have the playbook. 
We&#8217;ve done this before.</p><p>The question is whether your team is running the 2025 version of the loop, or the 1998 one.</p>]]></content:encoded></item><item><title><![CDATA[Stop Building Against Gold Datasets]]></title><description><![CDATA[Why frozen benchmarks can't measure living systems]]></description><link>https://blog.jetty.io/p/stop-building-against-gold-datasets</link><guid isPermaLink="false">https://blog.jetty.io/p/stop-building-against-gold-datasets</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Sun, 15 Feb 2026 15:55:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FDWx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In grad school, the first thing you learn in any machine learning course is to find a good dataset. MNIST for vision. SQuAD for reading comprehension. GLUE for language understanding. These are &#8220;gold datasets&#8221;: carefully curated, cleanly labeled, academically blessed. They exist so you can focus on the algorithm and avoid getting tangled in messy data problems.</p><p>This makes sense in a classroom. It makes no sense in production. Gold datasets teach you to think about data as something you obtain once, hold still, and measure against. That assumption breaks the moment your system touches real users.</p><h2>The dream of separating code from data</h2><p>Software engineering spent decades pulling code and data apart. FORTRAN had DATA statements that hardcoded values directly into the program. COBOL formalized the separation with its DATA DIVISION. From there, the trajectory was consistent: relational databases, config files, environment variables. 
The <a href="https://12factor.net/config">Twelve-Factor App</a> made it doctrine: &#8220;strict separation of config from code.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FDWx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FDWx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FDWx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FDWx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FDWx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FDWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg" width="1456" height="618" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:618,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:725801,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/188011489?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FDWx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FDWx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FDWx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FDWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bbe4a2d-629c-44f5-a90e-6f086ef11d3c_1584x672.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Then foundation models arrived and broke the assumption entirely.</p><p>An LLM&#8217;s behavior is inseparable from the data it was trained on, the examples in its prompt, the documents retrieved by its RAG pipeline. Change any of these and you change what the system does. The data <em>is</em> the code. 
We&#8217;ve come full circle, back to FORTRAN&#8217;s DATA block, except now the data block is a few billion parameters and a prompt template that gets rewritten every sprint.</p><h2>What gold datasets actually measure</h2><p>Here&#8217;s the uncomfortable truth about the benchmarks the ML community treats as ground truth: they&#8217;re often wrong.</p><p>A <a href="https://arxiv.org/html/2406.04127v1">2024 analysis of MMLU</a>, the benchmark most commonly cited when comparing LLM capabilities, estimated that 6.5% of questions contain errors. In the virology subset, 57% of the analyzed questions had problems: wrong answers, ambiguous phrasing, missing context. The maximum achievable score isn&#8217;t 100%, and nobody agrees on what it actually is.</p><p>When researchers <a href="https://arxiv.org/abs/1707.07328">added adversarial distractor sentences</a> to SQuAD reading comprehension passages, the average F1 score across sixteen published models dropped from 75% to 36%. With ungrammatical adversarial sequences, the average across four models fell to 7%. The models weren&#8217;t reading. They were pattern-matching against a dataset they&#8217;d learned to game.</p><p>The deeper problem is Goodhart&#8217;s Law: when a measure becomes a target, it ceases to be a good measure. Models that score 87% on <a href="https://www.codeant.ai/blogs/test-llm-performance-real-code">HumanEval</a> for code generation drop to around 30% accuracy on real-world codebases with cross-file dependencies, internal frameworks, and legacy patterns. The benchmark says the model is excellent. Production says otherwise.</p><h2>The team tension nobody talks about</h2><p>I watched this dynamic play out repeatedly before and while building <a href="https://www.jetty.io">Jetty</a>.</p><p>The data science team builds a pipeline against a clean CSV from six months ago. The data is tidy. The labels are consistent. The model performs beautifully. Meanwhile, the developer integrating it into production knows that real inputs look nothing like that CSV. Fields are missing. 
Formats are inconsistent. Users submit things nobody anticipated.</p><p>Both sides are acting rationally. The data scientist needs controlled conditions to iterate on the model. The developer needs to ship something that works for actual users. The gold dataset becomes a shared fiction: everyone references it, nobody fully trusts it, and the gap between lab performance and production reality widens with every sprint.</p><p>This isn&#8217;t a communication problem. It&#8217;s a structural one. Gold datasets encode the assumption that you can fix your data, fix your model, and measure once. AI systems break that assumption. The data distributions shift, the models get updated, the retrieval systems change, the code evolves. A frozen dataset tells you how the system performed against a historical snapshot, not how it performs right now.</p><h2>Data collection is not a phase</h2><p>The alternative isn&#8217;t &#8220;better gold datasets.&#8221; It&#8217;s abandoning the concept entirely and treating data collection as a continuous, live process wired into production.</p><p><a href="https://arxiv.org/abs/2403.16795">Shreya Shankar&#8217;s research</a> put it bluntly: &#8220;We have no idea how models will behave in production until production.&#8221; Her interviews with ML engineers across chatbots, autonomous vehicles, and finance found a consistent pattern: the teams that succeed close the loop fastest, continually cycling between data collection, experimentation, staged evaluation, and monitoring.</p><p>This is the <a href="https://www.nvidia.com/en-us/glossary/data-flywheel/">data flywheel</a>. Your product generates data. You analyze that data to find where the system fails. You fold those failures back into your evaluation sets and training data. 
Each cycle makes the next one more valuable, because you&#8217;re measuring against reality, not a proxy for it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hyZM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hyZM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hyZM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hyZM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hyZM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hyZM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:398355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/188011489?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hyZM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hyZM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hyZM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hyZM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb6c3cb-3c8a-4238-8389-ae1335eb707e_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><a href="https://hamel.dev/blog/posts/evals/">Hamel Husain</a> makes a practical version of this argument: start with error analysis. Look at your production data. Categorize failures. Write evals that catch the failures you&#8217;ve actually seen. He warns against chasing high pass rates. If you&#8217;re at 100%, you&#8217;re not stress-testing your system. A 70% on meaningful, production-derived evals tells you more than 95% on a benchmark that doesn&#8217;t reflect your users.</p><h2>But what about PII?</h2><p>The main objection I hear is privacy. &#8220;We can&#8217;t use production data &#8212; it contains PII. Regulatory compliance makes this impossible.&#8221;</p><p>This is real. But it&#8217;s not a reason to default to stale snapshots. It&#8217;s a reason to invest in the infrastructure that makes live data collection safe. 
Anonymization pipelines. Differential privacy. Synthetic data generation from production distributions. None of this is trivial, but the alternative is worse. You&#8217;re not protecting users by ignoring how your system behaves in the wild. You&#8217;re deferring the risk.</p><p>The privacy experts I&#8217;ve worked with are far more worried about uninstrumented systems that can&#8217;t detect when they fail on a vulnerable population than about well-designed systems that safely collect production signals. Avoiding production data entirely isn&#8217;t caution. It&#8217;s the opposite of it.</p><h2>The first step</h2><p>If your team is still building against a gold dataset, here&#8217;s where to start: pick one pipeline that matters, instrument it to capture real inputs and outputs, and run your existing evals against production data instead of your test set. You will be surprised by the gap. Production is weirder, messier, and more varied than any curated dataset can capture. That gap is the information you need. Gold datasets hide it. Live data reveals it.</p><p>The question isn&#8217;t whether your gold dataset is good enough. 
It&#8217;s whether you can afford to keep pretending that a frozen snapshot tells you anything useful about a system that never stops changing.</p>]]></content:encoded></item><item><title><![CDATA[Observability Won’t Save Your AI System]]></title><description><![CDATA[Moving beyond dashboards]]></description><link>https://blog.jetty.io/p/observability-wont-save-your-ai-system</link><guid isPermaLink="false">https://blog.jetty.io/p/observability-wont-save-your-ai-system</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Fri, 13 Feb 2026 15:16:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pCvK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>72% of API calls were redundant. The data was sitting right there in Langfuse &#8212; every trace, every duplicate input, every wasted dollar &#8212; for months. Nobody noticed.</p><p>This wasn&#8217;t a startup running on duct tape. It was a production extraction service with proper instrumentation, dashboards, alerts. They had observability. What they didn&#8217;t have was anyone systematically analyzing what the observability was showing them.</p><p>I see teams invest real effort getting traces flowing into Langfuse or Arize or whatever platform they&#8217;ve chosen. They build dashboards. They set up alerts on latency and error rates. Then they treat the problem as solved.</p><p>It isn&#8217;t.</p><h2>The dashboard paradox</h2><p>Dashboards are great at confirming what you already suspect:</p><ul><li><p>Latency spiking? Check the dashboard.</p></li><li><p>Heard about an outage? Check the dashboard.</p></li></ul><p>But dashboards are terrible at surfacing what you don&#8217;t know to look for.</p><p>The duplicate calls weren&#8217;t hiding. 
They were in plain sight &#8212; scattered across thousands of individual traces. Any engineer could have pulled the data, grouped by input hash, and seen the duplication. But why would they? The system was working. Requests went in, responses came out. The dashboard showed green.</p><p>When was the last time you looked at your traces and found something you didn&#8217;t already know?</p><p>That&#8217;s the gap. Not data collection &#8212; the tools handle that well. The gap is between <em>having</em> data and <em>acting</em> on data. Between seeing traces individually and understanding what they mean in aggregate.</p><h2>We&#8217;ve been here before</h2><p>The software industry went through this exact evolution over the past two decades.</p><p><strong>First came logs.</strong> Teams shipped code and hoped for the best. When something broke, you SSH&#8217;d into a box and tailed a log file. I spent years convincing engineering leaders that widespread monitoring was worth the investment. Eventually it became table stakes.</p><p><strong>Then came the dashboards:</strong> Nagios, Munin, then Datadog and New Relic. CPU, memory, request rates, error counts. You could see problems faster. But you still had to know what to look for.</p><p>More recently, we have <strong>OpenTelemetry and APM</strong> (application performance monitoring). Tools that don&#8217;t just collect metrics but trace requests end-to-end, correlate events across services, and surface anomalies automatically. The shift wasn&#8217;t more data: it was <strong>smarter analysis of the data you already had</strong>.</p><p>And finally, <strong>automated remediation</strong>. Auto-scaling, self-healing infrastructure, chaos engineering. Systems that didn&#8217;t just detect problems but responded to them. Each layer built on the one before it. Nobody skipped from logs straight to auto-remediation. 
But nobody stopped at logs either.</p><h2>AI observability is stuck at layer one</h2><p>Most AI teams today are somewhere between logs and monitoring. They&#8217;ve got traces flowing. They&#8217;ve got dashboards. Some have alerts. This is genuinely important work &#8212; platforms like Langfuse have made it dramatically easier to see what&#8217;s happening inside LLM-powered systems.</p><p>But it&#8217;s layer one.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pCvK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pCvK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 424w, https://substackcdn.com/image/fetch/$s_!pCvK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 848w, https://substackcdn.com/image/fetch/$s_!pCvK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!pCvK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!pCvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png" width="586" height="554.6071428571429" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1060,&quot;width&quot;:1120,&quot;resizeWidth&quot;:586,&quot;bytes&quot;:137536,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lebensold.substack.com/i/187818513?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pCvK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 424w, https://substackcdn.com/image/fetch/$s_!pCvK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 848w, https://substackcdn.com/image/fetch/$s_!pCvK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 1272w, 
https://substackcdn.com/image/fetch/$s_!pCvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f9e178-0c8d-4392-a3fd-50c334a4bc7f_1120x1060.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A single trace looks fine. Ten thousand traces reveal that 40% of your spend goes to resending the same system prompt on every conversational turn. Your error rate is 5% overall &#8212; sounds acceptable &#8212; until you break it apart and find one pipeline step fails 40% of the time, masked by the steps that never fail. 
You know your monthly LLM bill, but not which workflow drives it, which model version inflates it, or which operations could drop to a cheaper model without quality loss.</p><p>Your system worked great three months ago. Something changed &#8212; a model version, a data distribution, a prompt template &#8212; and quality degraded slowly enough that no alert fired. Layer one doesn&#8217;t watch for gradual shifts. It watches for threshold breaches.</p><h2>The action gap</h2><p>I&#8217;ve analyzed traces from lots of production AI systems and the pattern is remarkably consistent: every project has observability in place, and every project has obvious optimization opportunities sitting unnoticed in the data.</p><p>The median finding is 30&#8211;60% cost savings from fixes that take less than an hour. A response cache here. A model downgrade there. A prompt caching config change. Pin a model version instead of running five old ones simultaneously.</p><p>These aren&#8217;t exotic fixes. They&#8217;re the kind of thing any senior engineer would implement immediately &#8212; if someone pointed them out. The problem isn&#8217;t capability. It&#8217;s attention. Nobody&#8217;s job is to stare at every trace and notice that the same input keeps showing up.</p><p>Observability tells you what happened. Analysis tells you what it means. <strong>The layer most teams are missing is the one that turns analysis into specific, prioritized actions.</strong></p><h2>What the next layer looks like</h2><p>The APM analogy isn&#8217;t just historical color. It&#8217;s predictive. The same evolutionary pressure that pushed software monitoring toward automated analysis is now pushing AI observability in the same direction.</p><p>Layer two is automated analysis. Not more dashboards &#8212; systematic examination of traces that surfaces patterns humans miss. Redundancy detection. Cost decomposition. Error clustering. Quality drift measurement. 
The kind of analysis you&#8217;d do if you had infinite time and perfect attention, run continuously against your production data.</p><p>Layer three is automated action. Analysis produces recommendations. Recommendations become pull requests. A system that doesn&#8217;t just tell you &#8220;you&#8217;re spending too much on model X&#8221; but opens a PR that swaps it for a cheaper alternative and shows you the quality comparison.</p><p>We&#8217;re not at layer three yet. But layer two is here, and most teams haven&#8217;t adopted it.</p><h2>The uncomfortable question</h2><p>When was the last time you did a systematic analysis of your traces? Not a dashboard check. Not an incident investigation. A deliberate, comprehensive review of what your system is actually doing.</p><p><strong>If the answer is &#8220;never&#8221; or &#8220;I&#8217;m not sure,&#8221; you&#8217;re not alone.</strong> That&#8217;s almost everyone. And it means there are patterns in your data &#8212; waste, errors, drift &#8212; that you haven&#8217;t found yet.</p><p>Observability was the right first step. But the teams that pull ahead won&#8217;t be the ones with the best dashboards. They&#8217;ll be the ones that close the gap between seeing and acting.</p>]]></content:encoded></item><item><title><![CDATA[We analyzed 100K+ Langfuse traces. 
Here’s what’s hiding in production.]]></title><description><![CDATA[Model version drift: the quiet tax doubling your OpenAI bill]]></description><link>https://blog.jetty.io/p/we-analyzed-100k-langfuse-traces</link><guid isPermaLink="false">https://blog.jetty.io/p/we-analyzed-100k-langfuse-traces</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Wed, 11 Feb 2026 15:19:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d58f3322-49c6-427b-b149-1e2ac60750ce_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve been helping folks with a variety of production Langfuse projects&#8212;including voice AI, agentic tooling, analytics platforms, healthcare pipelines, and data extraction services. Running over 100,000 traces, representing thousands of dollars in monthly LLM spend, through our automated analysis pipeline revealed findings that were far worse than we expected.</p><p>The problem wasn&#8217;t that any single project was performing an unusual task, but that <strong>the same patterns of inefficiency were consistently showing up</strong> everywhere. These issues were visible in the traces the entire time, yet they were sitting on dashboards that <strong>nobody was looking at closely enough</strong>.</p><p><em>Connect your Langfuse and <a href="https://www.jetty.io">get a PR that saves you money with Jetty</a>.</em></p><h2>Most projects have a caching problem</h2><p>We discovered one extraction service that sent the exact same input to an LLM over 70% of the time. The same file with the same content was processed repeatedly, producing the same result each time, with no caching or deduplication in place. Each redundant call cost a few cents and took about a minute, when a simple cache lookup could have returned the result instantly.</p><p>Worse still, these redundant calls led to cascading failures. Burst traffic would overwhelm the upstream API, triggering timeout errors, with every single failure occurring on an input that had already been successfully processed dozens of times. The solution is a straightforward, hash-based response cache, often around 30 lines of code.</p><p><strong>What to check:</strong> Are your LLMs repeatedly processing the same inputs? Implementing even simple hash-based deduplication can dramatically cut both spend and failure rates.</p><h2>System prompts are a hidden cost multiplier</h2><p>For those using conversational AI frameworks like VAPI, it is easy to overspend without even realizing it. In multiple projects, we observed that every turn included the full system prompt&#8212;thousands of characters of business context and behavioral instructions&#8212;repeated identically each time. In a 50-turn phone conversation, that prompt is sent 50 times. Consequently, over 90% of the token spend in these projects was dedicated to resending the system prompt. The most expensive single conversation we found, despite not being particularly long, cost $0.36. 
Prompt caching, which has been available from OpenAI since late 2024 and is supported by most major providers, can cut input token costs by 60&#8211;70%. This represents a meaningful monthly saving achievable through a single configuration change.</p><p><strong>What to check:</strong> Examine your input-to-output token ratio. If input tokens dominate by 100:1 or more, you are likely paying to resend the same context with every turn.</p><h2>You&#8217;re probably using the wrong model</h2><p>This issue appeared in two distinct forms across nearly every project we analyzed.</p><p><strong>Model version drift:</strong> Teams often pin model versions for stability but then forget about them. We found projects simultaneously running half a dozen versions of GPT-4o, with the older versions costing double the newer ones despite handling the same prompts and delivering the same quality.</p><p><strong>Overpowered models for simple tasks:</strong> One project was using a frontier model for intent classification, a task that averages only 70 output tokens. A mini model could handle this at a fraction of the cost. Another project spent over 80% of its budget on the most expensive available model for operations that had no need for its advanced capabilities.</p><p><strong>What to check:</strong> Group your traces by model version and compare the associated costs. Then, look at your simplest operations&#8212;such as classification, routing, and formatting&#8212;and question whether they genuinely require your most capable model.</p><h2>Agentic workflows compound costs fast</h2><p>Agentic workflows that accumulate context over multiple turns generate compounding costs that are difficult to isolate at the dashboard level. As each step adds to the context window, the 20th call in a chain processes far more tokens than the first. We saw traces where <strong>individual generations were processing over 100K input tokens</strong> by the end of a session. 
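</p><p>To make the compounding concrete, here is a back-of-the-envelope sketch. The numbers are illustrative assumptions, not figures from the projects we analyzed: a hypothetical agent that starts with a 2,000-token prompt and appends roughly 5,000 tokens of tool output and conversation to its context at every step. The helper name <code>chain_input_tokens</code> is invented for this example.</p><pre><code class="language-python">def chain_input_tokens(steps, base_tokens, added_per_step):
    """Input tokens read by each call when every step appends to the context.

    Hypothetical model: call i (0-based) re-reads the base prompt plus
    everything the i earlier steps added.
    """
    return [base_tokens + i * added_per_step for i in range(steps)]


# Illustrative numbers only, not measurements from a real trace.
per_call = chain_input_tokens(steps=20, base_tokens=2_000, added_per_step=5_000)
total = sum(per_call)

print(per_call[0])   # input tokens on the first call
print(per_call[-1])  # input tokens on the 20th call
print(total)         # input tokens billed across the whole chain
</code></pre><p>Under those assumptions the first call reads 2,000 input tokens, the 20th reads 97,000, and the chain as a whole is billed for 990,000. Growth per call is linear, but the total grows roughly with the square of the number of steps, which is why an average across short and long sessions hides it.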
</p><p>The single most expensive trace across everything we analyzed&#8212;a single workflow execution&#8212;cost over $50.</p><p>This behavior is inherent to how agentic architectures function. However, <strong>without per-trace cost visibility broken down by step, you have no clear way to determine which specific operations are driving the bill</strong>.</p><p><strong>What to check:</strong> Focus on your most expensive traces, not just your averages. If your p99 cost is ten times your median, context accumulation is the probable cause. Solutions include summarizing intermediate context or routing later steps to cheaper models.</p><h2>Errors are hiding in your traces</h2><p>One project logged a startling 134% error rate, meaning they had more errors than traces in a given month. These were not intermittent blips but systemic failures that had been running for weeks. Another pipeline was hitting 27-minute latencies on individual operations. These numbers&#8212;the error counts&#8212;were present in Langfuse the entire time, but no one had aggregated them by step in a way that made the severity obvious enough to prompt action.</p><p><strong>What to check:</strong> Aggregate your error rates by pipeline step, rather than just looking at the overall rate. An aggregate rate of 5% might be concealing a single step that fails 40% of the time.</p><h2>Some of your spend is probably invisible</h2><p>Across the projects we analyzed, a significant portion of LLM calls showed a cost of $0 in Langfuse. This typically happens when calls are routed through Azure or other deployments that do not report usage data back. We even found one project with zero cost tracking altogether. You cannot optimize what you cannot see. If your Langfuse dashboard costs appear lower than your actual cloud bill, that discrepancy is not a saving&#8212;it is a blind spot.</p><p><strong>What to check:</strong> Compare your Langfuse-reported costs directly against your actual invoices. 
If there is a meaningful gap, determine which providers are failing to report usage data.</p><h2>The gap between observability and optimization</h2><p>Every project we examined had observability in place; they could see their traces, latencies, and model usage. Yet, observability alone was insufficient to surface these problematic patterns. The redundant computation was visible in individual traces, but you had to look at all of them to spot the duplication. The model version tax was hidden in a column nobody thought to group by. The error rate was available, but not aggregated in a manner that compelled anyone to act.</p><p>The gap, therefore, is not in data collection&#8212;which Langfuse handles well&#8212;but in systematically analyzing that data. The difference is between merely seeing traces individually and understanding their collective meaning in aggregate. Through this analysis, we identified savings opportunities ranging from 30% to over 90% of current spend. Crucially, the fixes were not exotic: a response cache, a simple config change for prompt caching, updating a stale pinned model version, or routing simple tasks to smaller models. Most of these solutions required less than an hour of work.</p><p>If you are currently running LLM workloads through Langfuse and have not performed this type of analysis recently, you may be surprised by what is hidden in your traces.</p>]]></content:encoded></item><item><title><![CDATA[AI Breaks Assumptions Behind Software Testing]]></title><description><![CDATA[We need new tools to fix agentic systems.]]></description><link>https://blog.jetty.io/p/ai-breaks-the-two-assumptions-behind</link><guid isPermaLink="false">https://blog.jetty.io/p/ai-breaks-the-two-assumptions-behind</guid><dc:creator><![CDATA[Jonathan Lebensold]]></dc:creator><pubDate>Mon, 09 Feb 2026 21:30:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!B6E6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cf333-50e2-43ea-b074-37304c7162cc_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Automated software testing unlocked the ability for teams to incrementally deliver complex systems at scale. At the heart of this achievement are two assumptions that no longer hold with AI agents:</p><ol><li><p><strong>Keep code and data separate.</strong> Manage them independently.</p></li><li><p><strong>Fix everything and measure it once.</strong> Test at a point in time and trust the results.</p></li></ol><p>AI systems break both. 
We need a new playbook for reliably delivering the next generation of applications.</p><p>We steer the output of AI systems by co-mingling examples and instructions with engineered context: RAG pipelines, prompt templates, post-trained models. An AI agent performs completely differently depending on what information it sees and in what format. Measuring it divorced from production data gives you a proxy, not a result.</p><p>And unit tests were built for a deterministic world where you controlled the entire environment. Today, most AI systems call out to third-party models that can change behavior without notice. Failure is no longer binary. You need to evaluate a <em>distribution</em> of outcomes: confusion matrices, ROC curves, calibration checks. Not just pass/fail.</p><p><strong>The evaluation gap</strong></p><p>A friend&#8217;s startup lost 5% of their customers from a single bad AI deployment. They switched LLM providers &#8211; lured by a cheaper, faster model and a strong public leaderboard ranking &#8211; only to discover that tool-calling APIs behave differently between providers. What worked on one model broke on another. A colleague lost three months of work because a foundation model had been trained on an upstream dataset whose licensing expired. I watched data science teams speak with conviction about the elegance of their algorithms, only to see their work dismantled by a single edge case from a domain expert.</p><p>I founded <a href="https://www.jetty.io">Jetty</a> because I kept watching this gap between lab measurement and production reality destroy real value.</p><p><strong>The agile parallel</strong></p><p>I started my career when the agile movement, unit testing, and continuous delivery were upending &#8220;big bang&#8221; releases and waterfall development. 
Watching AI mature, I saw the same pattern repeating: AI was entering its own continuous delivery moment, and it had none of the infrastructure to support it.</p><p>SaaS took off when thousands of engineers could ship code to production every day. The key unlock was automated testing. Fix the code, fix the inputs, verify the outputs. It worked because the system was deterministic and the environment was under your control.</p><p>But what does automated testing look like when your data and your code are the same thing? What does it mean to test a system when the foundation model powering it is also changing underneath you?</p><p><strong>Everything is moving at once</strong></p><p>Classic software testing fixes code and inputs, then checks outputs. Data science teams do something similar: fix whole datasets and run them against fixed models. Both approaches assume you can hold things still long enough to measure them.</p><p>In production AI systems, nothing holds still. The data keeps changing. The models keep changing. The retrieval systems keep changing. The code is also incrementally changing. You can&#8217;t fix everything and measure it once.</p><p><strong>Bounding risk</strong></p><p>This might feel uncomfortable. If the system isn&#8217;t deterministic, how can you trust it?</p><p>Consider a speech-to-text scribe deployed for a healthcare provider. It works great on Parisian French but fails on Quebecois French. A patient says &#8220;la fin de semaine&#8221; instead of &#8220;le weekend,&#8221; or &#8220;un courriel&#8221; instead of &#8220;un mail,&#8221; and the system misinterprets critical details. A one-time evaluation would never catch this. An iterative evaluation system, one that continuously collects data from the populations actually being served, would catch these failures and fold them into benchmarks for future improvement.</p><p>All risk can be bounded. We trust risk-laden systems every day. 
We drive cars knowing catastrophic outcomes are real, but we&#8217;ve built layered infrastructure (speed limits, seatbelts, crumple zones, insurance) that makes the risk manageable.</p><p>AI systems need equivalent layers. Some live in production: observability, telemetry, evaluation datasets wired into CI/CD pipelines. Others run safely outside it: benchmarks, trace labeling, synthetic dataset generation. The key is closing the loop. Safely and privately feeding development teams the signals they need to understand how their system actually performs in the field.</p><p>The agile movement took a decade to go from manifesto to mainstream practice. AI systems are being deployed to millions of people right now. I don&#8217;t think we have ten years to close the AI evaluation gap.</p>]]></content:encoded></item></channel></rss>