A Pelican Learns to Ride
What happens when every iteration of a benchmark is on disk
People ask, occasionally, why our logo is a pelican. A company called Jetty, a bird on the dock — there must be a story. The honest answer is anticlimactic, and I’ll get to it.
The reason the question comes up at all is Simon Willison. For the last year or so, he’s been asking every new model to generate an SVG of a pelican riding a bicycle and posting the results. It’s become the unofficial visual reasoning test for frontier models. So when people see the Jetty mascot, the assumption is that we picked the pelican as a wink at the benchmark.
We didn’t. Our designer worked through a stack of jetty-adjacent motifs — boats, ropes, lighthouses, a few different seabirds — and the pelican was the one the team kept coming back to. That decision predates the benchmark mattering to us. The overlap is coincidence. A good one, but coincidence.
It is, however, the kind of coincidence you don’t ignore.
So we ran the benchmark
Simon’s test is a good one. It exercises layout, anatomy, two distinct subjects, motion, and the model’s ability to keep coherent state across hundreds of XML elements. It’s visual enough that you can tell at a glance whether the model got it.
We ran the same task eighteen times — eleven times with one model while iterating the prompt, seven times with one prompt while iterating the model — and captured every trajectory. The result is at pelicans.jetty.bot. Seven agent/model combinations. Seventy-one runbooks across seven lineages. Every iteration replayable.
The point isn’t the pelicans. The point is what you can do once each attempt is a structured artifact instead of a screenshot in a tweet.
Three views
The Climb ranks all seven agent/model combos on the same v1 runbook. Scatter chart, best-of-ten table. You can see the shape of how each agent’s score moved across its ten rounds — not as a final number, but as a curve.
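To make the best-of-ten idea concrete, here is a minimal sketch of how such a table could be computed. The agent names and per-round scores are invented for illustration; they are not the numbers from the site, and this is not necessarily how the site computes its table.

```python
# Hypothetical per-round rubric totals (0-40) for three made-up lineages.
# These values are illustrative only, not results from the actual runs.
scores = {
    "agent-a": [22, 25, 31, 30, 33, 35, 34, 36, 36, 37],
    "agent-b": [18, 29, 36, 34, 33, 35, 32, 34, 33, 35],
    "agent-c": [20, 21, 23, 26, 27, 29, 30, 31, 33, 34],
}


def best_of(rounds):
    """Return (best score, 1-based round in which it first appeared)."""
    best = max(rounds)
    return best, rounds.index(best) + 1


def leaderboard(scores):
    """Rank agents by best score, breaking ties by the earliest best round."""
    rows = [(agent, *best_of(rounds), rounds[-1]) for agent, rounds in scores.items()]
    return sorted(rows, key=lambda row: (-row[1], row[2]))


for agent, best, best_round, final in leaderboard(scores):
    print(f"{agent}: best {best} (round {best_round}), final {final}")
```

Keeping the round of the best score next to the final score is what separates a lineage that peaks early and drifts from one that climbs steadily, which a single headline number would hide.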
Head-to-Head is a filmstrip viewer. Arrow keys step through rounds; up/down switches agents. You’re watching the same task unfold ten times in parallel for each lineage, and the convergence is its own kind of evidence. Some lineages crawl. Some snap into shape on round three and then drift.
Runbook Diffs is the view I keep coming back to. The runbook carries a baseline SVG embedded as a seed — so iterating the runbook means editing that seed plus the targeted prompt asks. Pick any two versions across the seven lineages and diff them line by line, side-by-side or unified. It looks exactly like reviewing a pull request. Because that’s what it is.
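If the runbooks are plain text on disk, a unified diff between two versions needs nothing beyond Python's standard library. The sketch below assumes a runbook is just a seed SVG followed by prompt asks; the two example versions and their names are made up, and the real runbook format may differ.

```python
import difflib


def runbook_diff(old_text, new_text, old_name="runbook-v4", new_name="runbook-v7"):
    """Return a unified diff between two runbook versions as one string."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    )


# Hypothetical runbook contents: an embedded seed SVG plus one prompt ask.
v4 = '<svg>\n  <ellipse id="body" rx="40" ry="25"/>\n</svg>\nAsk: draw both wheels.\n'
v7 = '<svg>\n  <ellipse id="body" rx="40" ry="30"/>\n</svg>\nAsk: draw both wheels and a visible chain.\n'

diff = runbook_diff(v4, v7)
print(diff)
```

Because the seed SVG and the prompt asks live in the same file, one diff shows both kinds of edit in the same review.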
The judge gap
A gemini-cli run scored 40 out of 40 on the rubric. Pelican, bicycle, composition, polish — four perfect tens.
Then we re-judged the same SVG with Claude Sonnet. It came back at 37.
Same image. Same four-axis rubric. Different judge. Gemini Flash, used as the judge, awards full marks freely. Sonnet, scoring the same SVG against the same axes, caps out around 37 and rarely gives a perfect score. The temptation is to call one of them right. Neither of them is. They’re calibrated differently. Flash’s 40 isn’t the same number as Sonnet’s 40, and pretending otherwise is how leaderboards lie.
This is the practitioner problem with LLM-as-judge that nobody puts on the box. At some point you’ll want to compare scores across two evaluations (same rubric, different runs) and reach a conclusion. If the judge changed underneath you, even silently, the numbers are no longer on the same scale. You either lock the judge for the lifetime of the comparison, or you design the rubric so that swapping judges doesn’t change the scores. Most rubrics don’t survive that test.
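One way to lock the judge is to store the judge's identity next to every score and refuse to subtract scores that came from different judges. This is a sketch of that idea, not how our pipeline works: the `Score` record, the judge identifiers, and the rubric label are hypothetical, and the per-axis split behind the 37 is invented (only the totals of 40 and 37 come from the runs described above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    run_id: str
    judge: str    # identifier of the judging model (hypothetical naming)
    rubric: str   # rubric version label (hypothetical)
    axes: tuple   # per-axis scores: pelican, bicycle, composition, polish

    @property
    def total(self):
        return sum(self.axes)


def delta(a, b):
    """Return b.total - a.total, but only when judge and rubric both match."""
    if a.judge != b.judge or a.rubric != b.rubric:
        raise ValueError(
            f"not comparable: {a.judge}/{a.rubric} vs {b.judge}/{b.rubric}"
        )
    return b.total - a.total


flash = Score("run-1", "gemini-flash", "rubric-v1", (10, 10, 10, 10))
# Invented axis breakdown; only the total of 37 is reported in the text.
sonnet = Score("run-1", "claude-sonnet", "rubric-v1", (10, 9, 9, 9))

try:
    delta(flash, sonnet)
except ValueError as err:
    print(err)  # the comparison is rejected instead of silently yielding -3
```

The guard does not make the two judges agree. It only stops a 40 and a 37 from being treated as points on the same scale.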
We didn’t fix it. We just made the gap visible.
What this kind of artifact is for
There’s a broader thing here that has nothing to do with pelicans.
Most benchmark posts give you a headline number and a screenshot. Sometimes a GitHub repo with the prompts. What you can’t do is replay the attempt, see what the model produced on round three before converging, or diff version four of the runbook against version seven and ask which edit moved the score. The interesting questions live in the trajectories, and the trajectories don’t exist.
When every iteration is captured, the benchmark becomes inspectable in the way code is inspectable. The runbook is the source. The trajectory is the build artifact. The diff between two runbook versions is a PR you can review, comment on, and revert.
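As a rough illustration of what "inspectable" buys you, here is a sketch that pairs each runbook version with the artifact it produced and lists how the score moved between consecutive versions. The `Iteration` record and every value in the example history are assumptions made for illustration, not the schema or data behind the site.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    version: int      # runbook version number
    runbook: str      # the source: seed SVG plus prompt asks
    output_svg: str   # the build artifact the agent produced
    score: int        # rubric total from one fixed judge


def score_moves(history):
    """Return (from_version, to_version, score change) for consecutive versions."""
    ordered = sorted(history, key=lambda it: it.version)
    return [
        (prev.version, cur.version, cur.score - prev.score)
        for prev, cur in zip(ordered, ordered[1:])
    ]


# Made-up history; contents are placeholders.
history = [
    Iteration(1, "seed + ask A", "<svg/>", 24),
    Iteration(2, "seed + ask A, B", "<svg/>", 31),
    Iteration(3, "seed' + ask A, B", "<svg/>", 29),
]

for old, new, change in score_moves(history):
    print(f"v{old} -> v{new}: {change:+d}")
```

A negative step is the cue to open the diff between those two versions and consider reverting the edit.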
The pelican project is small enough to fit on one page and weird enough to share. The structure underneath is the part worth stealing.





