How to Reliably Generate Content
The review bottleneck nobody rebuilt after AI removed the production one
A marketing team I know used to produce about 50 assets per campaign. Social cards, email variations, ad units. A designer, a copywriter, and a brand manager reviewed every piece. Slow, but reliable.
Last quarter they generated over 4,000 assets.
Same team size. Same approval process — in theory. In practice, three people can’t review 4,000 assets. They spot-checked, trusted the prompts, and shipped.
This is everywhere now. Coca-Cola’s Fizzion platform produces personalized imagery across 100+ markets. Moët Hennessy scaled to 3 million content variations globally. Generation capacity is effectively unbounded. Review capacity hasn’t changed at all.
What happens when nobody’s checking
We don’t have to imagine this. Coca-Cola’s AI-generated holiday campaign drew immediate backlash — “soulless,” “creepy,” characters that looked almost human but not quite. Google Gemini generated historically inaccurate images. Meta launched AI-generated fake profiles and retracted them within days.
These aren’t startups. These are companies with massive brand teams and explicit guidelines. They shipped anyway, because the speed of generation overwhelmed the humans who used to catch these problems.
And a 2025 study from the Nuremberg Institute found that just labeling content as AI-generated makes people view it as less trustworthy. Bad AI content doesn’t just look bad — it actively erodes brand trust.
LLM judges are the only path that scales
If humans can’t review 4,000 assets, the only option is another AI. Uncomfortable conclusion, but it’s arithmetic.
Coca-Cola got here first. Fizzion encodes 140 years of brand rules — the red hue, logo spacing, typography — into machine-readable metadata called StyleID. The AI that generates content is constrained by the AI that evaluates it. It’s now mandatory for all their agency partners.
Jasper’s Brand IQ does something similar: an LLM judge that flags violations and suggests on-brand replacements in real time. Acrolinx scores content against your style guide and tone of voice. These aren’t research projects. They’re production systems.
The pattern is LLM-as-judge: define criteria, feed the output to an evaluator, get a structured score. Does this copy match our voice? Does this image follow our guidelines? Is the tone right for this market? Even if you heavily discount the vendor claims, an automated judge that catches 80% of violations beats the current reality of reviewing 2% and hoping.
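To make the pattern concrete, here is a minimal sketch of the define-criteria, evaluate, structured-score loop. Everything in it is an assumption for illustration: the `Criterion` and `Verdict` types, the prompt wording, the 1-10 scale, and `fake_model`, which stands in for whatever LLM client you actually call. It is not how Fizzion, Brand IQ, or Acrolinx work internally.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # what the judge is asked about the asset


@dataclass(frozen=True)
class Verdict:
    criterion: str
    score: int  # 1 (clear violation) to 10 (fully on-brand); scale is an assumption
    reason: str


def build_prompt(asset_text: str, criterion: Criterion) -> str:
    # Ask for JSON only, so the reply can be parsed into a structured score.
    return (
        "You are a brand reviewer. Answer with JSON only: "
        '{"score": <integer 1-10>, "reason": "<one sentence>"}\n'
        f"Question: {criterion.question}\n"
        f"Asset: {asset_text}"
    )


def judge(
    asset_text: str,
    criteria: List[Criterion],
    call_model: Callable[[str], str],
) -> List[Verdict]:
    """Score one asset against every criterion using the supplied model."""
    verdicts = []
    for criterion in criteria:
        raw = call_model(build_prompt(asset_text, criterion))
        data = json.loads(raw)
        score = max(1, min(10, int(data["score"])))  # clamp to the 1-10 scale
        verdicts.append(Verdict(criterion.name, score, str(data.get("reason", ""))))
    return verdicts


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call: penalizes shouty punctuation.
    if "!!!" in prompt:
        return json.dumps({"score": 3, "reason": "Shouty punctuation."})
    return json.dumps({"score": 8, "reason": "Calm, plain wording."})


CRITERIA = [
    Criterion("brand_voice", "Does this copy match our voice?"),
    Criterion("market_tone", "Is the tone right for this market?"),
]

if __name__ == "__main__":
    for verdict in judge("Our AMAZING!!! sale ends tonight", CRITERIA, fake_model):
        print(verdict)
```

Passing the model in as a plain callable keeps the judge testable without network access. In a real pipeline you would also want to handle replies that are not valid JSON.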
Traceability is the other shoe
There’s a second problem beyond quality: provenance. When you produce 4,000 assets from a base concept, you need to trace any one of them back to the prompt that generated it, the model version, the brand rules, and the judge score it received.
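As a rough sketch of what "trace any one of them back" could look like, the snippet below keys a provenance record on a SHA-256 hash of the asset bytes. The field names, the `ProvenanceIndex` class, and the in-memory dictionary are all hypothetical; a real system would persist these records somewhere durable.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    asset_sha256: str          # content hash, used as the lookup key
    prompt: str                # prompt that generated the asset
    model_version: str         # generator model identifier
    brand_rules_version: str   # which version of the brand rules applied
    judge_score: int           # score the judge assigned


def make_record(
    asset_bytes: bytes,
    prompt: str,
    model_version: str,
    brand_rules_version: str,
    judge_score: int,
) -> ProvenanceRecord:
    digest = hashlib.sha256(asset_bytes).hexdigest()
    return ProvenanceRecord(digest, prompt, model_version, brand_rules_version, judge_score)


class ProvenanceIndex:
    """In-memory index from asset hash to its provenance record."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, ProvenanceRecord] = {}

    def add(self, record: ProvenanceRecord) -> None:
        self._by_hash[record.asset_sha256] = record

    def trace(self, asset_bytes: bytes) -> Optional[ProvenanceRecord]:
        # Hash the asset we were handed and look up how it was made.
        return self._by_hash.get(hashlib.sha256(asset_bytes).hexdigest())
```

Hashing the bytes means any exact copy of the file can be traced without relying on filenames. The trade-off is that a re-encoded or cropped copy gets a different hash, which is the gap watermarking and embedded metadata are meant to cover.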
The EU AI Act’s transparency obligations, which apply from August 2026, require AI-generated content to be marked in a machine-readable format and disclosed. Penalties: up to 15 million euros or 3% of global annual turnover, whichever is higher. Google’s SynthID has already watermarked over 20 billion pieces of AI content. The C2PA standard embeds provenance metadata directly into files.
But watermarking solves attribution, not compliance. The full audit trail needs to connect generation to evaluation to deployment, with every decision logged. If you’ve built production AI systems, this should sound familiar — it’s the same observability infrastructure any LLM pipeline needs.
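One way to picture that audit trail is an append-only log that refuses to record a deployment for an asset that was never evaluated. This is a sketch under assumptions: the three stage names, the `AuditTrail` class, and the in-memory list are illustrative, not a reference to any specific observability product.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed pipeline order: every asset must pass through these stages in sequence.
STAGES = ("generated", "evaluated", "deployed")


@dataclass(frozen=True)
class AuditEvent:
    asset_id: str
    stage: str
    detail: Dict[str, str]


class AuditTrail:
    """Append-only event log that enforces generation -> evaluation -> deployment."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def stages_for(self, asset_id: str) -> List[str]:
        return [e.stage for e in self._events if e.asset_id == asset_id]

    def log(self, asset_id: str, stage: str, **detail: str) -> None:
        done = self.stages_for(asset_id)
        if len(done) >= len(STAGES):
            raise ValueError(f"{asset_id}: trail is already complete")
        expected = STAGES[len(done)]
        if stage != expected:
            # Covers both unknown stage names and skipped steps.
            raise ValueError(f"{asset_id}: expected '{expected}', got '{stage}'")
        self._events.append(AuditEvent(asset_id, stage, dict(detail)))

    def is_complete(self, asset_id: str) -> bool:
        return tuple(self.stages_for(asset_id)) == STAGES
```

With this shape, a call like `trail.log("card-17", "deployed")` fails loudly unless a generation and an evaluation event for `card-17` were logged first, which is the property an auditor would ask about.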
The hard question
Can LLM judges reliably catch the brand violations that actually matter?
Wrong logo, off-palette colours — easy. Any prompted evaluator catches those. The hard stuff is a headline that’s technically on-brand but tonally wrong for the British market. An image that follows every guideline but feels cheap. Copy that sounds robotic despite using approved terminology.
These are judgment calls from years of accumulated context — exactly the kind of tacit knowledge that’s hardest to encode, and exactly the kind of failure that does the most damage.
The teams doing this well use judges as a filter, not a replacement. Catch the mechanical violations automatically. Route flagged assets to humans with specific notes. The judge says: “This scored 6/10 on brand voice. Here’s why.” Software teams already work this way: automated tests catch regressions; code review catches design issues. You don’t skip either one, and you don’t run the automated tests by hand.
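The filter idea can be sketched in a few lines: anything scoring below a threshold goes to a human, along with the judge's reasons. The threshold of 7, the `Score` type, and the two route names are assumptions chosen for illustration; where to set the cutoff is a judgment call for each team.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Score:
    criterion: str
    score: int   # 1-10, as returned by the judge
    reason: str


REVIEW_THRESHOLD = 7  # assumed cutoff: anything below this goes to a person


def route(scores: List[Score], threshold: int = REVIEW_THRESHOLD) -> Tuple[str, List[str]]:
    """Return the route for an asset plus the notes a human reviewer should see."""
    notes = [
        f"{s.criterion} scored {s.score}/10: {s.reason}"
        for s in scores
        if s.score < threshold
    ]
    if notes:
        return "human_review", notes
    return "auto_approve", []
```

The reviewer gets only the failing criteria and the reasons, so their time goes to the judgment calls rather than to re-checking logo spacing.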
This is what we build at Jetty. You define your brand rules as judge criteria, point it at your content pipeline, and the LLM judge scores every asset — then hill-climbs to improve the ones that fall short. Minutes to set up, not months.
The content industry built incredible generation capacity and forgot to build the CI pipeline. The teams that close that gap first won’t just generate better content — they’ll be able to prove it’s on-brand at a volume where proof actually matters.
The rest will keep spot-checking 2% and waiting for the next PR crisis.


