Meter Before You Manage
The three layers between your LLM bill and actually fixing it
Your CFO pings you about a $47,000 “OpenAI API” line item. You know it’s high. You’ve known for months. But when they ask which features drive the spend, you can’t answer. Not because you’re hiding something. Because you don’t know.
Which endpoint? Which model? Which user flow? The monthly invoice is a single opaque line, and every conversation about it ends the same way: “We’ll optimize later.”
Later never comes. Optimization without instrumentation is guessing, and guessing doesn’t get sprint tickets.
The restaurant with no menu prices
Imagine running a restaurant where you can see total food cost each month but not which dishes drive it. You spent $40,000 on ingredients. Is the wagyu steak killing you, or is it the house salad that secretly costs twice what you charge? Cut portion sizes? Switch suppliers? Drop a menu item? Without per-dish cost data, every move is a guess.
The generic LLM cost advice floating around is the same kind of guessing. “Use smaller models.” For which calls? “Cache repeated queries.” Which queries repeat? “Shorten your prompts.” Which ones, and by how much?
Good strategies, all of them. None are actionable without metering.
What metering actually means
Metering isn’t observability. You might already have traces flowing into Langfuse or Arize. Good. But traces tell you what happened on individual requests. Metering tells you what your system costs in aggregate, broken down by dimensions you can act on.
Three layers, each building on the last.
Layer 1: The toll booth. Every LLM call goes through a proxy that records which model it hit, how many tokens it consumed, what it cost, and who requested it. LiteLLM is the most common choice here. One proxy, every provider, every call metered. No more reconciling your OpenAI bill against your Anthropic bill against your Azure bill in a spreadsheet. One ledger.
If your monthly cost conversation starts with someone logging into three different provider dashboards and adding numbers up manually, you’re at layer zero. Setting up a proxy is a day of work, maybe two. Per-team API keys give you basic allocation immediately.
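A toll booth can be sketched in a few lines. This is not LiteLLM's API, just a minimal illustration of what the ledger records: who called, which model, how many tokens, and what it cost. The model names, prices, and team names below are all hypothetical placeholders; swap in your providers' current price sheets.

```python
from dataclasses import dataclass, field

# Hypothetical model names and prices (USD per million input/output tokens).
PRICES_PER_MILLION = {
    "flagship-model": (2.50, 10.00),
    "budget-model": (0.15, 0.60),
}


@dataclass
class LedgerEntry:
    owner: str          # who requested the call, e.g. a per-team API key
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class Ledger:
    entries: list = field(default_factory=list)

    def record(self, owner: str, model: str,
               input_tokens: int, output_tokens: int) -> LedgerEntry:
        """Price one call from its token counts and append it to the ledger."""
        in_price, out_price = PRICES_PER_MILLION[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        entry = LedgerEntry(owner, model, input_tokens, output_tokens, cost)
        self.entries.append(entry)
        return entry

    def total_by(self, attribute: str) -> dict:
        """Sum cost grouped by any entry attribute, such as 'owner' or 'model'."""
        totals: dict = {}
        for entry in self.entries:
            key = getattr(entry, attribute)
            totals[key] = totals.get(key, 0.0) + entry.cost_usd
        return totals


ledger = Ledger()
ledger.record("team-search", "flagship-model", input_tokens=1_200, output_tokens=300)
ledger.record("team-support", "budget-model", input_tokens=1_200, output_tokens=300)
print(ledger.total_by("owner"))
print(ledger.total_by("model"))
```

In a real setup the proxy fills in the token counts from each provider response. The point is that every call lands in one place with an owner attached.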
Layer 2: Attribution. Raw metering gives you totals by model and endpoint. Attribution gives you totals by business dimension. “The FAQ chatbot costs $18,000 a month.” “The document extraction pipeline is 60% of our spend.” “Team X’s experimental feature costs more than the core product.”
This is where Langfuse traces become powerful. Not as individual request debuggers, but as the data source for cost attribution. Tag traces by feature, team, and customer tier. Aggregate. Now you have a menu with prices.
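The aggregation step is simple once the tags exist. Here is a rough sketch, assuming you have exported per-request cost records as plain dictionaries from whatever tracing tool you use. The field names (`feature`, `team`, `tier`, `cost_usd`) and the sample values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical tagged cost records; in practice these come from your trace export.
records = [
    {"feature": "faq-chatbot", "team": "support", "tier": "free", "cost_usd": 6.0},
    {"feature": "doc-extraction", "team": "platform", "tier": "enterprise", "cost_usd": 3.0},
    {"feature": "faq-chatbot", "team": "support", "tier": "enterprise", "cost_usd": 1.0},
    {"team": "platform", "tier": "free", "cost_usd": 2.0},  # missing feature tag
]


def cost_by(rows, dimension):
    """Total cost per value of one tag, largest first. Untagged rows stay visible."""
    totals = defaultdict(float)
    for row in rows:
        totals[row.get(dimension, "untagged")] += row["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def share_by(rows, dimension):
    """Each tag value's fraction of total spend."""
    totals = cost_by(rows, dimension)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: value / grand_total for key, value in totals.items()}


print(cost_by(records, "feature"))
print(share_by(records, "team"))
```

Keeping an explicit "untagged" bucket is a deliberate choice: the size of that bucket tells you how much of the bill you still can't explain.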
Layer 3: Optimization with evidence. Once you can see that the FAQ chatbot costs $18,000 a month, you can ask the right questions. Why so much? It handles 500,000 requests a month at 1,500 tokens average, mostly system prompt, on GPT-4o. Someone picked that model eight months ago and nobody has revisited the choice since.
Now “use a smaller model” is actionable. Route FAQ classification to GPT-4o-mini and that $18,000 drops to under $3,000. Run the experiment. Measure quality. Show your CFO the before-and-after.
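The back-of-envelope version of that projection is a price ratio. The sketch below assumes token volumes stay the same after the switch and that the two models have the same input-to-output price ratio, so one pair of prices is enough. The per-million-token prices are assumed list prices and may be out of date; check the current price sheet before quoting a number to finance.

```python
def projected_cost(current_monthly_cost: float,
                   current_price: float,
                   candidate_price: float) -> float:
    """Scale a known monthly cost by the ratio of per-token prices.

    Assumes the same token volume and the same input/output mix on both models.
    """
    if current_price <= 0:
        raise ValueError("current_price must be positive")
    return current_monthly_cost * candidate_price / current_price


# Assumed input prices in USD per million tokens; verify before relying on them.
FLAGSHIP_INPUT_PRICE = 2.50
BUDGET_INPUT_PRICE = 0.15

current = 18_000.0  # the FAQ chatbot's measured monthly cost
estimate = projected_cost(current, FLAGSHIP_INPUT_PRICE, BUDGET_INPUT_PRICE)
print(f"projected: ${estimate:,.0f}/month, saving ${current - estimate:,.0f}")
```

Under those assumptions the projection comes out around $1,080 a month, comfortably under the $3,000 mark. It is an estimate to justify the experiment, not a substitute for running it and measuring quality.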
Without layers 1 and 2, layer 3 is a fantasy. Every optimization tip on the internet assumes you’ve done the metering work. Almost nobody has.
The scale of what’s hiding
Once you have metering, the waste is hard to miss.
Roughly a third of LLM queries are semantically similar to previous ones. If those queries cost about as much as the rest, that’s a third of your spend going to questions you’ve already answered. Semantic caching can cut 60-70% of those costs and drop latency on cache hits from hundreds of milliseconds to tens of milliseconds. But you can’t build a cache policy without knowing which queries repeat.
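A first pass at "which queries repeat" doesn't need an embedding model. The sketch below measures the repeat rate in a query log using word overlap (Jaccard similarity) as a crude lexical stand-in for semantic similarity. It will miss paraphrases that a real semantic cache would catch, and the 0.8 threshold is an arbitrary starting point, not a recommendation.

```python
import re


def normalize(query: str) -> frozenset:
    """Lowercase, strip punctuation, and return the set of words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", query.lower())
    return frozenset(cleaned.split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Word-set overlap: 1.0 means identical sets, 0.0 means nothing shared."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def repeat_rate(queries, threshold: float = 0.8) -> float:
    """Fraction of queries that closely match an earlier, distinct query.

    Quadratic in the number of distinct queries, so run it on a sample.
    """
    seen = []
    repeats = 0
    for query in queries:
        words = normalize(query)
        if any(jaccard(words, earlier) >= threshold for earlier in seen):
            repeats += 1
        else:
            seen.append(words)
    return repeats / len(queries) if queries else 0.0


sample_log = [
    "How do I reset my password?",
    "how do i reset my password",
    "What is your refund policy?",
    "How do I reset my password!",
]
print(f"repeat rate: {repeat_rate(sample_log):.0%}")
```

Run it per feature rather than across the whole log. A high repeat rate on one endpoint is a cache candidate; a low one tells you caching won't pay for itself there.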
The price gap between model tiers is 20-30x, not 2x. Flagship models like GPT-4o run $2.50-10 per million tokens, depending on whether you’re counting input or output. Budget-tier alternatives like GPT-4o-mini or DeepSeek run a fraction of that. For classification, routing, and formatting tasks, the cheaper model often matches the expensive one. But “often” isn’t “always,” and you need per-task quality metrics to know which calls can safely move down.
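One simple per-task quality metric is agreement: replay a sample of real requests through both models and count how often the cheap one gives the same answer. The sketch below assumes you already have those paired outputs. Exact-match comparison only makes sense for tasks with discrete answers such as classification or routing; the task names, sample data, and 0.95 cutoff are all illustrative.

```python
from collections import defaultdict


def agreement_by_task(samples):
    """samples: iterable of (task, flagship_output, budget_output) tuples.

    Returns the fraction of samples per task where both outputs match
    after trimming whitespace and ignoring case.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for task, flagship_output, budget_output in samples:
        totals[task] += 1
        if flagship_output.strip().lower() == budget_output.strip().lower():
            matches[task] += 1
    return {task: matches[task] / totals[task] for task in totals}


def safe_to_downgrade(rates, min_agreement: float = 0.95):
    """Tasks whose agreement rate meets the (arbitrary) cutoff."""
    return sorted(task for task, rate in rates.items() if rate >= min_agreement)


# Hypothetical paired outputs from replaying logged requests through both models.
paired_outputs = [
    ("intent-routing", "billing", "Billing"),
    ("intent-routing", "refund", "refund "),
    ("sentiment", "negative", "neutral"),
    ("sentiment", "positive", "positive"),
]
rates = agreement_by_task(paired_outputs)
print(rates)
print(safe_to_downgrade(rates))
```

For free-text tasks you would need a different comparison, such as human review or a graded rubric, but the shape of the decision is the same: one number per task, and a threshold you agreed on before looking at the savings.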
These aren’t exotic findings. They’re the baseline state of most production AI systems. The waste is there whether you can see it or not. Metering just makes it visible.
Why “we’ll optimize later” never happens
I’ve talked to dozens of engineering leads who have “LLM cost optimization” on their roadmap. Not one has a sprint ticket for it.
The problem is activation energy. Without metering, optimization has no clear starting point. You’d need to audit every LLM call, figure out which models each endpoint uses, estimate traffic per route, benchmark alternatives, and build evaluation infrastructure to measure quality impact. That’s weeks of work with uncertain payoff spread across dozens of endpoints.
The mental math writes itself: “Two weeks analyzing costs for maybe 30% savings, or two weeks shipping a feature customers are asking for.” The feature wins every time.
Metering collapses this. When you can see that one chatbot feature costs $18,000/month on an overpowered model, the optimization becomes a two-hour task, not a two-week audit. The gap between “massive research project” and “swap a model string and validate” is just visibility.
The teams that optimize aren’t more disciplined. They instrumented earlier.
The uncomfortable parallel
Every company that’s been through cloud cost optimization knows this pattern. The early days of AWS were identical: teams spun up instances, ran workloads, and got a single bill at the end of the month. “Cloud spending” was one line item. It took years of tooling before teams could manage what they were spending. Cost allocation tags. Reserved instance planning. Right-sizing recommendations. Each layer made the next optimization possible.
LLM costs are at that same inflection point. The difference is speed. Cloud cost maturity took a decade. Enterprise LLM spending more than doubled in six months, from $3.5 billion to $8.4 billion. The scrutiny is coming faster than the tooling.
The teams that instrument now will be ready. The teams that don’t will keep promising finance they’ll optimize “next sprint,” knowing they can’t start what they can’t see.


