MMatt Goren
← AI hub

How to Cut Your LLM Costs (Without Cutting Quality)

Prompt caching, batching, model routing, leaner context, output caps — the levers that drop your AI bill without touching output quality.

By Matt Goren · Updated June 25, 2026 · 8 min read

The fastest way to a scary AI bill is to build the thing first and think about cost never. It works in the demo, it works for the first hundred users, and then volume arrives and the invoice does something you have to explain to someone. The good news is that most LLM spend is genuinely wasteful — you're paying full price to reprocess the same context over and over, running everything through your most expensive model, and shipping more tokens in and out than the task needs. None of that is quality. You can cut it and your outputs stay identical. This is the order I work through the levers, cheapest wins first.

Start by measuring cost per task

You cannot cut a bill you cannot see. Before changing anything, get to a number per task — not a monthly lump sum, but "one product summary costs X, one support classification costs Y." Every API response reports token usage: input tokens, output tokens, and (once caching is on) how many input tokens were served from cache versus processed fresh. That's all you need. Multiply by your provider's published per-token rates and you have the cost of that call.

The piece people skip is tagging. Attach a label to each call for the feature it belongs to, then roll the costs up by tag. Without that you know you spent $4,000 last month; with it you know $3,200 of it was one chatty feature you could fix in an afternoon. Measure first so that when you pull a lever you can prove it moved the number, not just assume it did. This is the same discipline I lean on for quality — see knowing your AI works — applied to dollars.

Prompt caching: the biggest free win

For most apps this is the largest single lever, so it goes first. If you send a big stable block of context on every request — a long system prompt, a reference document, a set of few-shot examples, a tool list — the provider normally reprocesses every token of it each time. Prompt caching lets it reuse the work it already did on that prefix. Cached reads cost roughly a tenth of full-price input tokens. Same output, fraction of the input bill.

The one rule that makes or breaks it: caching is a prefix match. The provider caches from the start of the prompt up to a marked point, and any byte change before that point invalidates everything after it. So put your stable content first and your volatile content last. The classic mistake is interpolating something like Current time: 14:32:07 or a per-request ID into the top of the system prompt — that single changing string means nothing ever caches, and you won't get an error, just a bill that never drops.

[ tools ]            ← stable, goes first
[ system prompt ]    ← stable
  --- cache marker ---
[ the user's actual question ]  ← changes every time, goes last

Verify it's working by reading the cached-token count off the response. If it stays at zero across requests that should share a prefix, something volatile is sneaking into the front — hunt it down. I go deep on this in context engineering, because where you place things in the prompt is the same skill whether you're optimizing for quality or for cost.

Route work to the right-sized model

The second big lever is to stop running everything through your most expensive model. Providers ship a ladder — in the Claude family that's Opus 4.8 at the top for the hardest reasoning, Sonnet 4.6 as the balanced workhorse, Haiku 4.5 as the fast cheap tier, with Fable 5 for the most demanding long-horizon work. Every serious provider has the same shape. Most of your traffic does not need the top rung.

The pattern I reach for is fan out cheap, escalate to frontier. Run the high-volume, well-scoped work — classification, extraction, tagging, routing, first drafts — on the small fast model. Have it flag the cases it's unsure about, and send only those up to the frontier model. Most inputs get handled cheaply; only the genuinely hard minority pays the premium. You land near frontier-quality on the hard cases while keeping the bulk of your bill at small-model rates. I unpack the full decision in big model vs small model — the short version is to anchor on the cost of being wrong. Cheap-and-recoverable goes to the small model; expensive-and-irreversible goes to the frontier one.

This is the one lever that can touch quality, so validate it. Run a real eval on your own data before trusting the swap in production. The win is real, but "it felt fine in two test cases" is not how you find out.

Batch the work nobody's waiting on

A surprising share of LLM work isn't latency-sensitive. Nightly document processing, bulk classification of a backlog, generating embeddings for an entire catalog, backfilling summaries — nobody is staring at a spinner for any of it. Batch processing runs those requests asynchronously at a steep discount in exchange for giving up real-time speed. Same model, same quality, materially lower price.

The mental model is two buckets. "Someone is waiting" stays on the normal synchronous path. "This can finish whenever" goes to batch. Splitting your workload that way is often a one-line routing change that quietly removes a big chunk of spend. The catch is that results come back out of order, so key each result to the request it answers rather than assuming position.

Trim the context you're shipping

You pay for every input token, so dead weight in the prompt is dead money — and worse, bloated context also degrades quality, so trimming it wins twice. The usual offenders:

  • Whole documents when a slice would do. If only one section is relevant, retrieve and send that section, not the entire file. Pulling the right slice is the core of good retrieval — see RAG vs fine-tuning vs long context.
  • Conversation history that never gets pruned. In a long chat or agent loop, raw history grows without bound. Summarize or clear stale turns instead of resending everything forever.
  • Ten few-shot examples when three carry the pattern. Examples are powerful and also expensive on every single call. Keep the ones that change behavior; cut the rest.
  • Verbose system prompts. Tighten them once and every future call is cheaper. This is editing, not deleting — you're removing words that don't earn their tokens.

The instinct to "give the model everything just in case" is the costly one. More context is not more better; it's more expensive and often less focused.

Cap output deliberately

Output tokens usually cost several times more than input tokens, so what comes back matters even more than what goes in. Two cheap moves. First, set a sensible max_tokens ceiling per task so a runaway generation can't balloon — but set it with headroom, because capping it so low that answers get chopped mid-sentence forces a retry and costs you more. Second, ask for less in the prompt. "Answer in one sentence" or "return just the JSON" genuinely produces fewer output tokens, not just a tidier vibe. For structured tasks, constraining the model to a strict output format both shrinks the response and removes the rambling preamble you were paying for.

The order that actually works

Pull the free levers before the tradeoff ones. Caching and batching change nothing about output — do them first and reflexively. Trimming context cuts cost and improves quality at the same time, so it's pure upside. Model routing is the powerful one with a real edge, so save it for after the free wins and validate it with an eval. And keep the cost-per-task number in front of you the whole time, because the entire game is making a decision, watching the number move, and keeping what worked. Done in that order, you usually find the bill was mostly waste — and cutting waste was never a quality decision in the first place.

FAQ

What's the single biggest lever for cutting LLM costs?

Prompt caching, for most apps. If you send the same large block of context — a system prompt, a document, a set of examples — on request after request, caching lets the provider reuse the work it already did on that prefix instead of reprocessing it every time. Cached reads cost a small fraction of full-price input tokens. When the same context repeats across many calls, this routinely takes the biggest bite out of the bill, and it changes nothing about the output. After that, the next biggest lever is model routing: stop paying frontier prices for work a cheaper tier handles fine.

Does cutting costs always mean cutting quality?

No, and that's the whole point. Caching, batching, and trimming dead context lower the bill with zero quality impact — you're paying less for the exact same output. Model routing only touches quality if you route a hard task to a model that can't handle it, which a confidence check and an escalation path prevent. The cuts that hurt quality are the crude ones: truncating context blindly or capping output so low that answers get chopped mid-sentence. Pull the free levers first and you rarely have to make a real tradeoff.

When should I use batch processing instead of normal API calls?

Whenever the work isn't latency-sensitive — nobody is sitting there waiting for the response. Batch processing runs your requests asynchronously at a large discount in exchange for giving up real-time speed. Overnight document processing, bulk classification, generating embeddings for a whole catalog, backfilling summaries — all perfect. Anything a user is actively waiting on stays on the normal synchronous path. Split your workload into "someone's waiting" versus "this can finish whenever" and route the second bucket to batch.

How do I measure cost per task?

Read the token usage off every API response — input tokens, output tokens, and cached-read tokens are all reported — and multiply by your provider's per-token rates. Tag each call with what task it belongs to so you can roll the numbers up per feature, not just one big monthly total. The goal is a number like "this summary costs $0.004" so you can see which features are expensive and whether a change actually helped. You can't optimize a bill you can't break down.

Will a smaller model save money without wrecking output?

On well-scoped tasks, yes — classification, extraction, tagging, routing, and first-draft generation run great on a fast cheap tier for a fraction of frontier cost. The trick is matching the model to how hard the task actually is. Keep the frontier model for genuinely hard, high-stakes calls and fan the high-volume easy work out to a small model. Validate the swap with a real eval on your own data before you trust it in production.

#cost#building#llm
Want to apply this right now?

Use the free, no-API prompt generators to put it into practice.

Open Prompt Studio →
Keep reading