The hidden cost of always-on LLM calls
Token spend is the obvious number on the invoice. The expensive ones - latency, retries, context bloat, eval drift, oncall load - never show up there.
When a team starts watching their LLM bill, they tend to watch the wrong number. Tokens are easy to graph, so tokens become the metric. Meanwhile, five other costs quietly stack up - and four of them never show up on a vendor invoice at all, while the fifth, retries, hides undifferentiated inside the token line. By the time someone notices, the project has become a budget conversation instead of an engineering conversation, and the answer is always “use a smaller model,” which usually means “ship a worse product.”
We have been on the wrong side of every one of these costs at some point. This is the field guide we wish someone had handed us before we built our first production AI system. If you are running an LLM-backed product, or planning to, here is what actually shows up in the budget - and what to do about each cost.
Tokens are a lagging indicator
Token cost is real, but it is downstream of every decision you make about latency, retries, context, and prompt design. If you focus the optimization conversation on tokens, you will end up squeezing the wrong levers. Smaller models. Shorter prompts. Cutting examples that were doing the work of three pages of instructions. Each of those moves saves a few cents per call and quietly degrades quality, which then shows up as retries, which costs more tokens than you saved.
The right move is to leave token cost alone for the first round and chase the upstream costs. They are bigger and cheaper to fix.
Latency is a feature cost
A two-second model call is a two-second product moment. If it sits in front of the user, the product feels slow. If it sits inside a workflow, you cannot parallelize the next step. If it sits inside an agent loop, you pay the latency once per iteration and the user feels it as a multiplier.
Latency is rarely on the dashboard during a pilot because the pilot runs five times a day. In production, p95 latency is the difference between a feature people use and a feature people abandon. We have replaced more “smart” calls with cached lookups, deterministic parsers, and small embedded classifiers than we have added new ones. Every replacement was strictly better on latency, often better on quality, and almost always cheaper. The model call is not free. Treat it like a network call to a slow third party, because that is what it is.
Practical fixes (a sketch of the first two follows the list):
- Cache anything that is determined by an input you have seen before. The cheapest model call is the one you do not make.
- Run multiple independent model calls in parallel where the control flow allows. Sequential calls inside a single request are usually unnecessary.
- Stream output to the user where the surface supports it. Perceived latency drops sharply even though the total time is the same.
- For high-volume classification or extraction tasks, ask whether a small fine-tuned or even non-LLM model would meet the bar. Often it will, at one-tenth the latency.
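To make the first two fixes concrete, here is a minimal sketch in Python. The `call_model` coroutine is a hypothetical stand-in for whatever provider SDK you use; the exact-match cache and the `asyncio.gather` fan-out are the point.

```python
import asyncio
import hashlib

async def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your provider's SDK call.
    raise NotImplementedError("wire up your model client here")

_cache: dict[str, str] = {}

async def cached_call(prompt: str) -> str:
    # Exact-match cache: the cheapest model call is the one you do not make.
    # Normalize inputs before hashing if they vary in whitespace or field order.
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = await call_model(prompt)
    return _cache[key]

async def handle_request(doc: str) -> list[str]:
    # Two independent judgments run concurrently instead of back to back,
    # so the request pays max(latency) rather than sum(latency).
    return await asyncio.gather(
        cached_call(f"Summarize:\n{doc}"),
        cached_call(f"Extract the key entities:\n{doc}"),
    )
```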
Retries hide in the average
A 1% failure rate sounds fine. At 50,000 calls a day, that is 500 retries - each one paying full token cost, often with a longer prompt because the retry includes the failure, and each one taking longer than the original call because the retry usually escalates to a slower model or a richer context.
Retries also hide in the average. Your dashboard says the average call is 1.4 seconds. The 99th percentile is 11 seconds because every retry storm pulls the tail to the right. The user who hits a retry storm is the user who never comes back. Average latency is a comforting number that rarely tells you what your product feels like to use.
Three things to instrument before you ship (a sketch follows the list):
- Retry count per request, not just per call. One request that retried six times is much more interesting than six requests that each retried once.
- Cost of retries as a separate budget line. If 18% of your spend is retries, that is your top optimization target, not the prompt length.
- Failure mode taxonomy. Group retries by what triggered them: timeout, malformed output, validation failure, tool error. The fix for each one is different, and aggregating them hides the answer.
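A sketch of what that instrumentation can look like, assuming an in-process trace object rather than a real metrics pipeline - in production you would emit these as metrics to whatever you already run:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RequestTrace:
    request_id: str
    retries_by_cause: Counter = field(default_factory=Counter)
    retry_cost_usd: float = 0.0

    def record_retry(self, cause: str, cost_usd: float) -> None:
        # cause comes from a small fixed taxonomy: "timeout",
        # "malformed_output", "validation_failure", "tool_error".
        self.retries_by_cause[cause] += 1
        self.retry_cost_usd += cost_usd

def report(traces: list[RequestTrace], total_spend_usd: float) -> None:
    retry_spend = sum(t.retry_cost_usd for t in traces)
    causes = sum((t.retries_by_cause for t in traces), Counter())
    worst = max(traces, key=lambda t: sum(t.retries_by_cause.values()))
    print(f"retry share of spend: {retry_spend / total_spend_usd:.1%}")
    print(f"retries by cause: {dict(causes)}")
    print(f"worst request: {worst.request_id} "
          f"({sum(worst.retries_by_cause.values())} retries)")
```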
Context engineering costs more than the tokens it adds
Context grows. A prompt that started at 800 tokens is, six months later, 4,400 tokens because every team that touched it added their own instructions, examples, and edge case handling. Each addition felt cheap. The cumulative cost is not.
The token cost of context bloat is the small problem. The real costs:
- Prompt-level interference. Long prompts interact with each other in ways that are not obvious. An instruction added to fix one edge case will silently change behavior on three other edge cases. You will not notice unless you have an eval suite watching.
- Iteration cost. A 4,400-token prompt is much more expensive to iterate on than an 800-token one. Every prompt change costs more in API spend, takes longer to run on the eval set, and is harder to reason about.
- Cache invalidation. Most LLM providers offer prompt caching, which makes the static prefix of a prompt nearly free. As soon as your prompt is being edited weekly, you are constantly invalidating the cache and paying full price.
Treat the prompt like code. Refactor it. Delete instructions that are not earning their keep. Move stable content to the cached prefix and put dynamic content at the end where the cache is not in play. A prompt audit twice a year will pay for itself in a week.
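A minimal sketch of the cache-friendly layout, assuming a chat-style API. The caching mechanics differ by provider, but the principle - byte-stable prefix first, dynamic content last - does not:

```python
SYSTEM_PREFIX = """\
You are a support-ticket triage assistant.
[... the stable instructions, policies, and worked examples live here,
edited rarely and deliberately ...]
"""  # byte-stable prefix: eligible for the provider's prompt cache

def build_prompt(ticket_text: str, customer_tier: str) -> list[dict]:
    # Per-request content goes last, so variable data never invalidates
    # the cached prefix.
    return [
        {"role": "system", "content": SYSTEM_PREFIX},
        {"role": "user",
         "content": f"Customer tier: {customer_tier}\nTicket:\n{ticket_text}"},
    ]
```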
Evals drift the day you stop running them
The most expensive bug we have seen in an LLM system was not a hallucination. It was a prompt that quietly degraded after a model upgrade. Nobody noticed for six weeks because nobody was running the eval suite. The cost showed up in churn, not the OpenAI bill. By the time the regression was identified, we had to rebuild trust with two enterprise customers.
Eval drift is the cost that scares us the most because it does not show up on any operational dashboard. The system is up. Latency is fine. Token spend is normal. Outputs are being produced. They are just slightly worse than they were, in ways the model can produce convincingly enough that nothing flags it.
The fix is not complicated. It is just not glamorous (a minimal harness sketch follows the list):
- Maintain a fixed eval set of 100 to 500 representative examples with known good outputs or rubric-based grading.
- Run the eval on every prompt change, every model upgrade, and on a weekly schedule even when nothing changed.
- Post the pass rate somewhere a human will see it. A channel, a dashboard, a weekly email. Not a folder nobody opens.
- Treat a drop in pass rate as a P1, not a P3. Eval regressions compound silently.
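A minimal harness sketch, assuming a JSONL eval file with `input` and `expected` fields and exact-match grading. Real suites usually mix exact checks with rubric-based or model-graded scoring:

```python
import json
from pathlib import Path

def run_evals(eval_path: Path, generate) -> float:
    # generate: your prompt + model pipeline, as a plain callable.
    cases = [json.loads(line) for line in eval_path.read_text().splitlines()]
    passed = sum(
        1 for case in cases
        if generate(case["input"]).strip() == case["expected"].strip()
    )
    pass_rate = passed / len(cases)
    # Post the number somewhere a human will see it, then alert on drops.
    print(f"eval pass rate: {pass_rate:.1%} ({passed}/{len(cases)})")
    return pass_rate
```

Wire this into CI on every prompt change and model upgrade, plus a weekly scheduled run, and alert when the rate drops below a floor.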
For more on this, see our note on why most AI pilots stall before production - eval discipline is one of the items on the production-readiness checklist that separates the systems that ship from the ones that languish.
The oncall cost nobody priced in
An LLM-backed system in production has the same operational load as any other production service, plus a few categories of incident the typical SRE has never seen before:
- Provider outages. When your provider goes down, your retry and fallback policy matters more than your prompt.
- Provider deprecations. A model gets retired. Your team now has two weeks to revalidate the eval suite on a new model and explain to stakeholders why the output is slightly different.
- Quality incidents. A specific class of input is producing bad output. Diagnosing this is harder than diagnosing a 500 error because the system did not fail - it just answered wrong.
- Prompt injection or content incidents. A user finds a way to make the system say something it should not. This is now your problem.
None of this is in the pilot scope. All of it shows up in production. The team that owns the system needs runbooks, rotation, and authority to make judgment calls in the moment. Pricing this in early - as headcount, not as a platform line item - is one of the things that separates teams that successfully run LLM systems from teams that end up paying a vendor more to do it for them.
What to actually watch
The dashboard we set up for any production LLM system has five charts. None of them are token spend. (The first two computations are sketched after the list.)
- Cost per successful action. Total spend divided by completed user-visible outcomes. This is the only unit-economics number that matters. Tokens are a downstream factor.
- p95 latency on the user-facing path. Not average. p95 is what users feel.
- Eval pass rate over time. Run on a schedule, posted publicly. The leading indicator of quality regressions.
- Retry rate, broken down by failure mode. Aggregated retries are a comfort blanket. The taxonomy is the actionable view.
- Spend share of cached vs. uncached calls. Tells you whether your prompt cache is doing its job and warns you when prompt churn is eating the savings.
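The first two computations are simple enough to sketch. The field names - `cost_usd`, `completed_user_action`, `latency_ms` - are assumptions about your request log, not a real schema:

```python
import statistics

def cost_per_successful_action(requests: list[dict]) -> float:
    # Total spend over completed user-visible outcomes, per the list above.
    total_spend = sum(r["cost_usd"] for r in requests)
    successes = sum(1 for r in requests if r["completed_user_action"])
    return total_spend / max(successes, 1)

def p95_latency_ms(requests: list[dict]) -> float:
    latencies = [r["latency_ms"] for r in requests]
    # quantiles(n=20) returns the 5%..95% cut points; the last one is p95.
    return statistics.quantiles(latencies, n=20)[-1]
```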
Token spend is on the invoice. You will see it whether you watch for it or not. The point of the dashboard is to surface the costs that the invoice will not.
The cheapest LLM call is the one you do not make
We end up giving this advice in some form on every project, so we will write it here. Before adding a model call, ask whether a cache, a small classifier, a deterministic parser, or a hand-written rule would do the job. Often one of them will. Models are good at judgment under ambiguity. They are wasteful for everything else.
Architecture decisions like this are why we keep insisting on workflows over agents for most operational problems. A workflow lets you put a cheap step where a cheap step works and reserve the model for the call that genuinely needs it. An agent puts a model in the middle of every decision, including the ones a hash table would have answered.
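A toy sketch of that workflow shape - canned answer first, then a deterministic parser, then the model. Every rule and name here is illustrative:

```python
import re

CANNED_ANSWERS = {
    "reset password": "Use the reset link on the sign-in page.",
}

ORDER_ID = re.compile(r"order\s*#?(\d{6,})", re.IGNORECASE)

def look_up_order(order_id: str) -> str:
    # Placeholder for a plain database lookup.
    return f"Status for order {order_id}: ..."

def answer(query: str, call_model) -> str:
    normalized = query.strip().lower()
    if normalized in CANNED_ANSWERS:        # hash-table step: free, instant
        return CANNED_ANSWERS[normalized]
    match = ORDER_ID.search(query)
    if match:                               # deterministic parser step
        return look_up_order(match.group(1))
    return call_model(query)                # judgment under ambiguity only
```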
Token cost is the part of the bill you can see. The rest of the bill is the part that decides whether the project survives the year.