
Cutting LLM Costs in Production: Caching, Model Routing, and Graceful Fallbacks

March 28, 2026 · 10 min read · By Yogendra Singh

LLM costs have a particular cruelty: they scale with usage in a way that is very good for the provider and very bad for your margin. A product that works at 100 users frequently has unsustainable unit economics at 10,000 users because nobody modeled token consumption carefully. The good news is that the reduction techniques are well understood; the bad news is that they take discipline to implement correctly, because cutting corners produces visible quality regressions that erode user trust.

What follows is a concrete description of the cost-reduction stack I've applied in production — most recently in a multi-tenant AI reporting engine processing thousands of LLM calls per hour. None of these techniques are theoretical. Each one has a measurable impact and a specific failure mode to watch for.

Exact-match caching: the easy 30%

Before anything clever, implement exact-match caching. Hash the full request (system message, user message, model name, temperature, and any other parameters that affect output) and cache the response in Redis with an appropriate TTL. The hit rate varies by use case: for report summarization, where the same report is viewed by multiple users, we consistently see 25–35% cache hits. For open-ended chat, it's close to zero. Know your use case before dismissing this as trivially low-impact: in structured output scenarios it is often the single largest cost reduction available.
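
As a minimal sketch of the key construction and lookup described above: the function and class names here are hypothetical, and the TTLCache class is an in-memory stand-in for Redis so that the example runs on its own. In a real deployment the get and set calls would go to Redis with an expiry instead.

```python
import hashlib
import json
import time


def cache_key(model, system_msg, user_msg, temperature, **params):
    """Build a deterministic key from everything that affects the output."""
    payload = {
        "model": model,
        "system": system_msg,
        "user": user_msg,
        "temperature": temperature,
        "params": params,
    }
    # sort_keys makes the serialization independent of argument order.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "llm:" + hashlib.sha256(blob.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry (assumption)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)


def cached_completion(cache, call_model, ttl_seconds, **request):
    """Return (response, cache_hit). call_model runs only on a miss."""
    key = cache_key(**request)
    hit = cache.get(key)
    if hit is not None:
        return hit, True
    response = call_model(**request)
    cache.set(key, response, ttl_seconds)
    return response, False
```

Hashing a canonical JSON serialization rather than a concatenated string avoids accidental collisions between, say, a system message that ends where a user message begins.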

The failure mode to avoid: caching responses that include timestamps, user names, or any dynamic content that makes them incorrect for a different user or a later request. The fix is to extract and inject dynamic content after retrieval from cache, not before. This requires structuring your prompts so that dynamic content is in a predictable, extractable location — which turns out to be a good prompt engineering practice regardless.
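
One way to sketch the inject-after-retrieval idea, assuming a hypothetical double-brace placeholder convention and assuming the model can be instructed to keep those placeholders verbatim in its output (something to verify per model, not a given):

```python
def build_cacheable_prompt(report_text):
    """Prompt holds only stable content; dynamic values stay as placeholders."""
    return (
        "Summarize the report below. Address the reader as {{USER_NAME}} and "
        "refer to the generation time as {{GENERATED_AT}}. Keep both "
        "placeholders exactly as written.\n\n" + report_text
    )


def render(cached_response, dynamic_values):
    """Fill placeholders after the cache lookup, so one entry serves every user."""
    out = cached_response
    for name, value in dynamic_values.items():
        out = out.replace("{{" + name + "}}", value)
    return out
```

Because the user name and timestamp never enter the prompt, they never enter the cache key either, so two users viewing the same report share one cached entry.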

Semantic caching: the harder 20%

Semantic caching answers the question: if two prompts are semantically equivalent, can we serve the same response? Embed the user query with a small, fast embedding model (text-embedding-3-small is excellent for this), store the embedding alongside the cached response, and at query time retrieve the nearest cached embedding above a cosine similarity threshold (we use 0.93 — below that the queries are too different to safely serve the same answer).
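
The lookup step can be sketched as below. This is an illustrative linear scan in plain Python, not the Redis vector search used in production; the SemanticCache class is hypothetical, and the embeddings are assumed to come from whatever embedding model you call upstream.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold=0.93):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response)

    def add(self, embedding, response):
        self._entries.append((list(embedding), response))

    def lookup(self, query_embedding):
        """Return the nearest cached response at or above the threshold, else None."""
        best_score, best_response = -1.0, None
        for embedding, response in self._entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response
        return None
```

The threshold is the knob that trades hit rate against the risk of serving an answer to a question that was not quite the one asked, so it is worth tuning against a labeled sample of your own query pairs.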

The latency cost of this approach is the embedding call plus a vector search. Both are fast: embedding takes 20–40ms and a Redis vector search over a few thousand entries takes under 5ms. That overhead of under ~50ms is paid on every lookup, hit or miss, and is almost always worth it given the cost saved on each hit. We saw an additional 18–22% cost reduction on top of exact-match caching in a reporting context where users often ask semantically similar questions with different phrasing.

Model routing: not every task needs a frontier model

The most expensive mistake in LLM production systems is routing every request to the most capable (and most expensive) model by default. Classification, extraction, templated summarization, and structured output generation are all tasks where a smaller model performs at or near frontier model quality at a fraction of the cost. We benchmark all task types in the system against the full model matrix quarterly and update the routing table based on measured quality scores.

Task classification (which type of report is being generated?): GPT-4o-mini. Cost per 1K calls: ~$0.001.
Structured data extraction from query results: GPT-4o-mini with JSON mode. Reliable, cheap, fast.
Narrative summary for Standard-tier tenants: GPT-4o-mini. Quality acceptable for operational reporting.
Narrative summary for Premium-tier tenants with complex financials: GPT-4o. Quality noticeably higher for nuanced multi-metric analysis.
Anomaly explanation requiring causal reasoning: GPT-4o. The capability gap here is real and measurable.

The routing logic lives in a single configuration table, not in application code. This means the product team can update model assignments without a deployment, and we can run A/B tests by splitting traffic at the routing layer. The selected model is logged with every request so cost attribution is precise: we know exactly how much each model costs per tenant per month.
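
A routing table of this kind might look like the sketch below. The task-type keys, the tier labels, and the cheap-model default for unknown tasks are all assumptions made for illustration; the table is shown as a Python dict, though in practice it would be loaded from a database or config service so it can change without a deployment. The "complex financials" condition on the Premium rule is simplified to tier alone.

```python
ROUTING_TABLE = {
    ("task_classification", "any"): "gpt-4o-mini",
    ("structured_extraction", "any"): "gpt-4o-mini",
    ("narrative_summary", "standard"): "gpt-4o-mini",
    ("narrative_summary", "premium"): "gpt-4o",
    ("anomaly_explanation", "any"): "gpt-4o",
}

# Assumption: unknown task types fall back to the cheaper model.
DEFAULT_MODEL = "gpt-4o-mini"


def select_model(task_type, tenant_tier, table=ROUTING_TABLE):
    """Tier-specific rule first, then the task-wide rule, then the default."""
    for key in ((task_type, tenant_tier), (task_type, "any")):
        if key in table:
            return table[key]
    return DEFAULT_MODEL


def route(task_type, tenant_id, tenant_tier, log, table=ROUTING_TABLE):
    """Pick a model and record the decision for cost attribution."""
    model = select_model(task_type, tenant_tier, table)
    log.append(
        {"tenant_id": tenant_id, "task_type": task_type, "model_name": model}
    )
    return model
```

Passing the table as a parameter keeps A/B tests simple: a fraction of traffic can be handed an alternative table without touching the selection logic.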

Graceful fallback chains

A fallback chain means: if your primary model is unavailable (rate limited, API error, timeout), try a fallback model before returning an error to the user. This is table stakes for production reliability, but the implementation matters. A naive fallback that silently substitutes a lower-quality model for every request when the primary is slow defeats the purpose of routing. The fallback should only activate on transient errors (5xx, 429, timeout) — not on successful responses that are simply slower than expected.

We implement fallback chains as an ordered list per task type: primary provider, fallback provider, cached-response-if-exists, graceful degradation (return the raw data without an AI summary, with a visible UI indicator). That last stage — returning something useful without the AI layer — is the most important and most often skipped. An application that errors when the LLM is unavailable is more fragile than one that degrades gracefully. The user experience of 'your report is here but the AI summary is temporarily unavailable' is vastly better than a 500 error.
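
The ordered chain above can be sketched as follows. TransientError is a hypothetical exception standing in for whatever your provider client raises on 5xx, 429, or timeout; the provider callables and the cached_lookup function are likewise placeholders for real clients.

```python
class TransientError(Exception):
    """Stand-in for 5xx, 429, or timeout errors from a provider client."""


def run_with_fallback(providers, request, cached_lookup, raw_data):
    """Try each stage in order and return (payload, source)."""
    for name, call in providers:
        try:
            return call(request), name
        except TransientError:
            continue  # only transient failures advance the chain
    cached = cached_lookup(request)
    if cached is not None:
        return cached, "cache"
    # Last stage: useful output without the AI layer, flagged for the UI.
    degraded = {"data": raw_data, "ai_summary": None, "ai_unavailable": True}
    return degraded, "degraded"
```

Non-transient errors (a malformed request, for example) are deliberately not caught, since retrying them against a different provider would only hide a bug. Returning the source alongside the payload lets the caller log which stage served the request and show the UI indicator when the degraded stage is reached.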

Token budget management and prompt optimization

Token budget management means never sending more tokens than necessary. This sounds obvious and is surprisingly hard to enforce without tooling. System prompts bloat over time as engineers add context. User messages that include full document text when a summary would suffice are common. We added a pre-call token estimator (using tiktoken for OpenAI models) that logs a warning when any call exceeds a configured threshold, and a hard ceiling that truncates the input and logs an alert when it would exceed the model's context limit.
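
A pre-call check along these lines could look like the sketch below. The whitespace tokenizer is a crude stand-in so the example runs without dependencies, and it will not match real token counts; an encoding object from tiktoken, which exposes encode and decode, should fit the same interface. The function name and thresholds are illustrative.

```python
import logging

logger = logging.getLogger("token_budget")


class WhitespaceTokenizer:
    """Crude stand-in: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def enforce_budget(text, tokenizer, warn_threshold, hard_ceiling):
    """Return (text, token_count), truncating if the hard ceiling is exceeded."""
    tokens = tokenizer.encode(text)
    count = len(tokens)
    if count > hard_ceiling:
        logger.error("input of %d tokens truncated to %d", count, hard_ceiling)
        return tokenizer.decode(tokens[:hard_ceiling]), hard_ceiling
    if count > warn_threshold:
        logger.warning(
            "input of %d tokens exceeds threshold %d", count, warn_threshold
        )
    return text, count
```

Truncating from the end is the simplest policy, not necessarily the right one: for prompts where the question follows the document, cutting the tail would drop the part that matters most.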

Prompt optimization (reducing system prompt length without losing necessary context) is a recurring activity, not a one-time exercise. We do a quarterly prompt audit: measure the token count of every system prompt in production, identify the longest 20%, and work through them manually to find redundant instructions, repeated context, and verbose phrasing. A 30% reduction in system prompt length cuts the system prompt's share of input token cost by 30% on every call that uses that prompt; the overall saving depends on how much of each call's input the system prompt represents.

LLM cost optimization is not about being cheap — it's about making the cost curve linear with value delivered, rather than with raw request volume. A system that spends a frontier model token budget on a classification task that a smaller model handles equally well is wasting money that could fund the capability improvements users actually want.

Measuring impact: the cost attribution stack

None of this is manageable without cost attribution. Every LLM call in our system emits a structured log event with: tenant_id, task_type, model_name, input_token_count, output_token_count, cache_hit (boolean), latency_ms, and cost_usd (computed from the model's per-token pricing). These events land in Datadog Logs, where a daily aggregation pipeline produces dashboards for cost by tenant, cost by task type, and cache hit rate by task type. Without this visibility, optimization is guesswork.
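
As a sketch of what one such event might look like in code: the field names follow the list above, while the model names and prices in the pricing table are made-up placeholders, and treating a cache hit as zero cost is a simplifying assumption (a semantic cache hit still pays for the embedding call).

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical prices in USD per 1M tokens; substitute your provider's rates.
PRICING_PER_MILLION = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


@dataclass
class LLMCallEvent:
    tenant_id: str
    task_type: str
    model_name: str
    input_token_count: int
    output_token_count: int
    cache_hit: bool
    latency_ms: float
    cost_usd: float


def compute_cost_usd(model_name, input_tokens, output_tokens, cache_hit,
                     pricing=PRICING_PER_MILLION):
    if cache_hit:
        return 0.0  # assumption: a cache hit incurs no model charge
    rates = pricing[model_name]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


def build_event_json(tenant_id, task_type, model_name, input_tokens,
                     output_tokens, cache_hit, latency_ms):
    """Serialize one call as a single JSON line for the log pipeline."""
    event = LLMCallEvent(
        tenant_id=tenant_id,
        task_type=task_type,
        model_name=model_name,
        input_token_count=input_tokens,
        output_token_count=output_tokens,
        cache_hit=cache_hit,
        latency_ms=latency_ms,
        cost_usd=compute_cost_usd(
            model_name, input_tokens, output_tokens, cache_hit
        ),
    )
    return json.dumps(asdict(event), sort_keys=True)
```

Computing cost_usd at emit time, rather than in the aggregation pipeline, means a later price change does not silently rewrite the cost of historical calls.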

Reducing LLM infrastructure costs is one of the more tractable engineering problems in AI products right now — the techniques exist and they work, but they require methodical implementation. If you're facing a first LLM bill that doesn't match your projections, or you want to build cost controls into your LLM infrastructure from the start, that's a good fit for a short advisory or contract engagement.
