LLM caches: saving tokens without dropping quality
Updated: 2026-05-03
Placing a caching proxy between your application and the language-model provider is one of those optimizations that looks obvious the moment someone mentions it, yet many teams take months to implement it, because the correct design has more edge cases than it first appears. The difference between a well-built cache and a poorly built one can mean 70% token savings versus a degraded user experience nobody spots until it's too late.
Key takeaways
- Repeated or near-repeated requests are the basis of all cache savings.
- Three patterns cover the main cases: normalized exact, semantic, and hierarchical caching.
- Semantic caching captures far more traffic but introduces false-equivalence risk.
- Large providers offer native prompt caching that removes part of the work.
- Without quality metrics run regularly, a cache can degrade the application invisibly.
The starting point
The thesis is simple: many requests your application sends to the model are repeated or near-repeated. If you detect the repetition and return the previous answer without calling the provider, you save the tokens for that request. In applications with frequent questions, documentation conversations, or recurring templates, repetition percentages can be very high.
The complication appears when you define what "repeated request" means. The trivial case is input that is identical byte for byte. But most interesting requests carry small variations: different whitespace, substituted names, different dates. An unnormalized exact cache misses those near-duplicates and leaves savings on the table.
Exact cache
The simplest pattern: hash the full request text as the key, store the answer as the value. It fits in twenty lines on top of Redis or Memcached and works well when requests come from automated systems with deterministic inputs, or when users ask exactly the same thing repeatedly.
Before hashing, apply normalization: trim, collapse spaces, selective lowercasing. This improves hit rate without changing semantics and is almost always worth it.
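A minimal sketch of the pattern on top of Redis; the key prefix, TTL value, and normalization rules are illustrative choices, not prescriptions:

```python
import hashlib

import redis  # assumes a Redis instance reachable at localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 3600  # illustrative; tune per answer type

def normalize(text: str) -> str:
    # Trim, collapse whitespace runs, lowercase. The article calls
    # lowercasing "selective"; it is applied globally here for brevity.
    return " ".join(text.split()).lower()

def cache_key(prompt: str) -> str:
    return "llmcache:" + hashlib.sha256(normalize(prompt).encode()).hexdigest()

def get_cached(prompt: str) -> str | None:
    return r.get(cache_key(prompt))

def put_cached(prompt: str, answer: str) -> None:
    r.setex(cache_key(prompt), TTL_SECONDS, answer)
```

Using setex rather than set gives every entry an expiry for free, which matters for the invalidation discussion below.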
Semantic cache
The next step is recognizing similar-but-not-identical requests. “What are support hours?” and “At what time does support operate?” are the same question in different words. A semantic cache captures this equivalence by embedding the request with a small model, searching a vector index for close prior requests, and reusing the answer if similarity crosses a threshold.
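A sketch of that lookup, assuming sentence-transformers for the small embedding model and a plain in-memory list standing in for a real vector index (FAISS, pgvector, and similar); the 0.92 threshold is a made-up starting point, not a recommendation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, cheap embedder
THRESHOLD = 0.92  # illustrative; calibrate empirically on your own traffic

# In production this would be a vector index; a list keeps the sketch readable.
_store: list[tuple[np.ndarray, str]] = []

def _embed(text: str) -> np.ndarray:
    v = model.encode(text)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine

def semantic_lookup(prompt: str) -> str | None:
    q = _embed(prompt)
    best_sim, best_answer = -1.0, None
    for vec, answer in _store:
        sim = float(np.dot(q, vec))
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_answer if best_sim >= THRESHOLD else None

def semantic_store(prompt: str, answer: str) -> None:
    _store.append((_embed(prompt), answer))
```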
The benefits are enormous; the risks, equally so. False equivalence is the main trap: two semantically similar questions can demand different answers because of a subtle detail. Threshold calibration is empirical and must be repeated whenever the request distribution shifts.
A useful mitigation is verifying the cached answer with a cheap model call before returning it: "is this answer correct for this request? Yes or no". It adds cost but protects against false positives.
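A sketch of that gate, using OpenAI's SDK as one concrete option; the model choice and the prompt wording are assumptions to adapt:

```python
from openai import OpenAI

client = OpenAI()

def answer_still_fits(prompt: str, cached_answer: str) -> bool:
    # Ask a cheap model to confirm the cached answer before reuse.
    # gpt-4o-mini is an illustrative choice; any inexpensive model works.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=3,
        messages=[{
            "role": "user",
            "content": (
                "Question: " + prompt + "\n"
                "Candidate answer: " + cached_answer + "\n"
                "Does the candidate answer correctly answer the question? "
                "Reply with exactly YES or NO."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```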
Hierarchical cache
In applications with long conversations, the most interesting pattern is caching not the full request but its structural parts: the system instruction, the application context, retrieved documents. Large providers have introduced native prompt-caching mechanisms that do this in an integrated way: mark a fragment as cacheable, and if a later request reuses it, that part is charged at a reduced rate.
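As one concrete example, Anthropic's Messages API exposes this through a cache_control marker on a content block; the model name and system text here are illustrative, and other providers differ in mechanics (OpenAI, for instance, caches long shared prefixes automatically, without markers):

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_INSTRUCTION = "..."  # the long, stable part of the prompt (placeholder)

def ask(user_question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_INSTRUCTION,
                # Mark the stable fragment as cacheable; later requests
                # reusing this exact prefix are billed at a reduced rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_question}],
    )
    return response.content[0].text
```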
This hierarchical cache complements the earlier ones rather than replacing them. A well-designed application can combine exact local cache for trivial requests, semantic cache for equivalences, and provider prompt cache for the shared part of long contexts. Each layer captures a different type of repetition and savings add up.
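Wired together, the lookup order might read like this, reusing the sketches above; call_provider is a hypothetical stand-in for whatever wraps your provider client:

```python
def call_provider(prompt: str) -> str:
    # Hypothetical stand-in for the real provider call
    # (e.g. the ask() sketch above).
    raise NotImplementedError

def lookup(prompt: str) -> str:
    # Layer 1: normalized exact cache (cheapest, lowest risk).
    if (hit := get_cached(prompt)) is not None:
        return hit
    # Layer 2: semantic cache, gated by the cheap verification call.
    if (hit := semantic_lookup(prompt)) is not None and answer_still_fits(prompt, hit):
        return hit
    # Layer 3: the provider; its prompt cache still discounts the
    # shared prefix even when both local layers miss.
    answer = call_provider(prompt)
    put_cached(prompt, answer)
    semantic_store(prompt, answer)
    return answer
```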
Operational traps
- Privacy. If the application caches answers to requests containing personal data, that data ends up stored somewhere unexpected. Clear policies are required: what is cached, for how long, and with what deletion mechanisms.
- Invalidation. An answer cached three days ago may be stale if the underlying data changed. You need short TTLs for time-sensitive answers and explicit invalidation when the underlying data changes (see the sketch after this list).
- Silent degradation. A malfunctioning cache returns reasonable-looking answers. Without quality metrics run regularly on cached traffic samples, teams may believe everything is fine for months.
- Treating the cache as a patch. A cache changes system properties: latency, consistency, behavior under failure. Design for those changes from the start; don't bolt a cache on as a last resort when the bill spikes.
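One way to make invalidation explicit, extending the Redis exact cache above with a reverse index from data sources to the cache keys that depend on them; the tagging scheme is an assumption for illustration, not a standard:

```python
def put_cached_tagged(prompt: str, answer: str, sources: list[str]) -> None:
    # Store the answer as before, plus a reverse index so that a change
    # in any data source can evict every answer that depends on it.
    key = cache_key(prompt)
    r.setex(key, TTL_SECONDS, answer)
    for source in sources:
        r.sadd("llmcache:src:" + source, key)

def invalidate_source(source: str) -> None:
    # Called from whatever code path updates the underlying data.
    tag = "llmcache:src:" + source
    keys = r.smembers(tag)
    if keys:
        r.delete(*keys)
    r.delete(tag)
```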
Metrics that matter
Three metrics let you evaluate cache health:
- Hit rate. What percentage of requests were served from cache. Below 10% the complexity isn't worth it; above 40% it's clearly profitable.
- Effective token savings. Hit rate × average tokens per request × price per token, converted to euros or dollars (a worked example follows this list). This is the real number that matters.
- Perceived quality. Automated evaluation on real requests with and without cache, comparing results. If this metric isn’t watched, the cache can degrade the application invisibly.
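A back-of-the-envelope version of the second metric, with invented numbers to make the units explicit:

```python
# Illustrative numbers only: plug in your own traffic and pricing.
requests_per_day = 100_000
hit_rate = 0.30                    # 30% of requests served from cache
avg_tokens_per_request = 2_500     # prompt + completion tokens avoided per hit
price_per_million_tokens = 3.00    # dollars; depends on model and provider

tokens_saved = requests_per_day * hit_rate * avg_tokens_per_request
daily_savings = tokens_saved / 1_000_000 * price_per_million_tokens
print(f"{tokens_saved:,.0f} tokens/day ≈ ${daily_savings:,.2f}/day")
# 75,000,000 tokens/day ≈ $225.00/day
```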
Quality monitoring is especially critical when introducing semantic caching. The same metrics discipline applies when combining the cache with an inference router to maximize savings. For applications relying on AI agents or conversational RAG pipelines, the provider’s long-context prompt cache is especially valuable.
The production decision
The question I'd use as a gate before deploying is whether the team knows how to detect that the cache is returning bad answers. If not, the cache isn't production-ready even if it technically works. If it does, with automated metrics and periodic reviews, a cache is a massive lever.
Starting with a normalized exact cache is prudent: immediate savings with minimal risk. Adding the provider's prompt cache where applicable captures another slice without extra risk. Only move to a semantic cache once the necessary measurement discipline has been built. That progression separates caches that save money from caches that save money at the cost of something else.