LLM caches: saving tokens without dropping quality
Updated: 2026-05-03
Placing a caching proxy between your application and the language-model provider is one of those optimizations that looks obvious the moment someone mentions it, yet many teams take months to implement it, because the correct design has more edge cases than it first appears. The difference between a well-built cache and a poorly built one can mean 70% token savings versus a degraded user experience nobody spots until it's too late.
Key takeaways
- Repeated or near-repeated requests are the basis of all cache savings.
- Three patterns cover the main cases: normalized exact, semantic, and hierarchical caching.
- Semantic caching captures far more traffic but introduces false-equivalence risk.
- Large providers offer native prompt caching that removes part of the work.
- Without quality metrics run regularly, a cache can degrade the application invisibly.
The starting point
The thesis is simple: many requests your application sends to the model are repeated or near-repeated. If you detect the repetition and return the previous answer without calling the provider, you save the tokens for that request. In applications with frequent questions, documentation conversations, or recurring templates, repetition percentages can be very high.
The complication appears when you define what "repeated request" means. The trivial case is input that is identical byte for byte. But most interesting requests carry small variations: different whitespace, substituted names, different dates. An unnormalized exact cache misses those near-duplicates and leaves savings on the table.
Exact cache
The simplest pattern: hash the full request text as the key, store the answer as the value. It fits in twenty lines on top of Redis or Memcached and works well when requests come from automated systems with deterministic inputs, or when users ask exactly the same thing repeatedly.
Before hashing, apply normalization: trim, collapse spaces, selective lowercasing. This improves hit rate without changing semantics and is almost always worth it.
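A minimal sketch of the pattern on top of Redis; the key prefix, TTL value, and normalization rules are illustrative choices, not prescriptions:

```python
import hashlib

import redis  # assumes a Redis instance reachable at localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 3600  # illustrative; tune per answer type

def normalize(text: str) -> str:
    # Trim, collapse whitespace runs, lowercase. The article calls
    # lowercasing "selective"; it is applied globally here for brevity.
    return " ".join(text.split()).lower()

def cache_key(prompt: str) -> str:
    return "llmcache:" + hashlib.sha256(normalize(prompt).encode()).hexdigest()

def get_cached(prompt: str) -> str | None:
    return r.get(cache_key(prompt))

def put_cached(prompt: str, answer: str) -> None:
    r.setex(cache_key(prompt), TTL_SECONDS, answer)
```

Using setex rather than set gives every entry an expiry for free, which matters for the invalidation discussion below.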
Semantic cache
The next step is recognizing similar-but-not-identical requests. “What are support hours?” and “At what time does support operate?” are the same question in different words. A semantic cache captures this equivalence by embedding the request with a small model, searching a vector index for close prior requests, and reusing the answer if similarity crosses a threshold.
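A sketch of that lookup, assuming sentence-transformers for the small embedding model and a plain in-memory list standing in for a real vector index (FAISS, pgvector, and similar); the 0.92 threshold is a made-up starting point, not a recommendation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, cheap embedder
THRESHOLD = 0.92  # illustrative; calibrate empirically on your own traffic

# In production this would be a vector index; a list keeps the sketch readable.
_store: list[tuple[np.ndarray, str]] = []

def _embed(text: str) -> np.ndarray:
    v = model.encode(text)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine

def semantic_lookup(prompt: str) -> str | None:
    q = _embed(prompt)
    best_sim, best_answer = -1.0, None
    for vec, answer in _store:
        sim = float(np.dot(q, vec))
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_answer if best_sim >= THRESHOLD else None

def semantic_store(prompt: str, answer: str) -> None:
    _store.append((_embed(prompt), answer))
```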
The benefits are enormous; the risks, equally so. False equivalence is the main trap: two semantically similar questions can demand different answers because of a subtle detail. Threshold calibration is empirical and must be repeated whenever the request distribution shifts.
A useful mitigation is verifying the cached answer with a cheap model call before returning it: "is this answer correct for this request? Yes or no". It adds cost but protects against false positives.
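A sketch of that gate, using OpenAI's SDK as one concrete option; the model choice and the prompt wording are assumptions to adapt:

```python
from openai import OpenAI

client = OpenAI()

def answer_still_fits(prompt: str, cached_answer: str) -> bool:
    # Ask a cheap model to confirm the cached answer before reuse.
    # gpt-4o-mini is an illustrative choice; any inexpensive model works.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=3,
        messages=[{
            "role": "user",
            "content": (
                "Question: " + prompt + "\n"
                "Candidate answer: " + cached_answer + "\n"
                "Does the candidate answer correctly answer the question? "
                "Reply with exactly YES or NO."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```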
Hierarchical cache
In applications with long conversations, the most interesting pattern is caching not the full request but its structural parts: the system instruction, the application context, retrieved documents. Large providers have introduced native prompt-caching mechanisms that do this in an integrated way: mark a fragment as cacheable, and if a later request reuses it, that part is charged at a reduced rate.
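As one concrete example, Anthropic's Messages API exposes this through a cache_control marker on a content block; the model name and system text here are illustrative, and other providers differ in mechanics (OpenAI, for instance, caches long shared prefixes automatically, without markers):

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_INSTRUCTION = "..."  # the long, stable part of the prompt (placeholder)

def ask(user_question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_INSTRUCTION,
                # Mark the stable fragment as cacheable; later requests
                # reusing this exact prefix are billed at a reduced rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_question}],
    )
    return response.content[0].text
```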
This hierarchical cache complements the earlier ones rather than replacing them. A well-designed application can combine exact local cache for trivial requests, semantic cache for equivalences, and provider prompt cache for the shared part of long contexts. Each layer captures a different type of repetition and savings add up.
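Wired together, the lookup order might read like this, reusing the sketches above; call_provider is a hypothetical stand-in for whatever wraps your provider client:

```python
def call_provider(prompt: str) -> str:
    # Hypothetical stand-in for the real provider call
    # (e.g. the ask() sketch above).
    raise NotImplementedError

def lookup(prompt: str) -> str:
    # Layer 1: normalized exact cache (cheapest, lowest risk).
    if (hit := get_cached(prompt)) is not None:
        return hit
    # Layer 2: semantic cache, gated by the cheap verification call.
    if (hit := semantic_lookup(prompt)) is not None and answer_still_fits(prompt, hit):
        return hit
    # Layer 3: the provider; its prompt cache still discounts the
    # shared prefix even when both local layers miss.
    answer = call_provider(prompt)
    put_cached(prompt, answer)
    semantic_store(prompt, answer)
    return answer
```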
Operational traps
- Privacy. If the application caches answers to requests containing personal data, that data ends up stored somewhere unexpected. Clear policies are required: what is cached, for how long, and with what deletion mechanisms.
- Invalidation. An answer cached three days ago may be stale if the underlying data changed. You need short TTLs for time-sensitive answers and explicit invalidation when the underlying data changes (see the sketch after this list).
- Silent degradation. A malfunctioning cache returns reasonable-looking answers. Without quality metrics run regularly on cached traffic samples, teams may believe everything is fine for months.
- Treating the cache as a patch. A cache changes system properties: latency, consistency, behavior under failure. Design for those changes from the start; don't bolt a cache on as a last resort when the bill spikes.
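One way to make invalidation explicit, extending the Redis exact cache above with a reverse index from data sources to the cache keys that depend on them; the tagging scheme is an assumption for illustration, not a standard:

```python
def put_cached_tagged(prompt: str, answer: str, sources: list[str]) -> None:
    # Store the answer as before, plus a reverse index so that a change
    # in any data source can evict every answer that depends on it.
    key = cache_key(prompt)
    r.setex(key, TTL_SECONDS, answer)
    for source in sources:
        r.sadd("llmcache:src:" + source, key)

def invalidate_source(source: str) -> None:
    # Called from whatever code path updates the underlying data.
    tag = "llmcache:src:" + source
    keys = r.smembers(tag)
    if keys:
        r.delete(*keys)
    r.delete(tag)
```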
Metrics that matter
Three metrics let you evaluate cache health:
- Hit rate. What percentage of requests were served from cache. Below 10% the complexity isn't worth it; above 40% it's clearly profitable.
- Effective token savings. Hit rate × average tokens per request × price per token, converted to euros or dollars (a worked example follows this list). This is the real number that matters.
- Perceived quality. Automated evaluation on real requests with and without cache, comparing results. If this metric isn’t watched, the cache can degrade the application invisibly.
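A back-of-the-envelope version of the second metric, with invented numbers to make the units explicit:

```python
# Illustrative numbers only: plug in your own traffic and pricing.
requests_per_day = 100_000
hit_rate = 0.30                    # 30% of requests served from cache
avg_tokens_per_request = 2_500     # prompt + completion tokens avoided per hit
price_per_million_tokens = 3.00    # dollars; depends on model and provider

tokens_saved = requests_per_day * hit_rate * avg_tokens_per_request
daily_savings = tokens_saved / 1_000_000 * price_per_million_tokens
print(f"{tokens_saved:,.0f} tokens/day ≈ ${daily_savings:,.2f}/day")
# 75,000,000 tokens/day ≈ $225.00/day
```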
Quality monitoring is especially critical when introducing semantic caching. The same metrics discipline applies when combining the cache with an inference router to maximize savings. For applications relying on AI agents or conversational RAG pipelines, the provider’s long-context prompt cache is especially valuable.
The production decision
The question I'd use as a gate before deploying is whether the team knows how to detect that the cache is returning bad answers. If not, the cache isn't production-ready even if it technically works. If it does, with automated metrics and periodic reviews, a cache is a massive lever.
Starting with a normalized exact cache is prudent: immediate savings with minimal risk. Adding the provider's prompt cache where applicable captures another slice without extra risk. Only move to a semantic cache once the necessary measurement discipline has been built. That progression separates caches that save money from caches that save money at the cost of something else.