Jacar mascot — reading along A laptop whose eyes follow your cursor while you read.
Inteligencia Artificial Metodologías

FinOps on agent tokens: the invoice that surprises

FinOps on agent tokens: the invoice that surprises

Actualizado: 2026-05-03

The pattern is predictable. A team deploys its first production agent, the first month’s invoice arrives, and it’s double or triple the estimate. The reaction is usually panic followed by aggressive optimisation, sometimes at cost to quality. There’s a third way: disciplined FinOps applied to agents, with levers ordered by cost/risk ratio.

Key takeaways

  • Prompt caching cuts input token cost 50-90% with zero quality impact.
  • A model router saves 30-70% by assigning each task to the right model.
  • Context control prevents the quadratic per-turn cost growth.
  • Batching for non-interactive tasks offers 50% discounts from Anthropic and OpenAI.
  • Without per-tenant and per-task telemetry, none of the other levers can be calibrated.

Lever one: aggressive caching

The easiest win with zero quality impact is caching. Claude prompt caching[1], OpenAI prompt caching[2] and equivalents reduce input token cost by 50-90% when there’s structural repetition. Most applicable pattern: long system prompts with instructions, few-shots, or stable tool context between calls.

Typical implementation: everything that doesn’t change between turns is marked cacheable; what changes (user query, dynamic state) stays outside. Savings are immediate and require no changes to agent logic.

Lever two: model router by difficulty

Not every task needs the most expensive model. A router classifying the query and routing to the right model saves 30-70% depending on task distribution. Classification can be:

  • Simple: keyword rules.
  • Sophisticated: small model as classifier (Haiku 4.5 as router).

The common stack combines Haiku 4.5 or Gemini Flash for light tasks, Sonnet 4.6 for most traffic, and Opus 4.7 only for queries the router flags as complex. Key: measure router accuracy — if it misclassifies and sends the expensive model when unnecessary, savings evaporate.

Lever three: context control

Agents tend to accumulate context. Unchecked, a five-turn conversation reaches twelve thousand tokens; ten turns, twenty-five thousand. Per-turn cost grows quadratically because accumulation is billed on every call.

Techniques that work:

  • Periodic summarisation: every N turns, history compresses to a summary.
  • Sliding window: only the last K full turns.
  • Retrieval selection: at each turn’s start, recover only relevant fragments.

Combined, they cut per-conversation spend by 40-60% without user-perceived quality loss if well calibrated.

Lever four: batching where possible

Non-interactive tasks (nightly processing, report summarisation, bulk classification) accept batching. Anthropic Batch API[3] and OpenAI Batch API[4] offer 50% discounts for tolerable latency (hours instead of seconds). For flows where immediate response isn’t required, not batching is leaving money on the table.

Lever five: telemetry revealing real spend

Necessary condition for optimising is seeing where money goes. Minimum is metrics by:

  • Tenant and task.
  • Model and call type (cacheable or not).
  • Input and output tokens.

At that granularity, tenants consuming 10× the mean, poorly scoped tasks, and broken flows making looped calls surface easily. Tools including this telemetry: Helicone[5], Langfuse[6], Portkey[7], plus native provider dashboards.

What doesn’t work

Three frequent antipatterns:

  1. Switching provider for 10% price differences without changing anything else: engineering time exceeds savings.
  2. Downgrading model without evaluating: quality drops, customers complain, rollback with net loss.
  3. “Negotiating with provider” without real volume: volume discounts start where top 1% customers sit; below, no leverage.

Conclusion

Agent FinOps is a mature area with clear levers. Applied in order — caching, routing, context control, batching, telemetry — they cut cost by half or a third without perceived quality impact. What doesn’t work is reacting to the invoice with panic and cutting visible things; what works is investing a few days in instrumentation and architectural decisions that pay back in month one.

Was this useful?
[Total: 5 · Average: 4.2]
  1. Claude prompt caching
  2. OpenAI prompt caching
  3. Anthropic Batch API
  4. OpenAI Batch API
  5. Helicone
  6. Langfuse
  7. Portkey

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.