
FinOps applied to AI: where the cost really goes

Updated: 2026-05-03

Teams that have done FinOps well in traditional cloud discover, when they reach AI workloads, that the tools they relied on stop helping. Kubecost, OpenCost, and the AWS or GCP billing dashboards are very good at attributing cost to instances, storage and traffic. They aren't prepared to tell you how much an agent call that invokes three tools and generates two thousand output tokens costs. And that is exactly where the bill concentrates once the AI system reaches production.

Key takeaways

  • The bulk of AI spend in 2025 is inference via third-party APIs, not your own GPU compute.
  • Input tokens are what usually spiral: RAG systems with generous context multiply input cost by 5–6.
  • Chained tool calls can cost 10–15× more than a direct non-agent call.
  • Monthly experimental spend can exceed production if there is no organisational convention.
  • The first optimisation lever is always visibility: billing segmented by project, API key or tag.
  • Reducing retrieved chunks in RAG and routing to cheaper models are the two changes with the best cost-benefit ratio.

Where the cost really goes

Initial intuition often misleads. Many teams assume the bulk of spend will be GPU compute for their own training or fine-tuning. In reality, unless you operate a model lab, most 2025 AI spend is inference via third-party APIs, with two clear components: input and output tokens.

Within inference, input tokens are what usually spiral. They are paid per call even if the response is short, and grow invisibly when additional context is introduced. RAG systems retrieving eight to ten generously-sized chunks can multiply input cost by five or six over a context-free call. Agents with persistent memory and history accumulation grow even faster.

Output tokens are fewer in volume but more expensive per unit. If the application tends toward long responses with step-by-step reasoning, the output side can dominate the bill. Extended-reasoning models like OpenAI o1 or Claude's thinking variants carry an additional multiplier: hidden reasoning tokens that are billed even though the user never sees them.

Tool calls are the third, least-understood vector. Every time an agent invokes a tool, the tool’s response returns to the model as input for the next generation. In agents using many tools, calls chain and each adds further context. An agent calling three or four tools before answering can be spending 10–15× more than a direct non-agent call.
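
To see why the multiplier gets so large, it helps to do the token accounting explicitly. The sketch below uses hypothetical prompt sizes, tool-response sizes and prices; the only structural assumption is the one described above, that each tool response is appended to the context the model re-reads on the next step.

```python
# Back-of-envelope sketch of why chained tool calls get expensive: every
# tool response is appended to the context the model re-reads on the next
# step. All sizes and prices below are hypothetical; plug in your own.

PRICE_IN = 2.50 / 1_000_000    # $ per input token (assumed)
PRICE_OUT = 10.00 / 1_000_000  # $ per output token (assumed)

SYSTEM_PROMPT = 800   # tokens
USER_QUESTION = 150
TOOL_RESULT = 2_000   # average tokens each tool returns
FINAL_ANSWER = 500

def direct_call_cost() -> float:
    """One request, no tools."""
    return (SYSTEM_PROMPT + USER_QUESTION) * PRICE_IN + FINAL_ANSWER * PRICE_OUT

def agent_call_cost(tool_calls: int = 4) -> float:
    """Each tool round-trip re-sends the whole accumulated context as input."""
    context = SYSTEM_PROMPT + USER_QUESTION
    cost = 0.0
    for _ in range(tool_calls):
        cost += context * PRICE_IN   # model reads everything so far, picks a tool
        cost += 50 * PRICE_OUT       # short tool-call message it generates
        context += TOOL_RESULT       # tool output comes back as more input
    return cost + context * PRICE_IN + FINAL_ANSWER * PRICE_OUT  # final answer

if __name__ == "__main__":
    d, a = direct_call_cost(), agent_call_cost()
    print(f"direct: ${d:.4f}   agent: ${a:.4f}   ratio: {a / d:.1f}x")
```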

The fourth vector is reindexing. RAG systems reindex documents when they change, and each reindex incurs embedding cost. If a team doesn't control reindexing frequency over a large corpus, embedding can become a significant fraction of the total. Reindex cost also factors into continuous RAG evaluation.

Failed experiments are the ghost spend

Every new prompt evaluation, every fine-tune tried, every A/B comparison between models consumes tokens. In teams iterating fast, the sum can be substantial. Unlike production spend, experiments rarely have clear business-unit attribution.

I’ve seen cases where monthly experimental spend exceeded production, simply because several engineers were iterating on their own API accounts without anyone aggregating. Providers offer team-segmented projects or keys, but disciplined use requires organisational convention, not just technical capability.

Controls that work

  • First layer: visibility. Billing segmented by project, API key or tag is a prerequisite. All major providers offer it. Without segmentation, there is no diagnosis.
  • Second layer: per-request instrumentation. Every model call should emit at least three metrics: model used, input tokens, output tokens. Aggregated by service and user, this builds the cost chart that locates hotspots. Tools like Helicone[1], LangSmith[2] or a homebrew OpenTelemetry sidecar work for this; a minimal sketch follows this list.
  • Third layer: soft limit per business unit. Each team or application should have an assigned monthly budget with warnings at 80 % before exhausting it. Hard limits are risky because they break service.
  • Fourth layer: model selection by context. Not every call needs the most expensive model. Price differences between GPT-4 and GPT-4o-mini, or between Claude Sonnet and Haiku, are an order of magnitude. Exploiting them requires routing logic more than fine optimisation.
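
As a reference for the second layer, here is a minimal instrumentation sketch, assuming the OpenAI Python SDK. The wrapper name and log fields are illustrative; Helicone or LangSmith give you equivalent records without writing any of this.

```python
# Minimal per-request instrumentation sketch, assuming the OpenAI Python SDK.
# The wrapper name and log fields are illustrative choices.
import json
import logging
import time

from openai import OpenAI

client = OpenAI()
log = logging.getLogger("llm_cost")

def tracked_completion(messages, model="gpt-4o-mini", *, service, user_id, **kwargs):
    """Make a chat call and emit one structured cost record for it."""
    start = time.monotonic()
    resp = client.chat.completions.create(model=model, messages=messages, **kwargs)
    log.info(json.dumps({
        "service": service,                              # application / business unit
        "user_id": user_id,                              # who triggered the call
        "model": resp.model,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "latency_s": round(time.monotonic() - start, 3),
    }))
    return resp
```

Shipped to whatever sink the team already has, an OpenTelemetry collector, a log warehouse, even a flat file, those records are enough to aggregate cost by service, user and model and draw the chart the second layer asks for.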

Controls that are theater

Some sensible-sounding controls don’t pay off in practice:

  • Caching LLM responses looks like an obvious optimisation but fails in most applications. Cache hit ratio in production RAG typically sits below 5 % except in FAQ or repetitive support.
  • Capping max tokens per response mostly just truncates answers: providers bill the tokens actually generated, not the limit, so a cap saves nothing until it is low enough to cut responses off mid-sentence.
  • Over-optimising system prompts yields diminishing returns: savings are marginal compared to cutting RAG context.

Some real numbers

A corporate support agent with knowledge-base access, serving about two thousand monthly users averaging three queries each, generated around 1,200 euros in monthly OpenAI spend before optimisation. After routing classification to a cheaper model, reducing retrieved chunks from ten to four, and caching embeddings, cost dropped to 380 euros with equivalent quality according to human evaluation.
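
The sketch below only illustrates the pattern behind those two changes, assuming the OpenAI Python SDK; the model names, intent labels and the retrieve() helper are assumptions, not the actual setup of that project.

```python
# Sketch of the two changes from the support-agent case: intent classification
# moves to a cheap model, and retrieval returns fewer chunks. Model names,
# intent labels and the retrieve() helper are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
TOP_K = 4  # retrieved chunks per query (was 10 before the optimisation)

def classify_intent(query: str) -> str:
    """A cheap model is enough to produce a one-word intent label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with one word: faq, incident or other."},
            {"role": "user", "content": query},
        ],
        max_tokens=3,
    )
    return resp.choices[0].message.content.strip().lower()

def answer(query: str, retrieve) -> str:
    """Simple FAQ-style queries stay on the cheap model; everything else
    goes to the expensive one, and both only see TOP_K chunks of context."""
    model = "gpt-4o-mini" if classify_intent(query) == "faq" else "gpt-4o"
    context = "\n\n".join(retrieve(query, k=TOP_K))
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"{context}\n\n{query}"},
        ],
    )
    return resp.choices[0].message.content
```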

An email-draft generator was costing 2,800 euros monthly. Diagnosis showed the bulk was output tokens from long responses users later trimmed. Modifying the prompt to ask for a shorter initial draft cut cost to 1,100 euros, with most users reporting preference for the shorter version.

A RAG system with full daily reindexing over a two-million-document corpus was spending more on embedding than on inference. Moving to incremental reindexing with change detection cut embedding cost by 90 % with no search-quality impact.
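
A minimal sketch of that change-detection idea, assuming documents can be hashed by content; embed() and upsert() stand in for whatever embedding API and vector store the stack actually uses.

```python
# Sketch of incremental reindexing via content hashing: only documents whose
# text changed since the last run are re-embedded. embed() and upsert() are
# placeholders for the embedding API and vector store in use.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_reindex(documents, seen_hashes, embed, upsert) -> int:
    """documents: iterable of (doc_id, text); seen_hashes: dict doc_id -> hash."""
    changed = 0
    for doc_id, text in documents:
        h = content_hash(text)
        if seen_hashes.get(doc_id) == h:
            continue                     # unchanged: skip the embedding cost
        upsert(doc_id, embed(text))      # only pay for what actually changed
        seen_hashes[doc_id] = h
        changed += 1
    return changed
```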

My read

FinOps for AI demands new tools and new patterns. Instance and storage paradigms don’t translate well, and generic cost-observability tooling misses the unit that matters: the model call with its context and output.

What surprises me most when reviewing cases is how variable the cost-per-user ratio is between teams with comparable products. I've seen differences of five or six times, almost always attributable to architectural decisions: model routing, context size, reindex frequency and the number of iterations per call. None of those decisions is irreversible.

The practical recommendation is the same as with any cloud: instrument before optimising. Without per-request, per-user, per-model data, optimisations are hunches. With that data, two or three hotspots almost always emerge that concentrate half the spend and, with moderate changes, lower total cost without touching perceived quality.
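
As an illustration of that last step, a sketch that folds the per-request records from the instrumentation example above into a cost-per-service ranking; the price table and the JSON-lines log file are assumptions.

```python
# Sketch of the "find the hotspots" step: aggregate per-request records into
# cost per service. Price table and JSON-lines log file are assumptions.
import json
from collections import defaultdict

PRICES = {  # $ per input / output token, illustrative values
    "gpt-4o": (2.50e-6, 10.00e-6),
    "gpt-4o-mini": (0.15e-6, 0.60e-6),
}

def cost_by_service(log_path: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            r = json.loads(line)
            p_in, p_out = PRICES.get(r["model"], (0.0, 0.0))
            totals[r["service"]] += r["input_tokens"] * p_in + r["output_tokens"] * p_out
    # Highest spenders first: the top two or three usually carry half the bill.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```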

References

  1. Helicone
  2. LangSmith

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.