Lessons from agents in production in 2025: summary for 2026

March 26, 2026

Table of contents

Key takeaways
Most frequent failure modes
Architectural patterns that work
The real cost that surprises
Tasks that don’t fit
An evaluation pattern that repeats
When it pays off

Updated: 2026-05-03

2025 was the first year AI agents stopped being pilot projects and became production systems at hundreds of companies. Entering 2026, there’s enough data to draw lessons that were still speculation in 2024. This article organizes the lessons that recur in public postmortems, conversations with teams, and analyses shared by platforms such as Anthropic, OpenAI, and LangChain.

Key takeaways

Three failure modes dominate: degenerative reasoning loops, data hallucination in RAG systems, and silent misalignment between what was asked and what was interpreted.
Cost per resolved task lands between $0.05 and $0.30 in most production cases.
Teams with measurable success share a continuous evaluation pattern with datasets of 50 to 500 real cases.
Three characteristics make a task a bad candidate: high consequences from small errors, undocumented context, and a verification cost higher than the cost of the task itself.

Most frequent failure modes

Three failure types appear in most postmortems, well ahead of the rest.

Degenerative reasoning loop: the agent enters a cycle in which each step consumes tokens but makes no progress toward the objective. Sometimes the cause is a tool error that returns an ambiguous result; sometimes it is a badly designed prompt that sets no clear termination criterion. The economic cost can be high: there are documented cases of a single loop consuming over a thousand model calls in minutes.

The defense with the best results is twofold:

A hard step limit per task (commonly 15 to 30, depending on complexity).
An explicit termination criterion that the agent must verify at each step.

When either barrier is missing, loops appear sooner or later.
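
The two barriers above can be sketched as a small driver loop. This is a minimal illustration, not any framework's API: `run_agent`, `RunResult`, and the `step` and `is_done` callables are hypothetical names, and the default of 20 steps is simply a value inside the 15 to 30 range mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunResult:
    done: bool          # True only if the termination criterion was met
    steps_used: int
    history: List[str] = field(default_factory=list)


def run_agent(
    step: Callable[[List[str]], str],
    is_done: Callable[[List[str]], bool],
    max_steps: int = 20,
) -> RunResult:
    """Run `step` until `is_done` holds or the hard step limit is hit."""
    history: List[str] = []
    for used in range(1, max_steps + 1):
        history.append(step(history))   # one model or tool call (hypothetical)
        if is_done(history):            # explicit criterion, checked every step
            return RunResult(True, used, history)
    # Limit reached: stop and report failure instead of spending more tokens.
    return RunResult(False, max_steps, history)
```

A caller would treat `done=False` as a failed task to escalate or inspect, rather than retrying the same loop with a higher limit.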

Data hallucination in systems that combine retrieval with generation: the agent searches for information in internal documents, receives partial fragments, and fills the gaps with plausible but invented data. In regulated sectors this has led to reportable incidents.

The mitigation that works is to demand explicit citations, with the agent referencing the exact source of each relevant piece of data, and to clearly separate retrieved content from generated content. Teams that apply this pattern with discipline report reductions of about 90% in hallucination incidents. For how to implement this evaluation, see production agent evaluations.
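
One way to enforce explicit citations is to have the agent return structured claims and reject any claim whose quote is not found verbatim in the cited fragment. The sketch below assumes that output shape; `Claim`, `uncited_claims`, and the fragment IDs are hypothetical, and a verbatim substring match is a deliberately strict stand-in for whatever matching a real system would use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Claim:
    text: str        # the statement the agent makes
    source_id: str   # ID of the retrieved fragment it cites
    quote: str       # exact span copied from that fragment


def uncited_claims(claims: List[Claim], retrieved: Dict[str, str]) -> List[Claim]:
    """Return claims whose citation cannot be verified against retrieved text."""
    bad = []
    for claim in claims:
        fragment = retrieved.get(claim.source_id)
        # Reject unknown sources, empty quotes, and quotes not present verbatim.
        if fragment is None or not claim.quote or claim.quote not in fragment:
            bad.append(claim)
    return bad
```

Anything returned by `uncited_claims` would be dropped or flagged as generated content, which keeps the retrieved and generated parts of the answer visibly separate.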

Silent misalignment between what the user asks and what the agent understands: in complex tasks, the agent completes what it interpreted, but the user asked for something different. Without intermediate verification, the error surfaces at the end and the cost of redoing the work is high.

The solution is explicit confirmation points in long tasks: the agent presents its interpretation and plan before starting, the user approves or corrects it, and only then does the agent execute. This adds friction, but the net balance is favorable in almost every context.
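
A confirmation point can be as simple as a gate between planning and execution. The following is a sketch under assumed names: `Plan`, `run_with_confirmation`, and the `propose`, `approve`, and `execute` callables are all hypothetical and would wrap a model call, a user prompt, and the actual work respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plan:
    interpretation: str   # the agent's reading of the request
    steps: List[str]      # what it intends to do, in order


def run_with_confirmation(
    request: str,
    propose: Callable[[str], Plan],
    approve: Callable[[Plan], bool],
    execute: Callable[[Plan], str],
) -> Optional[str]:
    """Execute only after the proposed plan has been approved."""
    plan = propose(request)
    if not approve(plan):   # the user (or a supervisor) rejects the plan
        return None         # nothing has been executed at this point
    return execute(plan)
```

Returning `None` on rejection leaves the caller free to collect a correction and call `propose` again, so a misreading costs one round trip instead of a full redo.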

Architectural patterns that work

Of the architectures tested during 2025, three show consistently better results:

Hierarchical decomposition with supervisor. A supervisor agent receives the task, decomposes it into subtasks, and delegates each to specialized agents with bounded context. The supervisor aggregates results. The advantage is context control: each subagent sees only what it needs.
Workflow with explicit state. Instead of a free agent continuously deciding its next step, the system defines a state machine in which each transition is validated. This pattern sacrifices flexibility but gains a great deal of traceability and ease of debugging. Platforms like LangGraph and Temporal popularized the pattern.
Tools with preflight and confirmation. Before executing an action with an external effect (writing to a database, sending an email, calling a payment API), the agent presents the planned action and requests confirmation, either from a human or from a supervisor.
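
The explicit-state pattern can be illustrated with a plain transition table. This sketch does not use the LangGraph or Temporal APIs; the state names, `TRANSITIONS`, and the `Workflow` class are hypothetical and only show the idea of rejecting undeclared transitions while recording a trace.

```python
from typing import Dict, List, Set

# Hypothetical workflow states; each key lists the states it may move to.
TRANSITIONS: Dict[str, Set[str]] = {
    "received": {"planned"},
    "planned": {"awaiting_confirmation"},
    "awaiting_confirmation": {"executing", "planned"},  # approve or revise
    "executing": {"done", "failed"},
    "done": set(),
    "failed": set(),
}


class Workflow:
    def __init__(self) -> None:
        self.state = "received"
        self.trace: List[str] = ["received"]   # full path, useful for debugging

    def advance(self, new_state: str) -> None:
        """Move to `new_state` only if the transition is declared."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        self.trace.append(new_state)
```

In this sketch the agent cannot jump from "planned" straight to "executing": the only path to an external effect goes through "awaiting_confirmation", which is where a preflight check would sit.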

What has not worked well are fully autonomous multi-agent architectures where multiple agents debate until reaching consensus. They sound sophisticated but in practice introduce inconsistencies and hard-to-predict costs.

The real cost that surprises

The per-request cost of models dropped significantly during 2025, but the aggregate cost of a production agent rarely dropped proportionally. Agents consume much more context than initially forecast.

Cost per resolved task lands between $0.05 and $0.30 in most production cases, figures that are multiples of the naive estimates. An agent serving a thousand requests per day costs between $50 and $300 daily in model usage alone. To reduce these costs, the complexity-based routing described in Claude Haiku 4.5 is the first lever.

The other cost surprise is observability. An agent generates very dense traces, and storing and analyzing them consumes budget that many projects didn’t foresee. Teams report total observability costs running at 15 to 25 percent of model cost.
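
The figures in this section combine into a quick budget estimate. The helper below is a back-of-the-envelope sketch: `daily_cost_range` is a hypothetical name, it assumes one resolved task per request (as the thousand-requests example does), and it pairs the low cost with the low observability share and the high cost with the high share.

```python
from typing import Tuple


def daily_cost_range(
    tasks_per_day: int,
    cost_per_task: Tuple[float, float] = (0.05, 0.30),        # dollars, low and high
    observability_share: Tuple[float, float] = (0.15, 0.25),  # fraction of model cost
) -> Tuple[float, float]:
    """Return (low, high) daily cost in dollars: model plus observability."""
    low_model = tasks_per_day * cost_per_task[0]
    high_model = tasks_per_day * cost_per_task[1]
    low = low_model * (1 + observability_share[0])
    high = high_model * (1 + observability_share[1])
    return low, high
```

For a thousand tasks per day this gives roughly $57.50 to $375 daily once observability is added to the $50 to $300 model bill.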

Tasks that don’t fit

The clearest lesson of the year is that not every task is a candidate for an agent. Three characteristics make a task a bad candidate:

High consequences from small errors. Financial transfers, legal document modifications, medical decisions with immediate effect. The issue is not that the agent can’t get 99% of cases right, but that the 1% it gets wrong is unacceptable.
Context that isn’t documented. An agent only knows what’s written; when the task depends on the team’s tacit knowledge, the agent fails or invents. Before handing a task to an agent, audit whether the needed context exists in writing.
Verification cost higher than doing the task. If an expert takes 5 minutes to do something and 10 minutes to verify the agent did it right, automating doesn’t help.

An evaluation pattern that repeats

Teams with measurable success in 2025 share a continuous evaluation pattern. They build a case set (typically 50 to 500 real cases) with expected answers, and run the agent against that set every time there’s a significant change to the prompt, model, or tools.

The initial investment is considerable, but the return appears soon. Without that set, any change is an act of faith; with it, the team knows whether a tweak improves, worsens, or doesn’t change performance. It’s the difference between engineering and craft.
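
The evaluation loop described here can be sketched in a few lines. The names `Case`, `pass_rate`, and `compare` are hypothetical, exact string matching is an assumption standing in for whatever grading a team actually uses, and the 0.01 tolerance is an arbitrary illustrative threshold.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def pass_rate(agent: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases where the agent's answer matches the expected one."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(agent(c.prompt).strip() == c.expected.strip() for c in cases)
    return passed / len(cases)


def compare(baseline: float, candidate: float, tolerance: float = 0.01) -> str:
    """Classify a change as 'improves', 'worsens', or 'no change'."""
    delta = candidate - baseline
    if delta > tolerance:
        return "improves"
    if delta < -tolerance:
        return "worsens"
    return "no change"
```

Running `pass_rate` on the same case set before and after a prompt, model, or tool change, and feeding both numbers to `compare`, gives the three-way answer the paragraph above describes.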

When it pays off

For a team evaluating whether to put an agent in production, the practical filter is clear. It pays off when:

The task has recognizable structure.
The necessary context is documented.
The cost of small errors is absorbable.
The team has the capacity to operate continuous evaluation beyond deployment.

If any of the four fails, the agent will cause trouble.

The cross-cutting lesson from 2025 is that agents are less different from the rest of software than marketing suggests. They benefit from the same practices that make any production system good: serious observability, continuous evaluation, architecture with explicit state, confirmation points at critical places, and humility about what can’t yet be solved well. Teams that treat agents as boring engineering rather than as a conversational miracle end up with systems that work.

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.