The initial Claude 4 family: first quality tests

Updated: 2026-05-03

Anthropic released Claude Opus 4 and Claude Sonnet 4 on 22 May 2025. It’s the first big naming leap in the family since the 3.5 series appeared in June 2024, about a year earlier. The number change isn’t just marketing: it represents a deep revision of how the models reason about long tasks, especially programming, and a clear refocusing toward agentic flows that can chain many steps without human intervention. This post gathers one month of real use with both models in daily code work, technical documentation, and text review.

For context on the models preceding this family, the analysis of Claude 3.7 Sonnet covers the direction Claude 4 has continued. Model cost management in production is covered in FinOps for AI infrastructure.

Key takeaways

  • The clearest improvement is in programming: multi-hour refactors that used to require human intervention midway now progress without interruption.
  • Sonnet 4 covers 80% of cases at one fifth the price of Opus 4.
  • Agentic capability has improved notably: Opus 4 maintains goals through 30 steps with tool calls without drifting.
  • The 200 k token window is better utilized; details from the first 50 k tokens stay present without needing repetition.
  • Hallucinations persist on recent APIs; the knowledge cutoff is early 2025.

What changes versus 3.5 and 3.7

The 3.5 series, introduced in June 2024 with Sonnet and expanded in October, had set a high quality ceiling. In February 2025 came Claude 3.7 Sonnet with optional extended thinking. Claude 4 unifies those two directions in a family with two tiers:

  • Opus 4: for complex tasks where marginal quality matters.
  • Sonnet 4: daily-driver model with nearly equivalent quality and 5× lower price.

The clearest improvement is in programming. Anthropic claims Opus 4 leads SWE-bench Verified by a perceptible margin; in real use, multi-hour refactors that previously needed human intervention midway now proceed without interruption. The other visible improvement is long-context handling: both models keep the 200 k token window but use it better. Details from the first 50 k tokens stay present without being repeated, whereas the 3.5 series needed reminders every few turns.

Real programming use

For concrete coding tasks, the difference between tiers shows more than expected. Opus 4 sustains a coherent reasoning thread across multiple files. In a real Express.js to Fastify migration with 40 route files, Opus 4 identified cross-cutting dependencies that Claude 3.5 Sonnet missed. Sonnet 4 also caught them, but with occasional mistakes that needed correction.

Where Sonnet 4 wins is in day-to-day interactive editing: short fixes, test runs, localized refactors. Lower latency and much lower cost make the experience better even if Opus 4 might be slightly more accurate. The pattern that works in practice is Sonnet 4 for roughly 80% of the work and Opus 4 for complex problems where the cost of error is high.
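
The split described above can be written down as a small routing rule. What follows is only a sketch under stated assumptions: the `Task` fields, the 10-file threshold, and the tier labels are hypothetical choices made for illustration, not identifiers from the Anthropic API or the setup used in these tests.

```python
from dataclasses import dataclass

# Hypothetical tier labels for illustration; not official API model identifiers.
SONNET = "sonnet-4"
OPUS = "opus-4"


@dataclass
class Task:
    description: str
    files_touched: int     # how many files the change spans
    high_error_cost: bool  # e.g. architecture design, critical review
    interactive: bool      # someone is waiting in an IDE or chat


def choose_model(task: Task, multi_file_threshold: int = 10) -> str:
    """Default to Sonnet 4; escalate to Opus 4 for complex or high-stakes work."""
    # Interactive, low-stakes edits favor latency and price.
    if task.interactive and not task.high_error_cost:
        return SONNET
    # High error cost or wide multi-file changes justify the slower, pricier tier.
    if task.high_error_cost or task.files_touched >= multi_file_threshold:
        return OPUS
    return SONNET


quick_fix = Task("rename a variable", files_touched=1,
                 high_error_cost=False, interactive=True)
migration = Task("Express.js to Fastify migration", files_touched=40,
                 high_error_cost=True, interactive=False)

print(choose_model(quick_fix))
print(choose_model(migration))
```

The threshold is arbitrary; the useful part is making the escalation criteria explicit so they can be tuned against real error rates.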

Agentic capability has also improved clearly. In a SQL database administration agent, Opus 4 kept the goal through 30 steps with tool invocations without losing track, while Claude 3.5 Sonnet drifted around step 15. This sustained behavior is what makes these models a better fit for long-running automated flows, a topic we develop in AI agents in the enterprise.

Long-text analysis and technical review

In long-text analysis tasks (reviewing technical specs, contracts, internal documentation) the difference between Claude 4 and 3.5 is smaller than in code. Both generations are competent and the perceptible jump is modest, around 10–15% accuracy improvement. What does stand out is a better ability to preserve the voice and tone of the original document when proposing revisions: in the 3.5 series proposals sometimes sounded like Claude; in the 4 series the style is better respected.

Limitations that remain

Not everything improved:

  • Context window: stays at 200 k tokens. Sufficient for most cases but not for analyzing large repos whole without retrieval strategies.
  • Latency on Opus 4: a reply with extended thinking can take a minute or more. Opus 4 fits better as a batch or pipeline step, not as a fast-chat model.
  • Hallucinations: persist, especially on recent APIs or frequently changing libraries. The knowledge cutoff is early 2025; anything later must be supplied via context or tools.
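
For the context window limit, a quick pre-check can tell whether a set of files has any chance of fitting before deciding on a retrieval strategy. This is a rough sketch: the 4-characters-per-token ratio is a coarse assumption (real tokenizers vary by language and content), and the function names, the output reserve, and the file suffixes are hypothetical.

```python
from pathlib import Path

CONTEXT_WINDOW_TOKENS = 200_000  # window size cited in this post
CHARS_PER_TOKEN = 4              # coarse assumption; real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_window(texts, reserve_for_output: int = 20_000) -> bool:
    """True if the estimated input leaves room for the reserved output budget."""
    budget = CONTEXT_WINDOW_TOKENS - reserve_for_output
    return sum(estimate_tokens(t) for t in texts) <= budget


def repo_texts(root: str, suffixes=(".py", ".js", ".ts", ".md")):
    """Yield the contents of source files under root (hypothetical suffix filter)."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            yield path.read_text(encoding="utf-8", errors="ignore")


small = ["x" * 1_000]       # about 250 estimated tokens
large = ["x" * 1_000_000]   # about 250,000 estimated tokens

print(fits_in_window(small))
print(fits_in_window(large))
```

Calling `fits_in_window(repo_texts("."))` on a checkout gives a first answer; when it returns False, the repo needs chunking or retrieval rather than a single prompt.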

Prices and when to use each

Reference prices at time of publication:

  • Opus 4: $15 per million input tokens, $75 per million output.
  • Sonnet 4: $3 and $15, respectively.

The 5× difference in Sonnet’s favor makes Sonnet 4 the natural default in the IDE and in interactive agents, with Opus 4 reserved for tasks where the cost of error is high: architecture design, large refactors, critical reviews. For generating large text volumes, Claude 3.5 Haiku remains the favorite for price and latency.
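
To make the price gap concrete, here is a minimal cost calculation using the reference prices listed above. The monthly token volumes and the 80/20 split are hypothetical figures chosen for illustration, and the blend assumes token volume divides in the same proportion as tasks, which real workloads may not do.

```python
# Reference prices from this post, in USD per million tokens.
PRICES = {
    "opus-4": {"input": 15.0, "output": 75.0},
    "sonnet-4": {"input": 3.0, "output": 15.0},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload on a single model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def blended_cost(input_tokens: int, output_tokens: int,
                 sonnet_share: float = 0.8) -> float:
    """Cost when sonnet_share of the token volume goes to Sonnet 4, the rest to Opus 4."""
    sonnet = cost_usd("sonnet-4", input_tokens, output_tokens)
    opus = cost_usd("opus-4", input_tokens, output_tokens)
    return sonnet_share * sonnet + (1 - sonnet_share) * opus


# Hypothetical monthly workload: 10M input tokens, 2M output tokens.
IN_TOK, OUT_TOK = 10_000_000, 2_000_000

print(f"Opus 4 only:   ${cost_usd('opus-4', IN_TOK, OUT_TOK):.2f}")
print(f"Sonnet 4 only: ${cost_usd('sonnet-4', IN_TOK, OUT_TOK):.2f}")
print(f"80/20 blend:   ${blended_cost(IN_TOK, OUT_TOK):.2f}")
```

For that workload the figures come to about $300 on Opus 4 alone, $60 on Sonnet 4 alone, and $108 for the 80/20 blend, so the Opus share dominates the bill even when it handles a minority of the traffic.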

My read

Claude 4 is not a revolution but a well-executed consolidation. Anthropic has widened what already worked in 3.5 and 3.7 and pushed it up another rung, with real improvements in programming, agents, and long-context consistency. It’s not the kind of leap that justifies dropping all prior flows, but it is the kind that justifies revisiting each existing pipeline to see where 4 improves quality.

The Opus/Sonnet duality makes sense. Sonnet 4 becomes the default model, priced for continuous use with quality that closes in on Opus. For anyone designing products integrating language models, now is the moment to revisit the model choice: improvements in agent behavior and context consistency may unlock usage patterns that were clumsy with 3.5.

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.