o3 in public: the reasoning leap is confirmed
Actualizado: 2026-05-03
OpenAI’s o3 series, announced in December with a wave of surprising benchmarks, has started rolling out to the public during January. First came o3-mini for ChatGPT Plus users and, shortly after, API access began opening to selected customers. Not the world’s widest deployment, but there’s now enough material from real users to evaluate with less uncertainty than during the first announcement weeks.
This post reflects my read after several weeks testing o3-mini on real cases, contrasting with technical comments from other users and third-party benchmarks. It’s not a marketing review but an attempt to separate real leap from hype.
Key takeaways
- o3-mini produces significantly better results on logic, math, and complex code problems versus GPT-4o, without forced chain-of-thought.
- Response time is much higher (15-20 seconds vs. GPT-4o’s 2): a real factor for interactive use cases.
- For creative text generation, o3-mini is slightly worse than GPT-4o; prose feels more mechanical.
- Factual hallucinations still occur with similar frequency; better reasoning doesn’t cure this.
- The optimal pattern is using o3-mini only for complex-reasoning pieces and faster, cheaper models for the rest of the application.
What has changed with o3
The o3 family continues the path opened by o1: models that spend more time and tokens “thinking” before answering, with explicit internal reasoning architecture. What’s new versus o1 is a quantitative leap in several directions: significantly better results in reasoning benchmarks (ARC-AGI generated most headlines), better control of the thinking process, and more efficiency.
o3-mini, which is the one most people can use right now, is the first of the series within reach of any developer with a reasonable budget. Though it doesn’t match the full o3, it’s already beyond GPT-4o for almost any problem requiring multi-step reasoning.
Where the leap is real
Logic and math problems that previously required guiding the model — careful prompting, forced chain of thought, intermediate verification — are now solved with direct answers. Where you used to say “think step by step and check your reasoning,” the model now does it internally.
This is especially evident in programming tasks with complex logic. Refactoring an algorithm, finding a subtle bug, designing a data structure with certain invariants: in all these cases o3-mini produces results that previously required several iterations with GPT-4o. The time gained per interaction is real.
It’s also notable in temporal and causal reasoning. Problems involving multiple steps with dependencies — planning an infrastructure migration, reasoning about event order in a distributed system — now have more coherent responses.
In pure math, o3-mini solves problems GPT-4o consistently failed. And when pushed to reason longer (via reasoning_effort), the margin grows further.
Where it still fails
Despite the leap, there are areas where o3 isn’t better than its predecessors and can be worse:
- Creative text generation: o3-mini is slightly worse than GPT-4o. Prose feels more mechanical, less free. For narrative text, fiction, or styled writing, GPT-4o remains preferable.
- Ambiguous-answer tasks: o3-mini sometimes “overthinks” and delivers over-elaborated answers to questions asking for something direct. Like hiring an engineer to fix a chair.
- Response time: a question GPT-4o would answer in 2 seconds can take 15-20 seconds in o3-mini. A real factor for interactive use cases.
- Factual hallucinations: though less frequent, they still happen. o3-mini invents names, dates, and references with the same shamelessness as predecessors when it lacks correct information. Better reasoning doesn’t cure this.
The price impact
o3-mini is available at very competitive prices. On the API, cost per million input and output tokens is reasonable for the quality it offers, especially at the lower reasoning-effort tier. It isn’t as cheap as GPT-4o but is in the same order of magnitude.
For full o3 (not the mini), announced prices are significantly higher, and initial tests suggest that for many tasks, the mini is more than enough.
What changes for product builders
The most immediate change is in application architecture. The pattern of “different models for different tasks” will normalize:
- Complex-reasoning pieces (problem solving, structured analysis, planning): o3-mini as the default.
- Rest of the application (user text, short answers, personalization): faster, cheaper models.
For developers working so far only with GPT-4o or Claude, the mindset shift is recognizing that deep reasoning is no longer a prompt-engineering problem. Stop spending tokens guiding the model and start trusting it reasons internally.
My read
o3 confirms that the bet on models that “think more” works. The leap over o1 is material, not incremental. And the fact that o3-mini is available at reasonable prices means this isn’t a benchmark toy but a tool usable in production.
The medium-term effect is that apps with complex-reasoning tasks will differentiate quickly: those using reasoning models will solve problems that those not using them will leave half-done. If you work on a product with a reasoning piece, my concrete recommendation is spending two or three days testing o3-mini on that specific piece. Most cases will benefit, and the few that don’t are also useful information.