o3 in public: the reasoning leap is confirmed


OpenAI’s o3 series, announced in December with a wave of surprising benchmarks, has started rolling out to the public during January. First came o3-mini for ChatGPT Plus users and, shortly after, API access began opening to selected customers. It isn’t the world’s widest deployment yet, but there’s now enough material from real users to evaluate the models with less uncertainty than in the weeks right after the announcement.

This post reflects my read after several weeks of testing o3-mini on real cases, contrasted with comments from other technical users and with third-party benchmarks. It’s not a marketing review, but an attempt to separate the real leap from the hype.

What has changed with o3

The o3 family continues the path opened by o1: models that spend more time and tokens “thinking” before answering, with an architecture of explicit internal reasoning. What’s new versus o1 is a quantitative leap in several directions: significantly better results in reasoning benchmarks (ARC-AGI is the one generating most headlines), better control of the thinking process, and more efficiency.

o3-mini, which is the one most people can use right now, is especially interesting. It’s the first of the series within reach of any developer with a reasonable budget, and though it doesn’t match the full o3, it’s already beyond GPT-4o for almost any problem requiring multi-step reasoning.

Where the leap is real

The first thing I noticed testing it is that logic and math problems that previously required guiding the model (careful prompting, forced chain of thought, intermediate verification) are now solved with direct answers. Where you used to have to say “think step by step and check your reasoning”, the model now does it internally and delivers something coherent.

This is especially evident in programming tasks with complex logic. Refactoring an algorithm, finding a subtle bug, designing a data structure with certain invariants: in all these cases o3-mini produces results that previously required several iterations with GPT-4o. The time gained per interaction is real.

It’s also notable in temporal and causal reasoning. Problems involving multiple steps with dependencies, like planning an infrastructure migration or reasoning about event order in a distributed system, now have more coherent responses. Not perfect, but the difference from earlier models shows.

In pure math, the result speaks for itself: o3-mini solves problems GPT-4o consistently failed. And when pushed to reason longer (via the reasoning_effort parameter), the margin grows further.
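The effort knob maps directly onto the API. As a minimal sketch, a request with an explicit reasoning effort looks like the following (build_request is a hypothetical convenience helper of my own, not part of the SDK; the model name and the reasoning_effort parameter itself follow OpenAI’s published Chat Completions interface):

```python
# Sketch: building a Chat Completions request for o3-mini with an
# explicit reasoning effort. build_request is a hypothetical helper;
# "model" and "reasoning_effort" are real API parameters.

def build_request(question: str, effort: str = "high") -> dict:
    """Return kwargs for client.chat.completions.create(**kwargs)."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unsupported reasoning_effort: {effort}")
    return {
        "model": "o3-mini",
        # Higher effort lets the model spend more internal reasoning
        # tokens (and more wall-clock time) before answering.
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": question}],
    }

# Usage (assumes OPENAI_API_KEY is set in the environment):
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       **build_request("Prove that the sum of two odd integers is even.")
#   )
#   print(resp.choices[0].message.content)
```

The helper is deliberately pure so the request shape can be inspected or tested without making a network call.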

Where it still fails

Despite the leap, there are areas where o3 isn’t better than its predecessors, and can even be worse.

In creative text generation, o3-mini is slightly worse than GPT-4o. Prose feels more mechanical, less free. This is probably an effect of more reasoning-focused training, and for any case where the result is narrative text, fiction, or styled writing, GPT-4o remains preferable.

In tasks where the correct answer is ambiguous (recommendations, opinions, subjective analysis), o3-mini sometimes “overthinks” and delivers over-elaborated answers to questions that ask for something more direct. It’s like hiring a structural engineer to fix a wobbly chair: the result is technically correct but disproportionate.

Response time is much higher. A question GPT-4o would answer in 2 seconds can take 15 or 20 seconds in o3-mini. For interactive use cases (chat assistants), this is a real factor. You have to decide whether the quality leap makes up for the wait, and that depends on the specific workload.

And factual hallucinations, though less frequent, still happen. o3-mini invents names, dates, and references with the same shamelessness as its predecessors when it lacks the right information. Better reasoning doesn’t cure this.

The price impact

An important point: o3-mini is available at very competitive prices. On the API, cost per million input and output tokens is reasonable for the quality it offers, especially at the lower reasoning-effort tier. It isn’t as cheap as GPT-4o, but it’s in the same order of magnitude.

For full o3 (not the mini), announced prices are significantly higher, and initial tests suggest that for many tasks, the mini is more than enough. The choice between them will depend heavily on workload, but my intuition is that most real applications will use o3-mini, and only specific extreme-reasoning problems will justify full o3.

What changes for product builders

The most immediate change is in application architecture. If your application has a piece requiring complex reasoning (problem solving, structured analysis, planning), it now makes sense to evaluate o3-mini as the default for that specific piece, instead of GPT-4o. For the rest of the app (user text, short answers, personalization), you probably stay with faster, cheaper models.

This pattern of “different models for different tasks” is what will normalize. It’s no longer optimal to use a single model for everything; serious applications will route each query type to the best-fitting model. Frameworks like LangChain or LlamaIndex are integrating this selection logic.
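As a sketch, the routing described above can start as something as simple as a lookup keyed by task type (the task labels and the model assigned to each are illustrative assumptions on my part, not anything these frameworks prescribe):

```python
# Per-task model routing, sketched as a plain lookup. The task labels
# and model choices here are illustrative assumptions.
REASONING_TASKS = {"refactoring", "debugging", "planning", "math"}

def pick_model(task_type: str) -> str:
    # Multi-step reasoning goes to o3-mini; conversational or stylistic
    # work stays on a faster, cheaper model.
    if task_type in REASONING_TASKS:
        return "o3-mini"
    return "gpt-4o-mini"
```

In a real application the classifier deciding task_type is the hard part; frameworks like LangChain and LlamaIndex ship router components that do this step with an LLM call or with rules.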

For developers who have so far worked only with GPT-4o or Claude, the mindset shift is recognizing that deep reasoning is no longer a prompt-engineering problem. Stop spending tokens guiding the model and start trusting it to reason internally. It’s an adjustment that takes a few days but pays off.

My read

o3 confirms that the bet on models that “think more” works. The leap over o1 is material, not incremental. And the fact that o3-mini is available at reasonable prices means this isn’t a benchmark toy, but a tool usable in production.

The medium-term effect is that apps with complex-reasoning tasks will differentiate quickly: those using reasoning models will solve problems that those not using them will leave half-done. It’s the same pattern we saw when GPT-4 made GPT-3.5 obsolete for certain workloads, but applied now to a different segment of the problem.

If you work on a product with a reasoning piece, my concrete recommendation is to spend two or three days testing o3-mini on that specific piece. Most cases will benefit, and the few that don’t (for latency, cost, or task type) are also useful information. The evaluation cycle is short and the potential value is high.
