o1-preview: OpenAI’s Model That Thinks Before Answering
Actualizado: 2026-05-03
OpenAI introduced o1-preview and o1-mini on September 12, 2024 as a new model family with a philosophical difference from GPT-4o: they reason internally before emitting the final answer. It’s not just visible chain-of-thought; it’s a hidden process where the model elaborates, reconsiders, explores paths, and only then responds. Results in mathematics and code are dramatic. For other tasks, the latency and cost trade-off isn’t always worth it.
Key takeaways
- o1 introduces a hidden “thinking” phase where the model elaborates before responding — reasoning tokens are billed but not shown.
- In AIME it scores 83% vs GPT-4o’s 13%; in programming competitions (Codeforces) it sits at 89th percentile vs GPT-4o’s 11th.
- o1-preview costs $15/M input tokens and $60/M output; o1-mini is cheaper but more limited.
- In the preview version it doesn’t support function calling, streaming or multimodal vision.
- For chatbots, summaries, and quick questions, GPT-4o or Claude 3.5 Sonnet remain more efficient.
What Makes It Different
Traditional models (GPT-4o, Claude 3.5 Sonnet) generate response tokens sequentially from the first token. o1 introduces a prior user-invisible “thinking” phase where the model can reformulate the problem, explore approaches, discard strategies, before elaborating the answer. This internal reasoning consumes additional tokens — so-called reasoning tokens — which are billed but not shown to the user.
The approach is partially inspired by techniques like Tree of Thoughts and MCTS but integrated at training level, not prompt level.
Where It Excels
Benchmarks tell a clear story:
- On maths problems like AIME, o1-preview scores 83% vs GPT-4o’s 13%.
- On PhD-level physics problems, o1 reaches 78% vs 57% of human experts.
- In programming competitions like Codeforces, o1-preview sits at 89th percentile vs GPT-4o’s 11th.
For tasks requiring long chains of reasoning — olympiad maths, proofs, complex debugging, deep causal analysis — the qualitative leap is real and not just statistical.
Where It Doesn’t Add Much
For conversational tasks, creative writing, simple summaries, direct factual questions, o1 offers no significant advantage over GPT-4o, while adding latency and cost. A response GPT-4o generates in one second may take ten to twenty in o1 while it “thinks”.
Additionally, o1 has architectural limitations in the preview version:
- Doesn’t support function calling the same way as GPT-4o.
- Doesn’t stream.
- Isn’t multimodal.
The Cost Factor
o1-preview costs $15 per million input tokens and $60 per million output. o1-mini is cheaper — $3 input, $12 output — and is typically the sweet spot for most uses benefiting from reasoning. For comparison, GPT-4o costs $2.50 and $10 respectively.
Real cost is higher than nominal because hidden reasoning tokens consume output billing. An apparently short response may have internally consumed ten times the visible tokens.
When to Use and When Not
The pragmatic rule: for problems where the answer requires several chained reasoning steps and where correctness matters more than speed, o1 is worth it:
- Complex technical research with multiple interdependent variables.
- Legal analysis with multiple premises and exceptions.
- Maths or programming problems with rich structure.
- Strategic planning with chained-consequence decisions.
For chatbots, content generation, summaries, translations, quick questions or function calling, o1 introduces cost and latency without proportional benefit.
Honest Limitations
- Internal reasoning isn’t transparent — OpenAI explicitly hides reasoning tokens from users. This generates legitimate concerns about auditing and debugging.
- Published benchmarks are somewhat cherry-picked. o1 isn’t universally superior; on many everyday tasks it ties or loses to GPT-4o when normalised for cost.
- Open questions about sustainability: if each model generation requires ten times more inferential tokens, carbon footprint and economic cost scale exponentially.
Conclusion
o1 represents an inflection point in how we think about language models. It’s not a universal GPT-4o replacement but a specialised complement for deep reasoning. For problems where correctness matters more than speed, it’s worth every extra cent. For most everyday uses, traditional models remain more efficient. The direction it marks — scaling inference compute for reasoning — is probably the next dominant paradigm. Knowing when to apply it forms part of essential technical repertoire for any engineer integrating LLMs in production.