OpenAI introduced o1-preview and o1-mini on September 12, 2024 as a new model family with a philosophical difference from GPT-4o: they reason internally before emitting a final answer. This is not visible chain-of-thought prompting; it is a hidden process in which the model elaborates, reconsiders, and explores alternative paths before responding. The results in mathematics and code are dramatic. For other tasks, the latency and cost trade-off isn’t always worth it.
What Makes It Different
Traditional models (GPT-4o, Claude 3.5 Sonnet) generate response tokens sequentially from the first token onwards. o1 introduces a preliminary “thinking” phase, invisible to the user, in which the model can reformulate the problem, explore approaches, and discard strategies before composing the answer. This internal reasoning consumes additional tokens (so-called reasoning tokens), which are billed as output but never shown to the user.
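A minimal sketch of what this looks like in practice, assuming a usage payload shaped like the one the OpenAI API returns for o1 models (where `completion_tokens` includes the hidden reasoning tokens and a `completion_tokens_details` object breaks them out; field names here are based on that assumption):

```python
def split_output_tokens(usage: dict) -> tuple[int, int]:
    """Return (visible_tokens, reasoning_tokens) from a usage dict.

    completion_tokens is the total billed output; the reasoning share
    is hidden from the user but still counted in that total.
    """
    total = usage["completion_tokens"]
    reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
    return total - reasoning, reasoning

# Example: a short visible answer that consumed far more hidden reasoning.
usage = {
    "prompt_tokens": 120,
    "completion_tokens": 3250,
    "completion_tokens_details": {"reasoning_tokens": 3000},
}
visible, hidden = split_output_tokens(usage)
print(visible, hidden)  # only 250 of the 3,250 billed tokens are visible
```

Logging this split per request is the simplest way to see how much of your bill the hidden phase accounts for.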
The approach is partly inspired by techniques such as Tree of Thoughts and Monte Carlo tree search (MCTS), but integrated at the training level rather than via prompting. OpenAI has not detailed the exact architecture; what is known is that a reinforcement-learning phase teaches the model to reason effectively on complex problems.
Where It Excels
The benchmarks tell a clear story. On maths problems from AIME (American Invitational Mathematics Examination), OpenAI reports 83% for o1 versus 13% for GPT-4o. On PhD-level physics problems, o1 reaches 78% versus 57% for human experts. In programming competitions such as Codeforces, o1-preview sits at the 89th percentile versus GPT-4o’s 11th.
For tasks requiring long chains of reasoning (olympiad maths, proofs, complex debugging, deep causal analysis) the qualitative leap is real and not just statistical. Users report that o1 solves problems where GPT-4o would go in circles.
Where It Doesn’t Add Much
For conversational tasks, creative writing, simple summaries, and direct factual questions, o1 offers no significant advantage over GPT-4o while adding latency and cost. A response GPT-4o generates in one second may take ten to twenty seconds with o1 while it “thinks”. For fluid conversation, that breaks the experience.
o1 also has architectural limitations. In preview it does not support function calling the way GPT-4o does, does not stream responses, and is not multimodal. For workflows that depend on those capabilities, it is not a drop-in replacement.
The Cost Factor
o1-preview costs $15 per million input tokens and $60 per million output tokens. o1-mini is cheaper ($3 input, $12 output) and is typically the sweet spot for most uses that benefit from reasoning. For comparison, GPT-4o costs $2.50 and $10 respectively.
Real cost is higher than the nominal price because hidden reasoning tokens are billed at the output rate. An apparently short response may have internally consumed ten times the visible tokens. For high-volume applications, this adds up quickly.
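The arithmetic is easy to encode. A hypothetical estimator using the per-million-token prices quoted above (the reasoning-token count in the example is illustrative):

```python
# USD per 1M tokens: (input, output), from the prices listed in this article.
PRICES = {
    "o1-preview": (15.00, 60.00),
    "o1-mini": (3.00, 12.00),
    "gpt-4o": (2.50, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Estimate one request's cost. Hidden reasoning tokens are billed
    at the output rate, on top of the visible output tokens."""
    inp, out = PRICES[model]
    return (input_tokens * inp + (output_tokens + reasoning_tokens) * out) / 1_000_000

# A 250-token visible answer that burned 3,000 hidden reasoning tokens:
print(round(request_cost("o1-preview", 500, 250, 3000), 4))  # 0.2025
# The same visible answer from GPT-4o, with no hidden phase:
print(round(request_cost("gpt-4o", 500, 250), 5))  # 0.00375
```

Roughly a 50x difference per request in this example, which is why routing matters at volume.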
When to Use and When Not
The pragmatic rule emerging after weeks of use: for problems where the answer requires several chained reasoning steps and where correctness matters more than speed, o1 is worth it. Complex technical research, legal analysis with multiple premises, maths or programming problems with rich structure, strategic planning with interdependent variables.
For chatbots, content generation, summaries, translations, quick questions, function calling, o1 introduces cost and latency without proportional benefit. GPT-4o or Claude 3.5 Sonnet are better choices.
The Industry Effect
o1 marked a conceptual shift. Previously, progress came primarily from scaling parameters and data (GPT-3 → GPT-4). o1 shows scaling inference-time compute — giving the model more tokens to think — also produces qualitative leaps. This opens a new scaling dimension.
Shortly after, Anthropic announced similar capabilities for later Claude versions, and Google is preparing its response in the Gemini family. Reasoning models are already an active competitive battleground; for 2025 we expect multiple options built on similar paradigms.
Limitations and Criticism
Let’s be honest about the problems. The internal reasoning isn’t transparent: OpenAI explicitly hides the reasoning tokens from users. This raises legitimate concerns about auditing and debugging. How do you debug an error when you can’t see the reasoning that produced it?
Published benchmarks are somewhat cherry-picked. o1 isn’t universally superior; on many everyday tasks it ties or loses to GPT-4o when normalised for cost. The “o1 is better at everything” narrative is incorrect.
There are also open questions about the approach’s sustainability. If each model generation requires ten times more inference tokens, the carbon footprint and economic cost grow exponentially. At some point this growth will have to stabilise or be rethought.
Practical Integration
For teams wanting to incorporate o1 into their stack, the pragmatic approach is multi-model routing: use GPT-4o or Claude for most queries and escalate to o1 only when the task justifies it. Tools like LiteLLM facilitate this pattern behind a unified proxy.
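A minimal routing sketch of the idea. The keyword heuristic, the model names, and the function itself are illustrative assumptions (a production router would use a classifier or explicit task tags, and the chosen model name would then be passed to LiteLLM or the provider SDK):

```python
# Prompts matching these hints are assumed to need multi-step reasoning.
REASONING_HINTS = ("prove", "debug", "step by step", "olympiad", "plan")

def pick_model(prompt: str, needs_tools: bool = False,
               needs_streaming: bool = False) -> str:
    """Route a request to a model name based on cheap heuristics."""
    # o1 (in preview) lacks function calling and streaming: gate on those first.
    if needs_tools or needs_streaming:
        return "gpt-4o"
    if any(hint in prompt.lower() for hint in REASONING_HINTS):
        return "o1-mini"  # the reasoning sweet spot for most workloads
    return "gpt-4o"

print(pick_model("Prove that the sequence converges"))            # o1-mini
print(pick_model("Summarise this article", needs_streaming=True)) # gpt-4o
```

The capability gate comes before the reasoning gate: a task that needs streaming cannot go to o1 no matter how hard it is.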
Another useful pattern is “reviewer” mode: o1 reviews responses produced by cheaper models and flags reasoning errors. Per-review cost is low because responses are already structured; quality benefit can be high for critical tasks.
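A sketch of the reviewer pattern's glue code. The prompt wording is an illustrative assumption, not a prescribed template; the resulting string would be sent to o1-mini through the usual chat endpoint:

```python
def build_review_prompt(question: str, draft_answer: str) -> str:
    """Wrap a cheap model's draft in an audit request for a reasoning model."""
    return (
        "You are reviewing another model's answer for reasoning errors.\n"
        f"Question: {question}\n"
        f"Draft answer: {draft_answer}\n"
        "List any flawed steps; reply 'OK' if the reasoning holds."
    )

prompt = build_review_prompt("What is 17 * 23?", "17 * 23 = 391")
# `prompt` now carries both the question and the draft for o1 to audit.
```

Because the draft is short and already structured, the reviewer call consumes few input tokens relative to generating the answer from scratch with o1.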
Conclusion
o1 represents an inflection point in how we think about language models. It’s not a universal GPT-4o replacement but a specialised complement for deep reasoning. For problems where correctness matters more than speed, it’s worth every extra cent. For most everyday uses, traditional models remain more efficient. The direction it marks (scaling inference compute for reasoning) is probably the next dominant paradigm. Knowing when to apply it is part of the essential technical repertoire for any engineer putting LLMs into production.
Follow us on jacar.es for more on reasoning models, frontier LLMs, and multi-model strategies.