GPT-4 In Depth: Real Capabilities vs Expectations

Digital interface representing artificial intelligence

In March 2023, OpenAI launched GPT-4 with a presentation that promised “human-level performance on many benchmarks”. Five months later, with thousands of real integrations, we’re in a better position to judge: which capabilities held up, which were oversold, and where gaps remain versus alternatives like Claude 2 and LLaMA 2.

Where GPT-4 Really Excels

After expanding the benchmark to dozens of real-world cases, GPT-4 is consistently better at:

  • Complex chained reasoning. On problems requiring maintaining multiple variables, conditionals, and intermediate steps, GPT-4 hallucinates less and holds coherence better than any other currently available model.
  • Precise technical writing. Generating documentation, paper summaries, step-by-step explanations of complex concepts — especially in domains like programming, quantitative finance, or medicine — GPT-4 produces text requiring few editorial corrections.
  • Medium-complexity code. Not infallible, but on tasks like “refactor this function”, “write tests for this component”, or “explain what this legacy code does”, it clearly beats GitHub Copilot when the IDE suggestion isn’t enough.
  • Following very detailed instructions. A prompt with 15 specific constraints (“respond in JSON with these keys, don’t mention X, limit to 100 words”) is followed much more faithfully than with GPT-3.5.
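Constraint-following matters because it can be checked mechanically downstream. As a rough illustration, a minimal validator for that kind of structured prompt might look like this (the key names and the word limit are hypothetical, stand-ins for whatever your prompt specifies):

```python
import json

# Hypothetical constraints matching a prompt like:
# "Respond in JSON with keys 'summary' and 'tags'; limit the summary to 100 words."
REQUIRED_KEYS = {"summary", "tags"}
MAX_WORDS = 100

def validate_response(raw: str) -> list[str]:
    """Return a list of constraint violations found in a model response."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    summary = data.get("summary", "")
    if isinstance(summary, str) and len(summary.split()) > MAX_WORDS:
        problems.append("summary exceeds the word limit")
    return problems
```

A response that violates any constraint can be rejected and retried; with GPT-3.5, such retries are needed far more often.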

Where It Remains Frustrating

But there are areas where GPT-4, despite the marketing, still systematically fails:

Arithmetic

Surprisingly, GPT-4 still makes trivial arithmetic errors relatively often: asked "What is 2394 × 71?", it returns a wrong result roughly 30% of the time. This improves dramatically with Code Interpreter, which runs actual Python for the calculation.
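The same workaround applies in any integration: have the model produce the expression and delegate the actual computation to real code. A minimal sketch, evaluating simple arithmetic safely rather than trusting the model's mental math:

```python
import ast
import operator

# Map supported AST operators to their Python implementations.
OPS = {ast.Mult: operator.mul, ast.Add: operator.add,
       ast.Sub: operator.sub, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a simple arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(safe_eval("2394 * 71"))  # 169974
```

This is essentially what Code Interpreter does on the model's behalf: the model never computes, it only writes the expression.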

Post-training-cutoff information

The model was trained on data up to September 2021 (or April 2023 for updated versions). Questions about recent events, current library versions, or fresh news yield outdated or fabricated information. The browsing plugin mitigates this, but at the cost of higher latency and lower reliability.

Consistency across conversations

The same question in two different conversations can get significantly different answers. For use cases requiring determinism (audits, reproducible validations), this forces more rigid prompting techniques or dropping temperature to 0 — which sometimes degrades quality.
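One pragmatic mitigation is to pin every sampling parameter explicitly rather than relying on API defaults. A sketch of such a request builder (the payload shape matches the mid-2023 OpenAI chat-completions API; adapt to your SDK version):

```python
def deterministic_params(prompt: str, model: str = "gpt-4") -> dict:
    """Build a chat-completion payload pinned for maximum reproducibility.

    Note: even with temperature=0 the API is not guaranteed to be fully
    deterministic across calls, so audit flows should still log raw outputs.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy-ish decoding, least variance
        "top_p": 1,        # no nucleus truncation
        "n": 1,            # a single completion per request
    }

# With the mid-2023 openai SDK this would be used roughly as:
# response = openai.ChatCompletion.create(**deterministic_params("Audit this record"))
```

Pinning parameters this way narrows the variance but does not eliminate it, which is exactly the quality tradeoff described above.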

Very long contexts

With an 8k-32k token context window (depending on version), GPT-4 stumbles when there is a lot of text. The well-known "lost in the middle" result by Liu et al. shows that models tend to ignore information placed in the middle of a long context. Claude 2, with its 100k-token window, does slightly better, but both models exhibit the problem.
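A common mitigation when stuffing many retrieved passages into one prompt is to reorder them so the most relevant land at the edges of the context, where models attend best, pushing the weakest into the middle. A minimal sketch, assuming passages arrive sorted by relevance, best first:

```python
def reorder_for_long_context(passages: list[str]) -> list[str]:
    """Place the most relevant passages at the start and end of the context.

    Input is assumed sorted best-first: even-indexed items go to the front,
    odd-indexed to the back, so the weakest end up in the middle.
    """
    front, back = [], []
    for i, passage in enumerate(passages):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(reorder_for_long_context(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']
```

The best passage stays first and the second-best ends up last, so neither falls into the low-attention middle of the prompt.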

GPT-4 vs Claude 2

Claude 2, released by Anthropic in July 2023, brings some notable advantages:

  • 100k-token context window. Loads entire books, technical manuals, or long transcripts in a single prompt.
  • More conservative tone. Less prone to exaggeration or invention; when it doesn't know something, it typically says so.
  • Stricter built-in safety. For applications where minimising problematic responses matters, Claude 2 fails in fewer edge cases.

Where GPT-4 wins: complex code, symbolic mathematics, multi-step reasoning on problems with more than 5 interacting entities.

GPT-4 vs LLaMA 2 70B

Compared with LLaMA 2 70B:

  • GPT-4 clearly wins on complex reasoning and code.
  • LLaMA 2 70B is competitive on summarisation, classification, simple Q&A.
  • LLaMA 2 has the absolute edge on privacy, cost at scale, and customisation.

For any task where LLaMA 2 70B gives “acceptable” results, it’s almost always the better choice — the quality delta rarely justifies the cost/privacy tradeoff of GPT-4.
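To make the cost side of that tradeoff concrete, here is a back-of-the-envelope calculator. The per-token figures below are illustrative mid-2023 list prices, not authoritative; verify against each provider's current pricing page, and note that self-hosted LLaMA 2 is priced in GPU-hours rather than per token:

```python
# Illustrative mid-2023 list prices in USD per 1k tokens (assumptions).
PRICES = {
    "gpt-4-8k": {"input": 0.03, "output": 0.06},
    "gpt-3.5":  {"input": 0.0015, "output": 0.002},
    # Self-hosted LLaMA 2 70B: no per-token price; model GPU-hours instead.
}

def monthly_cost(model: str, reqs: int, in_tok: int, out_tok: int) -> float:
    """Estimated monthly API cost for `reqs` requests of given token sizes."""
    price = PRICES[model]
    per_req = (in_tok / 1000) * price["input"] + (out_tok / 1000) * price["output"]
    return reqs * per_req

# 100k requests/month, ~1k tokens in, ~500 tokens out:
print(round(monthly_cost("gpt-4-8k", 100_000, 1_000, 500), 2))  # 6000.0
```

At that volume the GPT-4 bill alone can dwarf the cost of a dedicated GPU node, which is what makes the "acceptable quality" threshold for LLaMA 2 so decisive.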

Domain Evaluation

The only benchmark that matters is yours. A practical process for evaluating GPT-4 vs alternatives:

  1. Select 20-30 representative prompts from your real application, with human-annotated “ideal” answers.
  2. Run each prompt through GPT-4, Claude 2, LLaMA 2 and record responses.
  3. Evaluate blindly (without knowing which model produced what): which came closest to the ideal?
  4. Quantify total cost: per-token price × expected volume + operational overhead.
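The four steps above can be sketched as a tiny harness. The scoring function here is a naive word-overlap stand-in for your blind human review, and the model names and prompts are placeholders; the point is the shape of the loop, not the metric:

```python
import random

def overlap_score(answer: str, ideal: str) -> float:
    """Crude proxy score: fraction of ideal-answer words present in the answer."""
    ideal_words = set(ideal.lower().split())
    if not ideal_words:
        return 0.0
    return len(ideal_words & set(answer.lower().split())) / len(ideal_words)

def blind_eval(outputs: dict[str, list[str]], ideals: list[str],
               seed: int = 0) -> dict[str, float]:
    """Score each model's outputs against the ideals, shuffling model order
    so the judge cannot infer which model produced which answers."""
    rng = random.Random(seed)
    models = list(outputs)
    rng.shuffle(models)  # hide model identity during review
    return {m: sum(overlap_score(a, i) for a, i in zip(outputs[m], ideals))
               / len(ideals)
            for m in models}

# Placeholder outputs from two anonymised models on one prompt:
scores = blind_eval(
    {"model_a": ["paris is the capital"], "model_b": ["rome"]},
    ideals=["Paris is the capital of France"],
)
```

In a real run you would replace `overlap_score` with blinded human judgments (or a rubric), and `outputs` with the recorded responses from GPT-4, Claude 2, and LLaMA 2.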

This process, covered in more detail in "Prompt Engineering as a Mature Discipline", often reveals surprises: sometimes Claude 2 wins where you expected GPT-4, or LLaMA 2 gives sufficient results at a tenth of the cost.

Responsible Use

A dimension worth remembering: GPT-4 generates plausible text even when wrong. For applications with real impact (medical, legal, financial decisions), model output must go through human validation or independent verification systems. The model has no way to know when it’s confidently wrong, and that’s dangerous in unsupervised flows.

Conclusion

GPT-4 is the most capable general-purpose model available in 2023, but “most capable” doesn’t mean “best choice for everything”. Mature teams evaluate by use case, not by model reputation. In many scenarios, Claude 2 or LLaMA 2 offer better value/cost ratios; in others, GPT-4 remains the unsurpassed standard. Team sophistication is measured in knowing which is which.

Follow us on jacar.es for more on LLMs, AI evaluation, and product architecture.
