GPT-4 In Depth: Real Capabilities vs Expectations

Digital interface representing artificial intelligence

In March 2023, OpenAI launched GPT-4 with a presentation that promised “human-level performance on many benchmarks”. Five months later, with thousands of real integrations, we’re in a better position to judge: which capabilities held up, which were oversold, and where gaps remain versus alternatives like Claude 2 and LLaMA 2.

Where GPT-4 Really Excels

After expanding the benchmark to dozens of real-world cases, GPT-4 is consistently better at:

  • Complex chained reasoning. On problems requiring maintaining multiple variables, conditionals, and intermediate steps, GPT-4 hallucinates less and holds coherence better than any other currently available model.
  • Precise technical writing. Generating documentation, paper summaries, step-by-step explanations of complex concepts — especially in domains like programming, quantitative finance, or medicine — GPT-4 produces text requiring few editorial corrections.
  • Medium-complexity code. Not infallible, but on tasks like “refactor this function”, “write tests for this component”, or “explain what this legacy code does”, it clearly beats GitHub Copilot when the IDE suggestion isn’t enough.
  • Following very detailed instructions. A prompt with 15 specific constraints (“respond in JSON with these keys, don’t mention X, limit to 100 words”) is followed much more faithfully than with GPT-3.5.
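Constraint-following matters because it can be checked mechanically downstream. As a rough illustration, a minimal validator for that kind of structured prompt might look like this (the key names and the word limit are hypothetical, stand-ins for whatever your prompt specifies):

```python
import json

# Hypothetical constraints matching a prompt like:
# "Respond in JSON with keys 'summary' and 'tags'; limit the summary to 100 words."
REQUIRED_KEYS = {"summary", "tags"}
MAX_WORDS = 100

def validate_response(raw: str) -> list[str]:
    """Return a list of constraint violations found in a model response."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    summary = data.get("summary", "")
    if isinstance(summary, str) and len(summary.split()) > MAX_WORDS:
        problems.append("summary exceeds the word limit")
    return problems
```

A response that violates any constraint can be rejected and retried; with GPT-3.5, such retries are needed far more often.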

Where It Remains Frustrating

But there are areas where GPT-4, despite the marketing, still systematically fails:

Arithmetic

Surprisingly, GPT-4 still makes trivial arithmetic errors relatively often: asked "What is 2394 × 71?", it returns a wrong result roughly 30% of the time. This improves dramatically with Code Interpreter, which runs actual Python for the calculation.
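The same workaround applies in any integration: have the model produce the expression and delegate the actual computation to real code. A minimal sketch, evaluating simple arithmetic safely rather than trusting the model's mental math:

```python
import ast
import operator

# Map supported AST operators to their Python implementations.
OPS = {ast.Mult: operator.mul, ast.Add: operator.add,
       ast.Sub: operator.sub, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a simple arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(safe_eval("2394 * 71"))  # 169974
```

This is essentially what Code Interpreter does on the model's behalf: the model never computes, it only writes the expression.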

Post-training-cutoff information

The model was trained on data up to September 2021 (or April 2023 for updated versions). Questions about recent events, current library versions, or fresh news yield outdated or fabricated information. The browsing plugin mitigates this, but at the cost of higher latency and lower reliability.

Consistency across conversations

The same question in two different conversations can get significantly different answers. For use cases requiring determinism (audits, reproducible validations), this forces more rigid prompting techniques or dropping temperature to 0 — which sometimes degrades quality.
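One pragmatic mitigation is to pin every sampling parameter explicitly rather than relying on API defaults. A sketch of such a request builder (the payload shape matches the mid-2023 OpenAI chat-completions API; adapt to your SDK version):

```python
def deterministic_params(prompt: str, model: str = "gpt-4") -> dict:
    """Build a chat-completion payload pinned for maximum reproducibility.

    Note: even with temperature=0 the API is not guaranteed to be fully
    deterministic across calls, so audit flows should still log raw outputs.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy-ish decoding, least variance
        "top_p": 1,        # no nucleus truncation
        "n": 1,            # a single completion per request
    }

# With the mid-2023 openai SDK this would be used roughly as:
# response = openai.ChatCompletion.create(**deterministic_params("Audit this record"))
```

Pinning parameters this way narrows the variance but does not eliminate it, which is exactly the quality tradeoff described above.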

Very long contexts

With an 8k-32k token context window (depending on version), GPT-4 stumbles when there is a lot of text. The well-known "lost in the middle" result by Liu et al. shows that models tend to ignore information placed in the middle of a long context. Claude 2, with its 100k-token window, does slightly better, but both models exhibit the problem.
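A common mitigation when stuffing many retrieved passages into one prompt is to reorder them so the most relevant land at the edges of the context, where models attend best, pushing the weakest into the middle. A minimal sketch, assuming passages arrive sorted by relevance, best first:

```python
def reorder_for_long_context(passages: list[str]) -> list[str]:
    """Place the most relevant passages at the start and end of the context.

    Input is assumed sorted best-first: even-indexed items go to the front,
    odd-indexed to the back, so the weakest end up in the middle.
    """
    front, back = [], []
    for i, passage in enumerate(passages):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(reorder_for_long_context(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']
```

The best passage stays first and the second-best ends up last, so neither falls into the low-attention middle of the prompt.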

GPT-4 vs Claude 2

Claude 2, released by Anthropic in July 2023, brings some notable advantages:

  • 100k-token context window. Loads entire books, technical manuals, or long transcripts in a single prompt.
  • More conservative tone. Less prone to exaggeration or invention; when it doesn't know something, it typically says so.
  • Stricter built-in safety. For applications where minimising problematic responses matters, Claude 2 fails in fewer edge cases.

Where GPT-4 wins: complex code, symbolic mathematics, multi-step reasoning on problems with more than 5 interacting entities.

GPT-4 vs LLaMA 2 70B

Compared with LLaMA 2 70B:

  • GPT-4 clearly wins on complex reasoning and code.
  • LLaMA 2 70B is competitive on summarisation, classification, simple Q&A.
  • LLaMA 2 has the absolute edge on privacy, cost at scale, and customisation.

For any task where LLaMA 2 70B gives “acceptable” results, it’s almost always the better choice — the quality delta rarely justifies the cost/privacy tradeoff of GPT-4.
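To make the cost side of that tradeoff concrete, here is a back-of-the-envelope calculator. The per-token figures below are illustrative mid-2023 list prices, not authoritative; verify against each provider's current pricing page, and note that self-hosted LLaMA 2 is priced in GPU-hours rather than per token:

```python
# Illustrative mid-2023 list prices in USD per 1k tokens (assumptions).
PRICES = {
    "gpt-4-8k": {"input": 0.03, "output": 0.06},
    "gpt-3.5":  {"input": 0.0015, "output": 0.002},
    # Self-hosted LLaMA 2 70B: no per-token price; model GPU-hours instead.
}

def monthly_cost(model: str, reqs: int, in_tok: int, out_tok: int) -> float:
    """Estimated monthly API cost for `reqs` requests of given token sizes."""
    price = PRICES[model]
    per_req = (in_tok / 1000) * price["input"] + (out_tok / 1000) * price["output"]
    return reqs * per_req

# 100k requests/month, ~1k tokens in, ~500 tokens out:
print(round(monthly_cost("gpt-4-8k", 100_000, 1_000, 500), 2))  # 6000.0
```

At that volume the GPT-4 bill alone can dwarf the cost of a dedicated GPU node, which is what makes the "acceptable quality" threshold for LLaMA 2 so decisive.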

Domain Evaluation

The only benchmark that matters is yours. A practical process for evaluating GPT-4 vs alternatives:

  1. Select 20-30 representative prompts from your real application, with human-annotated “ideal” answers.
  2. Run each prompt through GPT-4, Claude 2, LLaMA 2 and record responses.
  3. Evaluate blindly (without knowing which model produced what): which came closest to the ideal?
  4. Quantify total cost: per-token price × expected volume + operational overhead.
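The four steps above can be sketched as a tiny harness. The scoring function here is a naive word-overlap stand-in for your blind human review, and the model names and prompts are placeholders; the point is the shape of the loop, not the metric:

```python
import random

def overlap_score(answer: str, ideal: str) -> float:
    """Crude proxy score: fraction of ideal-answer words present in the answer."""
    ideal_words = set(ideal.lower().split())
    if not ideal_words:
        return 0.0
    return len(ideal_words & set(answer.lower().split())) / len(ideal_words)

def blind_eval(outputs: dict[str, list[str]], ideals: list[str],
               seed: int = 0) -> dict[str, float]:
    """Score each model's outputs against the ideals, shuffling model order
    so the judge cannot infer which model produced which answers."""
    rng = random.Random(seed)
    models = list(outputs)
    rng.shuffle(models)  # hide model identity during review
    return {m: sum(overlap_score(a, i) for a, i in zip(outputs[m], ideals))
               / len(ideals)
            for m in models}

# Placeholder outputs from two anonymised models on one prompt:
scores = blind_eval(
    {"model_a": ["paris is the capital"], "model_b": ["rome"]},
    ideals=["Paris is the capital of France"],
)
```

In a real run you would replace `overlap_score` with blinded human judgments (or a rubric), and `outputs` with the recorded responses from GPT-4, Claude 2, and LLaMA 2.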

This process, covered in more detail in "Prompt Engineering as a Mature Discipline", often reveals surprises: sometimes Claude 2 wins where you expected GPT-4, or LLaMA 2 gives sufficient results at a tenth of the cost.

Responsible Use

A dimension worth remembering: GPT-4 generates plausible text even when wrong. For applications with real impact (medical, legal, financial decisions), model output must go through human validation or independent verification systems. The model has no way to know when it’s confidently wrong, and that’s dangerous in unsupervised flows.

Conclusion

GPT-4 is the most capable general-purpose model available in 2023, but “most capable” doesn’t mean “best choice for everything”. Mature teams evaluate by use case, not by model reputation. In many scenarios, Claude 2 or LLaMA 2 offer better value/cost ratios; in others, GPT-4 remains the unsurpassed standard. Team sophistication is measured in knowing which is which.

Follow us on jacar.es for more on LLMs, AI evaluation, and product architecture.
