Jacar mascot — reading along A laptop whose eyes follow your cursor while you read.
Inteligencia Artificial

GPT-4o: OpenAI’s Native Multimodality

GPT-4o: OpenAI’s Native Multimodality

Actualizado: 2026-05-03

GPT-4o (“o” = omni) was presented by OpenAI on May 13, 2024. What’s new isn’t that GPT-4 can process image and audio — that existed via separate APIs — but that a single native model processes text, image, and audio for both input and output. Result: human-conversation latency (~320ms), better multimodal understanding, and 50% lower price than GPT-4 Turbo.

Key takeaways

  • The three modalities (text, vision, audio) are fused in the same base model: the model hears tone, emotion, and interruptions, not just transcribed words.
  • ~50% cheaper than GPT-4 Turbo: $5/1M input tokens vs $10.
  • ~320ms latency in audio mode: genuinely natural conversation.
  • HumanEval 90%: best coding result among models available in May 2024.
  • The Realtime API bidirectional WebSocket opens voice-first applications that previously required complex pipelines.

What’s different

All three modalities are now fused in the same base model. The difference vs the “Whisper → GPT-4 → TTS” pipeline is latency and information preservation: the previous pipeline lost voice tone, emotion, and pauses. GPT-4o processes them directly.

Unlocked use cases

GPT-4o makes practical what previously required a complex pipeline: voice assistants with real conversational latency, accessibility interfaces for people with visual or motor limitations, simultaneous translation in meetings with tone preservation, complex document understanding in a single API call, UI automation from screenshots.

The Realtime API

The most impactful post-launch novelty: a bidirectional WebSocket connection with audio streaming. The client can interrupt, the server detects the interruption and adjusts. Real 320ms latency. Opens patterns previously impractical: phone support bots, voice interfaces for IoT, interactive tutorials with instant feedback.

Honest limitations

Audio bandwidth is ~24 kHz. Tool use + audio in the same call is more complex than text mode. Vision hallucinations occur more frequently with low-quality images. Context window is 128k tokens. Audio cost can accumulate quickly.

Conclusion

GPT-4o is the reset point for evaluating what can now be built with multimodal LLMs. The Realtime API for voice is the real differentiator for voice-first applications. The frontier LLM landscape moves fast, but GPT-4o is a genuine leap in price/quality/modality.

Was this useful?
[Total: 0 · Average: 0]

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.