GPT-4o: OpenAI’s Native Multimodality
Table of contents
Actualizado: 2026-05-03
GPT-4o (“o” = omni) was presented by OpenAI on May 13, 2024. What’s new isn’t that GPT-4 can process image and audio — that existed via separate APIs — but that a single native model processes text, image, and audio for both input and output. Result: human-conversation latency (~320ms), better multimodal understanding, and 50% lower price than GPT-4 Turbo.
Key takeaways
- The three modalities (text, vision, audio) are fused in the same base model: the model hears tone, emotion, and interruptions, not just transcribed words.
- ~50% cheaper than GPT-4 Turbo: $5/1M input tokens vs $10.
- ~320ms latency in audio mode: genuinely natural conversation.
- HumanEval 90%: best coding result among models available in May 2024.
- The Realtime API bidirectional WebSocket opens voice-first applications that previously required complex pipelines.
What’s different
All three modalities are now fused in the same base model. The difference vs the “Whisper → GPT-4 → TTS” pipeline is latency and information preservation: the previous pipeline lost voice tone, emotion, and pauses. GPT-4o processes them directly.
Unlocked use cases
GPT-4o makes practical what previously required a complex pipeline: voice assistants with real conversational latency, accessibility interfaces for people with visual or motor limitations, simultaneous translation in meetings with tone preservation, complex document understanding in a single API call, UI automation from screenshots.
The Realtime API
The most impactful post-launch novelty: a bidirectional WebSocket connection with audio streaming. The client can interrupt, the server detects the interruption and adjusts. Real 320ms latency. Opens patterns previously impractical: phone support bots, voice interfaces for IoT, interactive tutorials with instant feedback.
Honest limitations
Audio bandwidth is ~24 kHz. Tool use + audio in the same call is more complex than text mode. Vision hallucinations occur more frequently with low-quality images. Context window is 128k tokens. Audio cost can accumulate quickly.
Conclusion
GPT-4o is the reset point for evaluating what can now be built with multimodal LLMs. The Realtime API for voice is the real differentiator for voice-first applications. The frontier LLM landscape moves fast, but GPT-4o is a genuine leap in price/quality/modality.