Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.

Inteligencia Artificial

gpt-4o multimodal openai real-time vision voice

GPT-4o: OpenAI’s Native Multimodality

June 1, 2024 6 min read 110 reads

Table of contents

Key takeaways
What’s different
Unlocked use cases
The Realtime API
Honest limitations
Conclusion

Actualizado: 2026-05-03

GPT-4o (“o” = omni) was presented by OpenAI on May 13, 2024. What’s new isn’t that GPT-4 can process image and audio — that existed via separate APIs — but that a single native model processes text, image, and audio for both input and output. Result: human-conversation latency (~320ms), better multimodal understanding, and 50% lower price than GPT-4 Turbo.

Key takeaways

The three modalities (text, vision, audio) are fused in the same base model: the model hears tone, emotion, and interruptions, not just transcribed words.
~50% cheaper than GPT-4 Turbo: $5/1M input tokens vs $10.
~320ms latency in audio mode: genuinely natural conversation.
HumanEval 90%: best coding result among models available in May 2024.
The Realtime API bidirectional WebSocket opens voice-first applications that previously required complex pipelines.

What’s different

All three modalities are now fused in the same base model. The difference vs the “Whisper → GPT-4 → TTS” pipeline is latency and information preservation: the previous pipeline lost voice tone, emotion, and pauses. GPT-4o processes them directly.

Unlocked use cases

GPT-4o makes practical what previously required a complex pipeline: voice assistants with real conversational latency, accessibility interfaces for people with visual or motor limitations, simultaneous translation in meetings with tone preservation, complex document understanding in a single API call, UI automation from screenshots.

The Realtime API

The most impactful post-launch novelty: a bidirectional WebSocket connection with audio streaming. The client can interrupt, the server detects the interruption and adjusts. Real 320ms latency. Opens patterns previously impractical: phone support bots, voice interfaces for IoT, interactive tutorials with instant feedback.

Honest limitations

Audio bandwidth is ~24 kHz. Tool use + audio in the same call is more complex than text mode. Vision hallucinations occur more frequently with low-quality images. Context window is 128k tokens. Audio cost can accumulate quickly.

Conclusion

GPT-4o is the reset point for evaluating what can now be built with multimodal LLMs. The Realtime API for voice is the real differentiator for voice-first applications. The frontier LLM landscape moves fast, but GPT-4o is a genuine leap in price/quality/modality.

Was this useful?

[Total: 0 · Average: 0]

Post Views: 110

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.

GPT-4o: OpenAI’s Native Multimodality

Key takeaways

What’s different

Unlocked use cases

The Realtime API

Honest limitations

Conclusion

Related posts

“EU AI Act 2026: a technical checklist for Spanish CTOs”

Agent observability with OpenTelemetry GenAI semconv in 2026

How to install and tune oMLX on M5 Max 128 GB

Multi-agent systems: LangGraph vs CrewAI vs Autogen in 2026