GPT-Realtime-2 is a speech-to-speech voice model released by OpenAI on May 8, 2026, the second generation of the GPT-Realtime line and the first marketed as a production-grade release on the Realtime API. The model accepts audio, text, and image inputs and produces audio output, supports a 128K-token context window, and brings GPT-5-class reasoning into a low-latency conversational surface. As of release, it leads the Artificial Analysis Big Bench Audio benchmark at 96.6 percent, tied with Google's Gemini 3.1 Flash Live Preview High and roughly 13 percentage points above the prior highest result.
At a glance
- Lab: OpenAI.
- Released: May 8, 2026 (generally available on the Realtime API at launch).
- Modality: Native speech-to-speech with audio, text, and image input and audio output.
- Open weights: No. Closed weights, served exclusively via OpenAI's Realtime API.
- Context window: 128,000 tokens (up from 32,000 in GPT-Realtime-1.5).
- Latency: 1.12 seconds time-to-first-audio at minimal reasoning effort; 2.33 seconds at high reasoning effort.
- Pricing: $1.15 per hour of audio input; $4.61 per hour of audio output. Pricing held constant against the prior model generation.
- Distribution: Realtime API for developers; ChatGPT voice mode integration announced as "pending."
Origins
The GPT-Realtime line began as an experimental October 2024 release of speech-to-speech endpoints alongside the broader GPT-4o family. The 1.0 generation was framed as a developer beta, the 1.5 generation released in mid-2025 added improved interruption handling and a higher-quality voice palette, and the 2.0 release in May 2026 was the first to be marketed as production-grade. The transition from beta to production paralleled the surge of enterprise-developer interest in real-time voice applications (customer support, sales agents, dictation pipelines, live translation) that became visible through 2024 and 2025.
The 2.0 release came packaged with two companion models. GPT-Realtime-Translate is a live-dubbing model designed for cross-language voice translation in conversational and broadcast contexts; the launch demo paired with Vimeo to dub a live video stream into Spanish in real time. GPT-Realtime-Whisper is a successor speech-recognition model, the first major Whisper-family release since Whisper Large v3 in November 2023. The three models share the GPT-Realtime-2 backbone and are positioned by OpenAI as a complete suite for real-time voice applications.
Sam Altman framed the release around a behavioural-shift thesis: that users are more willing to dump complex multi-turn context into voice channels than into text channels, and that GPT-5-class reasoning at conversational latency unlocks use cases that prior generations of voice assistants could not serve. Greg Brockman noted in the launch material that real-time voice-to-voice translation has been a long-anticipated capability inside OpenAI since the founding period.
Capabilities
The 2.0 generation's headline capability additions over the 1.5 line are concentrated in conversational realism and reasoning controllability rather than in voice quality itself, which was already strong in 1.5.
Preambles let the model emit short bridging phrases ("let me check that," "one moment") before substantive responses, which closes the dead-air gap that prior real-time models had during tool-call latency. Parallel tool calls run with audible transparency: the model says "checking your calendar" while the calendar API call is in flight, and "and looking at your email" if a second tool call runs concurrently. Both features narrow the perceived-latency gap between a model agent and a human assistant who is also looking things up.
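The preamble-and-parallel-tool-call behaviour described above is driven by session configuration rather than a dedicated API switch, as far as the launch material indicates. A minimal sketch of such a configuration follows; the instruction wording and both tool schemas (`check_calendar`, `search_email`) are illustrative assumptions, and only the general `session.update` event shape follows the established Realtime API pattern.

```python
import json

# Hypothetical session configuration for a Realtime-style voice agent.
# Instruction text and tool schemas are illustrative assumptions, not
# documented parameters of GPT-Realtime-2.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Before any tool call, say a short preamble such as "
            "'let me check that' so the caller never hears dead air. "
            "When two tools run in parallel, narrate both, e.g. "
            "'checking your calendar... and looking at your email'."
        ),
        "tools": [
            {
                "type": "function",
                "name": "check_calendar",  # hypothetical tool
                "description": "Look up the caller's calendar for a date range.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "start": {"type": "string"},
                        "end": {"type": "string"},
                    },
                    "required": ["start", "end"],
                },
            },
            {
                "type": "function",
                "name": "search_email",  # hypothetical tool
                "description": "Search the caller's inbox for a query string.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        ],
    },
}

# Serialised as it would be sent over the session's WebSocket.
payload = json.dumps(session_update)
```

Defining both tools in one session is what allows the model to issue the two calls concurrently while narrating each, rather than sequencing them with silent gaps.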
Interruption recovery is improved, both at the conversation level (handling user interruptions gracefully) and at the failure-mode level (handling failed tool calls or ambiguous inputs without breaking conversational flow). The model also includes domain-tuning improvements for specialised terminology (healthcare, finance, legal jargon, proper nouns) that prior real-time models tended to mangle.
Two of the larger behavioural improvements are on the controllability side. Tone and delivery are adjustable through prompt instructions: callers can request "calm and empathetic" or "upbeat and energetic" delivery and the model adapts within the requested envelope. Reasoning effort is a five-level dial (minimal, low, medium, high, xhigh) that trades latency for response quality; the default is low. The reasoning dial inherits its semantics from the GPT-5 family's reasoning-effort parameter and produces a measurable latency-and-quality trade-off across the range.
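The two controls above can be sketched as a small session-builder. The `reasoning_effort` field name is an assumption borrowed from the GPT-5 family's parameter, and the tone string is just a prompt instruction; the five effort levels and the "low" default are as quoted in this article.

```python
# The five effort levels quoted in the article, lowest latency first.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def make_session(effort: str = "low",
                 tone: str = "calm and empathetic") -> dict:
    """Build a hypothetical session config with tone and effort controls.

    `reasoning_effort` is an assumed field name mirroring the GPT-5
    family's reasoning-effort parameter; tone is plain prompt text.
    """
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "type": "session.update",
        "session": {
            "instructions": f"Speak in a {tone} tone.",
            "reasoning_effort": effort,
        },
    }

# Defaults match the article: effort "low", adjustable per session.
session = make_session()
```

Raising the dial per session (rather than per model) is what lets a single deployment serve both quick confirmations at "minimal" and harder multi-step queries at "high".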
The image-input modality is new to the GPT-Realtime line. The model accepts images alongside the audio and text inputs, which extends the surface area to visual question-answering in a conversational voice context (a customer-support agent looking at a photograph the caller uploaded, for example).
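The customer-support scenario above implies attaching a caller-uploaded photo to the running conversation. A sketch of what that event might look like follows; the item shape mirrors the Realtime API's general `conversation.item.create` pattern, but every field name here should be read as an assumption rather than documented reference for GPT-Realtime-2.

```python
import base64

# Hypothetical: attach a caller-uploaded photo to the conversation so
# the voice agent can answer questions about it. Field names are
# assumptions modelled on the Realtime API's event conventions.
def image_item(png_bytes: bytes) -> dict:
    data_url = ("data:image/png;base64,"
                + base64.b64encode(png_bytes).decode("ascii"))
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_image", "image_url": data_url}],
        },
    }

# Placeholder bytes for illustration, not a valid PNG.
event = image_item(b"\x89PNG")
```

Once the item is in the conversation, subsequent audio turns can reference the image, which is what makes the visual-question-answering flow conversational rather than a separate upload-and-query step.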
Benchmarks and standing
Two benchmarks anchor the release narrative.
Artificial Analysis Big Bench Audio, the standard audio-modality benchmark suite, places GPT-Realtime-2 at 96.6 percent, tied with Gemini 3.1 Flash Live Preview High. The 15.2-point improvement over the prior GPT-Realtime score is the largest single-release jump in the benchmark's history. The Conversational Dynamics (Full Duplex) subset, which measures interrupt-handling and turn-taking quality, scores 96.1 percent even on the minimal-reasoning variant.
Scale AI Audio MultiChallenge S2S, a more recent benchmark focused on instruction retention across turns, shows the largest comparative gain: GPT-Realtime-2 holds 70.8 percent average-per-response instruction retention, against 36.7 percent for GPT-Realtime-1.5. The benchmark is harder than Big Bench Audio because it requires the model to retain instructions delivered earlier in the conversation and execute on them later, which has been a weak point of prior real-time voice models.
Benchmark leadership in the audio modality is a point-in-time claim. The tie with Gemini 3.1 Flash Live Preview suggests that, as of May 2026, the frontier-grade real-time voice category has two viable products, and that the gap to the next tier (ElevenLabs, Sesame, Kyutai's Moshi) remains meaningful at the conversational-instruction-following level.
Access and pricing
GPT-Realtime-2 is available on OpenAI's Realtime API at launch. Pricing is held constant against the prior generation: $1.15 per hour of audio input and $4.61 per hour of audio output. The pricing is roughly four times the audio-output cost of the GPT-4o-mini Realtime tier, and is competitive with ElevenLabs's conversational-AI pricing tier but more expensive than open-weight or self-hosted realtime stacks.
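At the quoted rates, per-call cost is straightforward to estimate. A minimal sketch follows; the 12-minute call length and the 60/40 caller-to-model talk split are illustrative assumptions, while the hourly rates are the figures quoted above.

```python
# Quoted launch rates, in dollars per hour of audio.
INPUT_PER_HOUR = 1.15   # caller speech (audio input)
OUTPUT_PER_HOUR = 4.61  # model speech (audio output)

def call_cost(minutes: float, caller_share: float = 0.6) -> float:
    """Estimate audio cost of a call where the caller speaks
    `caller_share` of the time and the model the remainder."""
    hours = minutes / 60.0
    cost = (hours * caller_share * INPUT_PER_HOUR
            + hours * (1.0 - caller_share) * OUTPUT_PER_HOUR)
    return round(cost, 4)

# A 12-minute support call with a 60/40 caller/model split:
# input  0.12 h * $1.15 = $0.138
# output 0.08 h * $4.61 = $0.3688
cost = call_cost(12)  # about $0.51 per call
```

Because output audio costs roughly four times input audio, the model's share of talk time dominates the bill, which is worth keeping in mind when designing agents that speak at length.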
The model is available to all paying API customers without a separate access program. ChatGPT voice mode upgrades to the new model are listed as "pending" with no committed timeline as of the May 2026 announcement.
Comparison
Gemini 3.1 Flash Live Preview is the closest direct competitor, tying on Big Bench Audio at 96.6 percent. The Gemini line ships with broader integration into Google's product surfaces; the API-tier pricing comparison favours OpenAI on input cost and Google on output cost, depending on the specific tier. Both models are closed-weight and API-only.
xAI's Grok Voice, available through the xAI consumer products, is positioned in the same conversational-voice category but has not posted comparable benchmark results as of release; its differentiation is on persona and integration with the broader xAI product line rather than on voice-quality leadership.
ElevenLabs's Eleven v3 is the dominant text-to-speech and conversational-AI provider on the open API market but is not a fully end-to-end speech-to-speech model in the same architectural sense: it pairs a text language model with a separate TTS layer, an architecture with a measurable latency floor that GPT-Realtime-2's end-to-end speech-to-speech design sits below.
Kyutai's Moshi is the closest open-weight competitor (full-duplex speech-to-speech, French-origin research) but operates at a smaller scale and has not pushed for benchmark leadership at the same level. Sesame, a newer voice-AI insurgent, has been compared at the consumer-product level but has not posted comparable API-tier benchmark numbers.
Outlook
Three open questions matter over the next six to eighteen months.
The first is whether OpenAI maintains the Big Bench Audio lead against Google. The two models are tied on a single benchmark snapshot. The next refresh of either model line will likely break the tie in one direction or the other, and the direction matters for the developer-mindshare narrative in the real-time-voice category.
The second is the ChatGPT integration timeline. The "pending" framing on consumer voice mode is unusual for OpenAI launches, where API release and consumer-product integration typically happen in close synchrony. Whether the gap is engineering-driven (the Realtime API is hard to scale to ChatGPT's user volume) or product-strategy-driven (consumer voice is treated as a different product than developer Realtime) will shape how aggressively OpenAI pursues the consumer-voice market against competitors like Grok Voice and Pi.
The third is whether the production-grade designation produces a step change in enterprise deployment. Real-time voice has been one of the most commonly cited "almost there" categories in enterprise AI procurement, with companies running pilots through 2024 and 2025 but holding back on production rollouts due to latency, reliability, and instruction-retention concerns. GPT-Realtime-2's combination of better instruction-following (Scale AI MultiChallenge) and lower interaction friction (preambles, parallel tool calls) addresses the most-cited concerns directly. The 2026-Q3 enterprise-procurement data will be the test of whether the technical improvements translate to actual deployment volume.
Sources
- GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper announcement. AI News coverage of the May 8, 2026 launch, with benchmark numbers, pricing, and partner integration details.
- OpenAI Realtime API documentation. API reference for the speech-to-speech endpoint.
- Artificial Analysis Big Bench Audio leaderboard. Independent benchmark for audio-modality model quality.
- Scale AI Audio MultiChallenge S2S. Independent benchmark for instruction-retention in real-time voice conversations.
- Companion essay: The diaspora map for the talent-flow patterns that shape OpenAI's research-leadership composition during this release cycle.
- Companion profile: GPT-5.5 for the underlying reasoning-model family that GPT-Realtime-2 inherits its reasoning effort dial from.