On May 7, 2026, OpenAI dropped three new voice models at once — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — all running on GPT-5-class reasoning. GPT-Realtime-2 is the headline act. It replaces the assembled STT/LLM/TTS pipeline that most voice agents run on today with a single native speech-to-speech model. Whether you’re shipping a voice agent as part of the broader wave hitting production or deciding whether to upgrade an existing pipeline, this article covers what the architecture change actually means, how to read the latency claims without getting misled, and where the pricing lands when you do the per-call maths.
What Is GPT-Realtime-2 and Why Does the May 7, 2026 Release Matter?
GPT-Realtime-2 is OpenAI’s second-generation native speech-to-speech model. Audio in, audio out, no text conversion in between. It landed alongside GPT-Realtime-Translate and GPT-Realtime-Whisper — the first time OpenAI released a complete voice model family all at once.
The predecessor, GPT-Realtime-1.5, was essentially a speech wrapper around GPT-4o-era reasoning. GPT-Realtime-2 runs on GPT-5’s reasoning stack. That jump shows in the benchmarks: Scale AI’s Audio MultiChallenge APR went from 36.7% to 70.8% — nearly double. On Artificial Analysis’s Big Bench Audio benchmark it hits 96.6%, a new state-of-the-art for native speech-to-speech reasoning.
The context window expanded from 32K to 128K. That’s hours of continuous dialogue history without external memory systems. GPT-Realtime-2 also adds parallel tool calls, five configurable reasoning effort levels, and a preamble feature that solves a specific production failure mode we’ll get to below.
One thing worth clarifying upfront: ChatGPT Voice Mode was not upgraded to GPT-Realtime-2 at launch. This is API-only capability — WebSocket, WebRTC, SIP, the Chat Completions endpoint, and the OpenAI Agents SDK. Don’t draw conclusions about the API from how the consumer product behaves.
What Does “Native Speech-to-Speech” Mean — and Why Does It Change the Latency Equation?
Native speech-to-speech means audio goes in and audio comes out as a single unified operation. No separate transcription stage, no LLM text generation, no TTS synthesis on the back end.
The assembled pipeline that most voice agents run on today chains three discrete components — STT, LLM, and TTS — each one adding delay before the next can start. Rough breakdown: STT around 200ms, LLM inference around 500ms, TTS synthesis around 150ms, plus network overhead. Total somewhere between 800ms and 2 seconds depending on where your components live. Native speech-to-speech collapses all of that into one pass.
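To make the compounding concrete, here’s a back-of-envelope budget in Python using the stage timings above. The network figure is an illustrative assumption, and real numbers vary with where each component is hosted:

```python
# Rough serial latency budget for an assembled STT -> LLM -> TTS pipeline.
# Stage timings are the ballpark figures cited above; each stage must finish
# before the next can start, so the delays add.
STAGE_MS = {
    "stt": 200,      # speech-to-text completes before the LLM can start
    "llm": 500,      # LLM inference completes before TTS can start
    "tts": 150,      # TTS synthesis of the first audible chunk
    "network": 100,  # assumed round-trips between hosted components
}

total_ms = sum(STAGE_MS.values())
print(f"~{total_ms}ms before the caller hears anything")  # ~950ms, inside the 800ms-2s range
```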
There’s another advantage that tends to get overlooked. Native speech-to-speech preserves tonal and prosodic information that text intermediation throws away. The model picks up on tone, pacing, and hesitation. It handles interruptions and turn-taking natively — if a user starts talking mid-response, the model stops and responds appropriately.
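The client still has to react to those turn-taking signals, for example by stopping local playback the moment the server reports user speech. A minimal sketch, assuming the WebSocket event names follow the shape of OpenAI’s existing Realtime API (input_audio_buffer.speech_started, response.cancel) and a hypothetical local playback object; verify the names against the GPT-Realtime-2 docs:

```python
import json

# Hedged sketch: client-side barge-in handling over a Realtime-style WebSocket.
# `ws` is an open async WebSocket connection; `playback` is a hypothetical local
# audio sink with enqueue()/flush() methods.
async def handle_server_event(ws, playback, event: dict) -> None:
    etype = event.get("type")
    if etype == "response.audio.delta":
        playback.enqueue(event["delta"])  # stream model audio out to the speaker
    elif etype == "input_audio_buffer.speech_started":
        # The user barged in mid-response: drop buffered audio immediately and
        # cancel the in-flight response so the model can take the new turn.
        playback.flush()
        await ws.send(json.dumps({"type": "response.cancel"}))
```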
The trade-off is modularity. With native speech-to-speech you can’t swap out your STT or TTS components. There’s no fine-tuning support at launch either. If you need domain-specific vocabulary — medical terminology, legal language, proprietary product names — an assembled pipeline is currently your only option.
What Are the Three Models in the New Voice Family — and Which One Is Right for Your Use Case?
The May 7 release introduced three purpose-built models. They’re not interchangeable. Each one targets a distinct slot in the voice stack.
GPT-Realtime-2 is the full conversational model. Text, audio, and image input; text and audio output; 128K context; parallel tool calls; configurable reasoning effort. Priced at $32/M audio input tokens and $64/M audio output tokens (CloudPrice data at cloudprice.net/models/openai-gpt-realtime-2).
GPT-Realtime-Translate is a speech-to-speech translation model, not a conversational model. It covers 70+ input languages and 13 output languages in real time at $0.034/min. It’s built for live interpretation and multilingual enterprise deployment — not for conversational Q&A.
GPT-Realtime-Whisper is a streaming speech-to-text transcription model at $0.017/min. It fills the STT slot in assembled pipelines and competes directly with Deepgram Nova-3 for that role. OpenAI claims approximately 90% fewer hallucinations than Whisper v2.
GPT-Realtime-Mini and GPT-Realtime-Nano are lower-cost siblings for high-volume workloads where you don’t need full GPT-5-class reasoning depth. Pricing hasn’t been published as of this writing.
The short version: use GPT-Realtime-2 for end-to-end conversation with reasoning. Use GPT-Realtime-Translate for real-time multilingual translation. Use GPT-Realtime-Whisper when you only need a streaming STT component.
What Is TTFA and How Is It Different from End-to-End Latency — and Why Does This Distinction Matter?
Vendor marketing and independent benchmarks often cite different latency numbers for the same model. You need to know which measurement each one is talking about before any of those numbers mean anything.
TTFA — Time to First Audio — is the elapsed time from when the user stops speaking until the model begins producing audio. This is what Artificial Analysis measures. End-to-end latency is the full round-trip: audio in to audio out complete, including response duration. The production standard for end-to-end latency is approximately 600ms — above that, the conversation starts to feel laggy.
GPT-Realtime-2’s TTFA is 1.12 seconds at minimal reasoning effort and 2.33 seconds at high reasoning. MindStudio cites sub-500ms latency; Artificial Analysis documents 1.12–2.33s TTFA. Both can be simultaneously accurate — sub-500ms figures likely represent end-to-end latency for short responses at minimal reasoning. Conflate them in a production evaluation and you’ll draw the wrong conclusions.
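One way to keep the two numbers separate in your own evaluation is to instrument both from the same audio stream. A minimal sketch, assuming `audio_chunks` is an async iterator over response audio (e.g., audio deltas off the WebSocket) and `speech_end` is the monotonic timestamp at which the user stopped speaking:

```python
import time

async def measure_latency(audio_chunks, speech_end: float) -> dict:
    """Report TTFA and end-to-end latency for one model response."""
    first_audio = last_audio = None
    async for _chunk in audio_chunks:
        now = time.monotonic()
        if first_audio is None:
            first_audio = now  # first audio out: the TTFA event
        last_audio = now       # keeps advancing until the stream closes
    return {
        "ttfa_s": first_audio - speech_end,       # what Artificial Analysis reports
        "end_to_end_s": last_audio - speech_end,  # full round trip, incl. response duration
    }
```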
Reasoning effort level is a direct dial for the TTFA-versus-accuracy trade-off. The five levels: minimal (1.12s TTFA), low (the default, recommended for tight latency), medium, high (2.33s TTFA), and xhigh. The xhigh setting hits 48.5% on Audio MultiChallenge but at a 2.33s TTFA. For most customer service workloads, “low” is the right default.
The preamble feature handles the silence gap that TTFA creates in practice. Callers who hear silence for more than a second assume the call dropped or start talking over the agent, corrupting conversation state. During tool-call execution, the model emits short filler phrases — “let me check that,” “one moment” — so the line never goes dead. It’s a native API capability with no custom prompt scaffolding required.
Parallel tool calls extend this further: the model calls multiple back-end APIs simultaneously and narrates what it’s doing while those calls execute. This eliminates compound latency from sequential tool calls. For a deeper look at how GPT-Realtime-2 changes the latency ceiling, the architecture breakdown covers the full mechanism.
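Putting the three knobs together at session setup might look like the sketch below. The session.update envelope follows the shape of OpenAI’s existing Realtime API, but the `reasoning` and `preamble` fields and the model id are assumptions inferred from the capabilities described above, not confirmed field names:

```python
import json

# Hedged sketch: one session.update configuring reasoning effort, preamble,
# and a tool the model may call (in parallel with others) mid-conversation.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",       # assumed model id
        "instructions": "You are a concise customer-service voice agent.",
        "reasoning": {"effort": "low"},  # hypothetical field: minimal|low|medium|high|xhigh
        "preamble": {"enabled": True},   # hypothetical field: filler speech during tool calls
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",  # illustrative back-end call
                "description": "Fetch order status by order id.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
    },
}
# await ws.send(json.dumps(session_update))  # send over the open WebSocket
```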
What Do Independent Benchmarks and Launch Partner Data Say About GPT-Realtime-2 in Production?
GPT-Realtime-2 is arriving at a moment when voice agents are crossing the production threshold across enterprise verticals — which makes benchmark credibility especially important. The evidence comes from two sources that carry different weight, and it’s worth being clear about which is which.
Third-party benchmarks from Scale AI and Artificial Analysis are independently credible. Scale AI APR: GPT-Realtime-2 at 70.8% vs GPT-Realtime-1.5 at 36.7%, measuring instruction retention across complex multi-turn audio tasks. Artificial Analysis Big Bench Audio: 96.6% vs 81.4% on GPT-Realtime-1.5, setting a new state-of-the-art. Artificial Analysis Conversational Dynamics: 96.1%.
Launch partner data comes from OpenAI’s own release materials — treat it as directional, not independently validated. Zillow: 26-point lift in call success rate on an adversarial benchmark, going from 69% to 95%. Glean: 42.9% relative increase in helpfulness for real-time organisational voice queries. Genspark: +26% effective conversation rate and fewer dropped calls. BolnaAI: 12.5% lower word error rate on Hindi, Tamil, and Telugu — the primary non-English production data point available.
Worth noting: those benchmark scores are achieved at xhigh reasoning, which produces a 2.33s TTFA. Production deployments at minimal or low reasoning will see different accuracy outcomes.
How Does GPT-Realtime-Translate Compare to Deepgram Flux for Multilingual Enterprise Deployments?
GPT-Realtime-Translate covers 70+ input languages and 13 output languages at $0.034/min for real-time speech-to-speech translation with no text intermediation.
Deepgram Flux Multilingual reached general availability on April 29, 2026 — one week before the GPT-Realtime-2 release. It covers 10 languages and is STT-only, built on Deepgram’s CSR (Conversational Speech Recognition) approach — designed to understand dialogue flow rather than just transcribe — with model-based turn detection under 400 milliseconds.
These are architecturally distinct products. GPT-Realtime-Translate is a self-contained translation engine. Deepgram Flux Multilingual is a transcription component that needs separate LLM and TTS layers to complete a voice agent. Language count (70 vs 10) is a starting filter, not a final criterion — dialect-level accuracy, accent handling, and code-switching behaviour are the real differentiators, and existing benchmarks don’t surface those.
Deepgram Flux also offers self-hosted deployment with EU endpoints. For regulated environments where data can’t leave a specific geography, that’s a structural advantage that GPT-Realtime-Translate’s cloud-only API simply can’t match.
For the full comparison, see Deepgram Flux and the competing multilingual approach.
What Does GPT-Realtime-2 Actually Cost — and When Does the Pricing Make Sense for Production?
CloudPrice data: $32/M audio input tokens, $64/M audio output tokens, $0.40/M cached audio input, $4/M text input, $24/M text output. Pricing is unchanged from GPT-Realtime-1.5 despite the intelligence upgrade — which is the right way to launch a successor model.
Per-call economics matter more than per-token pricing. A 20-minute customer service call at default settings runs approximately $0.38 in audio input and $1.54 in audio output — around $1.92 in inference costs, before orchestration, telephony, or platform fees. Build a per-call cost model before you commit to anything.
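A minimal version of that model, with token volumes back-derived from the dollar figures above (roughly 12K audio input tokens and 24K audio output tokens for a 20-minute call at these prices); substitute your own measured counts before drawing conclusions:

```python
# Per-call inference cost model using the CloudPrice per-million-token rates.
PRICE_PER_M = {"audio_in": 32.0, "audio_out": 64.0, "cached_audio_in": 0.40}

def call_cost(audio_in_tokens: int, audio_out_tokens: int, cached_in_tokens: int = 0) -> float:
    """Inference cost in dollars for one call, excluding telephony and platform fees."""
    return (
        audio_in_tokens * PRICE_PER_M["audio_in"]
        + audio_out_tokens * PRICE_PER_M["audio_out"]
        + cached_in_tokens * PRICE_PER_M["cached_audio_in"]
    ) / 1_000_000

# Assumed volumes for a 20-minute call, consistent with the $0.38 / $1.54 split above.
cost = call_cost(audio_in_tokens=12_000, audio_out_tokens=24_000)
print(f"~${cost:.2f} per 20-minute call")  # ~$0.38 + $1.54 = $1.92
```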
At roughly $1.92 per 20-minute call, GPT-Realtime-2 works for high-margin verticals — real estate, FinTech, HealthTech — where call values justify the inference cost. It doesn’t work for low-margin, high-volume workloads: food delivery support, retail self-service, ride-hailing.
The total cost of ownership comparison is more nuanced than it first looks. An assembled pipeline carries platform fees, Deepgram costs, ElevenLabs costs, LLM inference, and orchestration overhead — multiple integrations to maintain separately. GPT-Realtime-2 consolidates all of that into one API call. If your assembled pipeline’s total cost already exceeds $1.92 per 20-minute call, GPT-Realtime-2 may reduce costs while improving capability.
When Should You Evaluate GPT-Realtime-2 — and When Does the Assembled Pipeline Still Win?
GPT-Realtime-2 warrants a technical evaluation when:
- Your calls are long (20+ minutes) and need sustained context — the 128K window handles hours of dialogue without external memory
- Your workload is in a high-margin vertical where per-call inference cost is absorbable
- You need low TTFA at minimal or low reasoning effort
- You need live multilingual translation as a first-class feature
- You’re building English-language agents where GPT-5-class reasoning and parallel tool calls are the capability gap
The assembled pipeline still wins when:
- You need fine-tuning for domain-specific vocabulary — this is a hard constraint, not a temporary workaround
- You need component-level control — the ability to swap your STT or TTS layer independently
- Your workload is high-volume and low-margin
- You need self-hosted or on-premise deployment — Deepgram Flux offers self-hosted EU endpoints; GPT-Realtime-2 is cloud-only
- You need maximum latency control at the infrastructure level
Production gaps to account for before committing: Early developer reports flag unexpected language-switching mid-session and barge-in false positives in noisy environments. There’s no publicly documented migration path from assembled pipelines to GPT-Realtime-2 as of this writing — and that absence is itself a planning constraint worth taking seriously.
The 128K context window eliminates within-session vector retrieval. It does not replace cross-session memory — returning callers still need external storage.
For engineers ready to turn model selection into an architecture and governance decision, the deployment framework for production voice agents covers infrastructure, orchestration, observability, and compliance. For a broader view of the voice AI production landscape — covering all seven dimensions from model selection to governance — the cluster overview maps the full picture.
Frequently Asked Questions
What is the difference between GPT-Realtime-2 and GPT-Realtime-1.5?
GPT-Realtime-2 uses GPT-5-class reasoning; GPT-Realtime-1.5 used GPT-4o-era reasoning. Context window went from 32K to 128K. Scale AI APR went from 36.7% to 70.8%. Adds parallel tool calls, five reasoning effort levels, and the preamble feature. Released May 7, 2026.
What does “native speech-to-speech” mean?
Audio in, audio out — single model pass, no intermediate text transcription. Eliminates the compounded latency of a three-stage STT/LLM/TTS pipeline and preserves tonal and prosodic information that text conversion discards.
What is TTFA and why is it different from end-to-end latency?
TTFA (Time to First Audio): elapsed time from when the user stops speaking until audio output begins. End-to-end latency: full round-trip including response duration. GPT-Realtime-2 achieves 1.12s TTFA at minimal reasoning; the production standard for end-to-end latency is approximately 600ms. Don’t conflate them when evaluating vendor claims.
What is the preamble feature in GPT-Realtime-2?
A native API capability that emits short filler phrases (“let me check that”) during tool-call execution, preventing silence gaps that cause callers to hang up or speak over the agent. No custom prompt scaffolding required.
What are the five reasoning effort levels and which should I use?
Minimal (1.12s TTFA), low (default), medium, high (2.33s TTFA), xhigh (Audio MultiChallenge 48.5%). Use “low” for most customer service workloads. Use “xhigh” only when maximum instruction retention justifies the higher latency.
Does GPT-Realtime-2 support fine-tuning?
No — not at launch. If you need domain-specific vocabulary adaptation (medical, legal, proprietary product names), an assembled pipeline with a fine-tunable STT or LLM component is currently your only option.
What is the 128K context window and does it replace a vector database?
128K tokens retains hours of spoken dialogue within a session — no within-session vector retrieval needed. It does not replace cross-session memory. Returning callers still require external storage.
How many languages does GPT-Realtime-Translate support?
70+ input languages, 13 output languages, at $0.034/min. It is a translation model, not a conversational model — it does not replace GPT-Realtime-2 in conversation workflows.
How does Deepgram Flux compare to GPT-Realtime-Translate for multilingual voice agents?
Deepgram Flux: 10 languages, STT-only, fits the transcription slot in an assembled pipeline. GPT-Realtime-Translate: 70+ languages, complete speech-to-speech translation model. They’re not substitutes — the choice depends on whether you need end-to-end translation or STT-component multilingual support with pipeline control and data residency options.
Where can I find GPT-Realtime-2 API pricing?
CloudPrice (cloudprice.net/models/openai-gpt-realtime-2): $32/M audio input tokens, $64/M audio output tokens, $4/M text input, $24/M text output. OpenAI’s Realtime API documentation is the authoritative source.
Can GPT-Realtime-2 replace an assembled STT/LLM/TTS pipeline entirely?
For high-margin verticals with long average call durations, it’s a credible replacement candidate. The assembled pipeline retains advantages where fine-tuning, component-level control, on-premise deployment, or infrastructure-level latency control are required. Evaluate against your specific workload.
Is ChatGPT Voice Mode using GPT-Realtime-2?
No — ChatGPT Voice Mode was not upgraded at launch. GPT-Realtime-2 is available via the OpenAI API (WebSocket, WebRTC, SIP) only. Don’t infer API behaviour from the consumer product.