May 15, 2026

Voice Agent Latency Solved Enough for Production: Benchmarks and Architecture Tradeoffs

AUTHOR

James A. Wondrasek

Voice AI has crossed a threshold. Production systems now routinely hit ~600ms end-to-end response latency, down from 1.5–2 seconds two years ago. That shift changes the decision for engineering teams working out whether to ship voice AI to real customers.

But that 600ms figure plastered across vendor marketing means different things depending on who’s measuring. Some platforms benchmark TTFA (time to first audio from the TTS engine). Others report full STT-LLM-TTS round trips. And real-world P95 figures diverge significantly from the headline P50 numbers. It’s a slippery number — and teams tend to discover that after deployment, not before.

This article breaks down what 600ms actually means on a live phone call, where latency originates in the dominant pipeline, how the leading platforms compare (including the measurement caveats vendor comparisons leave out), and which architectural alternatives are now worth looking at, plus what crossing the production threshold actually requires.

What Does “600ms Latency” Actually Mean for a Phone Call?

600ms works as a practical threshold: below it, conversation feels natural; above 800ms, callers start speaking over the agent at measurably higher rates.

Human back-channel responses average 200–300ms. Production systems cannot hit 300ms; 600ms is the pragmatic floor. Contact centre data shows 40% more hang-ups above 1 second. Above 1.2 seconds, the experience registers as broken.

So the operational target is reliably below 600ms at P90, with nothing above 1.2s at P95. That percentile framing is what vendor benchmarks typically leave out. Hamming AI’s dataset of 4M+ production calls shows real-world P50 latency of 1.4–1.7 seconds — not the sub-600ms platforms claim. P90 lands at 3.3–3.8 seconds. Platforms report P50 under clean-room conditions. Your users experience P90 under real traffic.
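To make the percentile framing concrete, here is a minimal Python sketch of checking a batch of measured call latencies against those targets (the sample values are illustrative):

```python
import numpy as np

# Hypothetical per-call end-to-end latencies in milliseconds,
# measured from caller end-of-turn to first audio byte.
latencies_ms = np.array([480, 520, 610, 550, 1900, 640, 720, 3100, 590, 560])

p50, p90, p95 = np.percentile(latencies_ms, [50, 90, 95])
print(f"P50={p50:.0f}ms  P90={p90:.0f}ms  P95={p95:.0f}ms")

# Operational targets from the discussion above.
if p90 > 600:
    print("P90 target missed: 10% of calls exceed 600ms")
if p95 > 1200:
    print("P95 hard ceiling exceeded: tail calls feel broken")
```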

One caveat worth flagging: Trillet's April 2026 analysis argues callers can't actually distinguish 600ms from 900ms in real conversational contexts. Treat 600ms as a target worth hitting, and treat 1.2 seconds as the hard ceiling. That part isn't contested.

Where Does Latency Come From Inside the STT/LLM/TTS Pipeline?

The dominant production architecture runs three stages in sequence: STT, LLM inference, then TTS. Each adds latency. This pipeline powers roughly 90% of production voice agents today.

STT: Deepgram Nova-3 achieves ~150ms time-to-first-token. AssemblyAI Universal-2 runs 300–600ms with better accuracy. WhisperX self-hosted lands at 380–520ms.

TTS: This is where the biggest shift has happened. Cartesia Sonic achieves 40–95ms TTFA. ElevenLabs Flash v2.5 hits 75ms. Deepgram Aura-2 runs sub-150ms. TTS has gone from a 2–3 second bottleneck to the smallest part of the problem.

LLM inference is the actual bottleneck, accounting for 60–70% of total pipeline latency. Model selection is the single highest-leverage decision you have. Time-to-first-token by model: Gemini 1.5 Flash ~300ms; Claude 3.5 Haiku ~350ms; GPT-4o-mini ~400ms; GPT-4o ~700ms. Switching from GPT-4o to GPT-4o-mini roughly halves the LLM contribution.
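If you want to verify TTFT yourself rather than trust published figures, a minimal sketch using the OpenAI Python SDK's streaming interface looks like this (model names and prompt are placeholders; the same timing approach applies to any streaming API):

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def measure_ttft(model: str, prompt: str) -> float:
    """Return time-to-first-token in milliseconds for a streamed completion."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # The first chunk carrying content marks time-to-first-token.
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return float("nan")

for model in ("gpt-4o", "gpt-4o-mini"):
    print(model, f"{measure_ttft(model, 'Say hello.'):.0f}ms")
```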

Streaming changes the equation. Piping LLM output tokens directly into TTS as they’re generated eliminates the wait for a complete response — saving 300–600ms without changing any individual component speeds.
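A minimal sketch of the pattern, with token_stream standing in for any streamed LLM response and synthesize for a hypothetical call into a streaming TTS engine:

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

def stream_llm_to_tts(token_stream, synthesize):
    """Pipe LLM tokens into TTS as each clause completes, instead of
    waiting for the full response. `synthesize` is a hypothetical
    callable that forwards text to a streaming TTS engine."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush at clause boundaries so TTS can start speaking early.
        if SENTENCE_END.search(buffer):
            synthesize(buffer)
            buffer = ""
    if buffer:  # flush any trailing partial clause
        synthesize(buffer)
```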

VAD (voice activity detection) adds 200–800ms at the front of the pipeline, which benchmarks often leave out entirely. Silence thresholds in the 400–600ms range reduce apparent latency compared to the 800–1,000ms defaults — but at the cost of more false interruptions.
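End-of-turn detection is typically implemented by counting consecutive silent frames. A minimal sketch using the open-source webrtcvad library, with the silence threshold as the tunable tradeoff described above:

```python
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness 0-3
FRAME_MS = 30                   # webrtcvad accepts 10/20/30ms frames
SILENCE_THRESHOLD_MS = 500      # 400-600ms: snappier, more false interruptions
                                # 800-1000ms: safer, adds visible delay

def end_of_turn(frames, sample_rate=16000):
    """Return True once trailing silence exceeds the threshold.
    `frames` is an iterable of raw 16-bit mono PCM frames."""
    silent_ms = 0
    for frame in frames:
        if vad.is_speech(frame, sample_rate):
            silent_ms = 0
        else:
            silent_ms += FRAME_MS
            if silent_ms >= SILENCE_THRESHOLD_MS:
                return True
    return False
```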

Platform Latency Benchmarks: Retell AI, Telnyx, Vapi, Bland AI, and ElevenLabs

These platforms do not benchmark the same thing. Reading them without the measurement caveats will mislead your evaluation.

Retell AI: ~600ms end-to-end. This is the primary source of the 600ms production standard — measured across 200+ production calls. All-in pricing at $0.07/min. Retell’s proprietary turn-taking model differentiates it from VAD-only platforms on barge-in quality. SOC 2 Type II, HIPAA, and GDPR compliant.

Telnyx: sub-200ms round-trip — but this is not end-to-end latency. The sub-200ms figure is audio-layer RTT, not a full STT-LLM-TTS end-to-end measurement. Telnyx co-locates GPU inference with telephony routing inside its own carrier infrastructure: “Most Voice AI platforms sit on top of someone else’s telephony stack. Telnyx runs the AI within our telephony layer.” Structurally different — but not directly comparable to what other platforms are reporting.

Vapi: 500–650ms with engineering investment. Platform fee is $0.05/min; true production cost runs $0.25–0.33/min. Default settings can add 1.5+ seconds to response time. HIPAA compliance requires a $1,000/month add-on.

Bland AI: ~700–900ms, and that is sometimes acceptable. From $0.14/min, supporting up to 20,000 calls/hour. For high-volume outbound where calls are short and scripted, 700–900ms is often perfectly viable. Framing it as simply “slow” misses the use case.

ElevenLabs: 75ms TTFA — for TTS only. That covers the TTS synthesis component, not a full platform. Treat it as a TTS provider benchmark.

Cost comparison: Retell $0.07/min all-in; Bland $0.14+; Vapi true cost $0.25–0.33/min; PolyAI $150K+/year for enterprise SLA.

TTFA vs. End-to-End Latency: Why These Are Not the Same Number

TTFA (Time to First Audio): the delay from when a TTS engine or end-to-end model receives input to first audio output. A component measurement.

End-to-end latency: from when the caller finishes speaking to when the agent begins speaking — including STT, LLM, TTS, network, and VAD detection. A system measurement.

These are not the same number, and vendor marketing routinely exploits the ambiguity. GPT-Realtime-2’s TTFA is 1.12 seconds at minimal reasoning. Engineers comparing that against Retell’s 600ms end-to-end standard are not comparing the same thing.

Three metrics worth instrumenting independently: (1) component TTFA/TTFT per stage; (2) total end-to-end latency from VAD end-of-turn signal to first audio byte; (3) perceived latency, which is shaped by barge-in handling and preamble phrases.
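A minimal instrumentation sketch covering the first two metrics, with stub stage functions standing in for real STT/LLM/TTS calls:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record the wall-clock duration of one pipeline stage in ms."""
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000

# Stubs so the sketch runs; swap in real STT/LLM/TTS client calls.
def run_stt(audio): time.sleep(0.15); return "caller transcript"
def llm_first_token(text): time.sleep(0.35); return "Sure"
def tts_first_chunk(text): time.sleep(0.08); return b"audio"

audio = b"\x00" * 3200
with timed("stt"):
    transcript = run_stt(audio)
with timed("llm_ttft"):
    token = llm_first_token(transcript)
with timed("tts_ttfa"):
    chunk = tts_first_chunk(token)

# Metric (1): per-stage timings. Metric (2): their sum plus VAD and
# network time, measured from end-of-turn to first audio byte.
print(timings, f"sum: {sum(timings.values()):.0f}ms")
```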

GPT-Realtime-2’s preamble — a short filler phrase like “Let me check that for you…” produced while tool calls execute — illustrates the perceived/actual distinction. It’s a UX technique, not an actual latency reduction.

Before accepting any vendor’s latency number, ask: TTFA, RTT, or end-to-end? P50 or P95? With or without streaming? Does tool-call latency count?

The next section takes a deeper look at how GPT-Realtime-2 changes the architecture.

How Does GPT-Realtime-2 Change the Latency Ceiling?

GPT-Realtime-2 (launched May 7, 2026) is an end-to-end speech-to-speech model: audio in, audio out, no separate STT or TTS stages. This removes the compounding latency of the three-stage pipeline — but does not eliminate all latency sources.

End-to-end models achieve 160–400ms versus the 500–650ms pipeline floor. That gap is architectural. But tool calls still require wait time, and the model cannot begin speaking until they return.

The pipeline retains real advantages. Brand voice requirements mean ElevenLabs voice cloning — unavailable in end-to-end models — can only be served by the pipeline. Components can be upgraded independently. Regulated-industry deployments commonly require per-component audit trails and data residency controls that black-box models can’t satisfy. And itemised per-component cost attribution is harder to do when everything is bundled. GPT-Realtime-2 also commits the stack to OpenAI — Gemini Flash Live is Google’s equivalent and equally non-portable.

What’s emerging in production is a hybrid: end-to-end for simple, low-latency exchanges; pipeline fallback for tool-heavy turns or complex reasoning.
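A sketch of what that routing decision might look like; the threshold and agent functions are illustrative assumptions, not any platform's published API:

```python
def speech_to_speech_agent(turn): return f"e2e:{turn}"       # hypothetical
def pipeline_agent(turn): return f"pipeline:{turn}"          # hypothetical

def route_turn(turn, needs_tools, complexity):
    """Hybrid routing policy: end-to-end for simple exchanges,
    pipeline fallback for tool-heavy or complex turns."""
    if needs_tools or complexity > 0.7:
        return pipeline_agent(turn)       # tool calls, complex reasoning
    return speech_to_speech_agent(turn)   # lowest-latency path

print(route_turn("What's my balance?", needs_tools=True, complexity=0.2))
```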

Barge-In Handling: The UX Requirement Latency Benchmarks Miss

A platform with 550ms latency and poor barge-in handling will feel worse to callers than a platform with 650ms and correct interruption handling. Raw latency benchmarks do not capture this.

Barge-in is the ability for a caller to interrupt mid-sentence and have the agent stop, process the interruption, and respond correctly. It’s the single feature most responsible for an agent sounding natural.

VAD-only implementations trigger on any sound above a silence threshold — conflating a full turn, a backchannel (“uh-huh”), a genuine interruption, and a hesitation filler. Turn-taking models examine prosody, semantic content, and pause patterns to distinguish them correctly. Retell AI takes this approach. Simpler platforms rely solely on VAD, producing more false positives and cutoffs.
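A toy sketch of the decision a turn-taking model makes on every overlap; production systems use prosody and semantic models rather than this keyword-plus-duration heuristic, which only sketches the decision space:

```python
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "right", "okay"}

def classify_overlap(partial_transcript, duration_ms):
    """Classify caller audio that overlaps agent speech."""
    text = partial_transcript.strip().lower()
    if not text and duration_ms < 250:
        return "noise"            # ignore, keep speaking
    if text in BACKCHANNELS and duration_ms < 400:
        return "backchannel"      # acknowledgement, keep speaking
    return "interruption"         # stop TTS, process the new input

print(classify_overlap("uh-huh", 300))       # backchannel
print(classify_overlap("wait, stop", 600))   # interruption
```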

The less visible failure mode is context retention after barge-in — does the agent retain what it was about to say, or produce a generic response? Harder to measure. More noticeable to callers.

When the Assembled Pipeline Still Wins

The 200–400ms latency advantage of end-to-end models does not determine the right architecture in every context. Five situations where the pipeline remains the right call:

  1. Specific TTS voice requirements. ElevenLabs voice cloning is unavailable in end-to-end models. Brand voice consistency means pipeline only.

  2. Component-level upgradeability. Swap STT or LLM independently as better options emerge, without rebuilding the full stack.

  3. Regulated industries. FinTech and HealthTech deployments commonly require per-component audit trails and data residency controls that black-box models cannot satisfy.

  4. Cost predictability at scale. Pipeline costs are itemised: ASR ~$0.006/min, LLM $0.02–0.10/min, TTS ~$0.02/min, orchestration $0.05/min. End-to-end pricing bundles everything, making attribution harder at volume (see the cost sketch after this list).

  5. High-volume scripted outbound. Bland AI at 700–900ms is a legitimate production choice for outbound campaigns where calls are short and scripted. The 600ms threshold matters most for inbound conversational use cases.
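To see why itemisation matters at volume, a small sketch using the midpoint of the figures quoted in point 4 (the volume and midpoints are illustrative):

```python
# Itemised pipeline cost per minute, USD; actual rates vary by vendor.
COSTS_PER_MIN = {
    "asr": 0.006,
    "llm": 0.06,             # midpoint of $0.02-$0.10/min
    "tts": 0.02,
    "orchestration": 0.05,
}

per_min = sum(COSTS_PER_MIN.values())
monthly_minutes = 500_000    # hypothetical contact-centre volume

print(f"pipeline: ${per_min:.3f}/min -> ${per_min * monthly_minutes:,.0f}/month")
for stage, cost in COSTS_PER_MIN.items():
    print(f"  {stage}: {cost / per_min:.0%} of spend")
```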

For most teams, managed platforms — Retell AI or Vapi — are the right starting point. The architecture decisions that flow from latency requirements get more consequential at scale.

Voice AI latency is solved enough for production today. The engineering work is knowing which benchmark maps to which measurement, setting percentile targets, and picking an architecture that fits your compliance requirements. For more on what it takes to reach production with voice AI — covering models, compliance, failure modes, and governance — the cluster overview maps what else sits between a latency win and a live deployment.

Frequently Asked Questions

Is 600ms response time fast enough for an AI phone call to feel natural?

Generally yes. Below 600ms, conversation quality is acceptable for most inbound use cases. Above 800ms, callers start interrupting more. Above 1.2 seconds, hang-up rates increase materially. Trillet’s April 2026 analysis argues the practical threshold is closer to 1.2–1.5 seconds — a genuine empirical disagreement worth knowing about. The 1.2 second ceiling, though, is not contested.

What is the difference between TTFA and end-to-end latency for a voice agent?

TTFA measures how quickly a TTS engine produces its first audio output. End-to-end latency measures the full journey — from when the caller stops speaking to when the agent begins, including STT, LLM inference, TTS, and network time. GPT-Realtime-2’s 1.12s TTFA is not comparable to Retell AI’s 600ms end-to-end benchmark. They are measuring different things.

Why does my voice AI agent feel slow even though each component looks fast?

Component latencies compound sequentially. VAD end-of-turn detection adds 200–800ms before the pipeline even starts. P95 tail latency is typically 2–3x higher than P50 benchmarks. And longer system prompts increase LLM TTFT — a common cause of latency creep as teams add context without realising it.

What latency do real production voice agents achieve today?

Hamming AI’s production dataset shows P50 latency of 1.4–1.7 seconds and P90 at 3.3–3.8 seconds, despite benchmarks claiming sub-600ms. The gap reflects tool calls, variable network conditions, and complex prompts. Clean-room benchmarks do not reflect production load.

How does Telnyx achieve sub-200ms latency when other platforms take 500–650ms?

Telnyx co-locates GPU inference with telephony routing inside its own carrier infrastructure, eliminating the 20–50ms per hop of conventional architectures. The sub-200ms figure is audio-layer RTT — a different metric to other platforms’ end-to-end benchmarks.

What is barge-in handling and does it affect latency?

Barge-in is the ability to detect and respond correctly to caller interruptions. It does not reduce measured latency but directly affects perceived conversation quality. VAD-only platforms rely on silence thresholds and produce false positives. Turn-taking models analyse prosody and semantic content to distinguish backchannels from real interruptions — and that distinction is audible.

Should I build my own voice AI stack or use a managed platform?

Managed platforms like Retell AI and Vapi accelerate time-to-production. Self-assembled pipelines reduce cost at scale ($0.05–0.10/min self-hosted vs $0.10–0.20/min managed) but require component-level expertise. For most teams, managed platforms are the right starting point. Self-assembly makes sense when your requirements demonstrably exceed what managed platforms offer.

What is GPT-Realtime-2 and does it replace the STT/LLM/TTS pipeline?

GPT-Realtime-2 processes audio in and out without separate STT or TTS stages, achieving 160–400ms in benchmarks. It does not replace the pipeline for use cases requiring specific TTS voices, component-level compliance controls, or per-component cost attribution.

How do I measure voice agent latency in production?

Measure end-to-end latency from the VAD end-of-turn signal to the first audio byte delivered to the caller. Track P50, P90, and P95 separately — P50 benchmarks mask the tail latency that matters. Instrument each pipeline stage independently so you can attribute regressions to the specific component responsible.

What happens to voice agent latency as my prompts get longer over time?

Longer system prompts increase LLM TTFT — more context to process before the first token is generated. This is a common cause of latency creep that teams miss. Prompt caching partially mitigates it: Anthropic reports 90% cost reduction on cached prefixes, with TTFT improvement for repeated prompt prefixes as a secondary benefit.
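A minimal sketch of opting a long system prompt into Anthropic's prompt caching via the Python SDK; note that caching only engages above a model-specific minimum prompt length, and the model alias and prompt text here are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

LONG_SYSTEM_PROMPT = "You are a support voice agent. ..."  # grows over time

# Mark the stable prefix as cacheable; repeated calls skip
# reprocessing it, which is where the TTFT and cost savings come from.
response = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Caller: where's my order?"}],
)
print(response.content[0].text)
```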
