Introducing Voice Agents 2.0 with sub-600ms E2E latency
How we rebuilt the entire voice pipeline, from WebSocket streaming to CRM webhook orchestration, to bring end-to-end latency under 600ms.
The Latency Challenge
For a voice AI agent to feel natural, end-to-end (E2E) latency must remain under 600ms. Beyond that threshold, callers perceive a noticeable delay that breaks conversational flow and increases hang-up rates.
Voice Agents 1.0 ran at ~900ms E2E. Version 2.0 consistently runs at ~580ms under production load.
What Changed
1. Streaming ASR with Partial Hypothesis
Instead of waiting for end-of-utterance detection, the ASR now emits partial hypotheses every 80ms. The LLM intent classifier starts processing the partial transcript before the speaker finishes, which shaves ~150ms off the pipeline.
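To make the idea concrete, here is a minimal Python sketch of classifying on partials. It is an illustration, not our production code: `Partial`, `classify_intent`, and `early_intent` are hypothetical names, and a keyword match stands in for the LLM intent classifier. Only the 80ms emission interval comes from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

PARTIAL_INTERVAL_MS = 80  # emission interval described above


@dataclass
class Partial:
    text: str       # transcript accumulated so far
    is_final: bool  # True once the utterance has ended


def classify_intent(text: str) -> Optional[str]:
    """Hypothetical stand-in for the LLM intent classifier (keyword match)."""
    keywords = {"refund": "billing.refund", "cancel": "account.cancel"}
    for word, intent in keywords.items():
        if word in text.lower():
            return intent
    return None


def early_intent(partials: Iterable[Partial]) -> Tuple[Optional[str], int]:
    """Classify every partial; stop at the first one that yields an intent."""
    elapsed_ms = 0
    for partial in partials:
        elapsed_ms += PARTIAL_INTERVAL_MS
        intent = classify_intent(partial.text)
        if intent is not None or partial.is_final:
            return intent, elapsed_ms
    return None, elapsed_ms


stream = [
    Partial("I would", False),
    Partial("I would like a", False),
    Partial("I would like a refund", False),
    Partial("I would like a refund for my last order", True),
]
intent, elapsed_ms = early_intent(stream)
print(intent, elapsed_ms)  # billing.refund 240
```

In this toy stream the intent is available after the third partial, one emission interval before the final transcript arrives. A real classifier would also need a confidence threshold so that an early guess can be revised when later partials contradict it.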
2. Stateful Context Buffers
We eliminated round-trip context retrieval by pre-loading the caller's account state into an in-memory buffer at call start. Because the data is already in memory, CRM lookups no longer add network latency during inference.
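A rough sketch of the pattern, assuming a per-call buffer keyed by call ID. `CallContextBuffer`, `fake_crm_lookup`, and the account fields are hypothetical and only illustrate the shape of the approach: one CRM fetch at call start, then plain in-memory reads.

```python
from typing import Callable, Dict


class CallContextBuffer:
    """Holds account state fetched once at call start, keyed by call ID."""

    def __init__(self, fetch_account: Callable[[str], dict]) -> None:
        self._fetch_account = fetch_account
        self._by_call: Dict[str, dict] = {}

    def on_call_start(self, call_id: str, caller_id: str) -> None:
        # The only CRM round trip, made before the first utterance.
        self._by_call[call_id] = self._fetch_account(caller_id)

    def get(self, call_id: str) -> dict:
        # Plain dict read: no network I/O on the inference path.
        return self._by_call[call_id]

    def on_call_end(self, call_id: str) -> None:
        # Drop the state so buffers do not accumulate across calls.
        self._by_call.pop(call_id, None)


crm_calls = []


def fake_crm_lookup(caller_id: str) -> dict:
    """Hypothetical CRM client; records each call so round trips can be counted."""
    crm_calls.append(caller_id)
    return {"caller_id": caller_id, "plan": "pro", "open_tickets": 2}


buffer = CallContextBuffer(fake_crm_lookup)
buffer.on_call_start("call-1", "+15550100")
for _ in range(3):  # three inference turns, all served from memory
    state = buffer.get("call-1")
print(state["plan"], len(crm_calls))  # pro 1
buffer.on_call_end("call-1")
```

The trade-off is staleness: anything that changes in the CRM mid-call is not reflected unless the buffer is refreshed, for example from a webhook.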
3. Parallel TTS Streaming
The TTS engine now streams audio chunks in parallel with LLM token generation. The first audio chunk reaches the caller before the LLM finishes generating the response.
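The overlap can be sketched with two coroutines and a queue. This is a simplified model under stated assumptions: `llm_tokens` and `synthesize` are hypothetical stand-ins with simulated delays, and each token is synthesized on its own, whereas a real TTS engine would more likely buffer tokens into phrases first.

```python
import asyncio
from typing import AsyncIterator, List


async def llm_tokens() -> AsyncIterator[str]:
    """Hypothetical LLM stream; the sleep simulates per-token generation time."""
    for token in ["Your ", "refund ", "is ", "on ", "its ", "way."]:
        await asyncio.sleep(0.01)
        yield token


async def synthesize(text: str) -> bytes:
    """Hypothetical TTS call; returns fake audio bytes for the given text."""
    await asyncio.sleep(0.005)
    return text.encode("utf-8")


async def run_pipeline(events: List[str]) -> List[bytes]:
    queue: asyncio.Queue = asyncio.Queue()
    audio: List[bytes] = []

    async def producer() -> None:
        # Push tokens to the TTS side as soon as they are generated.
        async for token in llm_tokens():
            await queue.put(token)
        events.append("llm_done")
        await queue.put(None)  # sentinel: no more tokens

    async def consumer() -> None:
        # Synthesize while the LLM is still generating.
        while True:
            token = await queue.get()
            if token is None:
                break
            chunk = await synthesize(token)
            if not audio:
                events.append("first_audio")
            audio.append(chunk)

    await asyncio.gather(producer(), consumer())
    return audio


events: List[str] = []
audio = asyncio.run(run_pipeline(events))
print(events)
```

With these simulated delays, `first_audio` is recorded before `llm_done`, which is the property described above: the caller starts hearing audio while the rest of the response is still being generated.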
4. WebSocket Frame Optimization
We switched from JSON-encoded audio over WebSocket to raw PCM frames with a compact binary header, reducing per-frame overhead from ~40 bytes to 6 bytes.
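For illustration, one way to lay out a 6-byte header in Python is shown below. The field choice (frame type, flags, sequence number) and the 16 kHz, 16-bit mono audio format are assumptions made for this sketch; only the 6-byte size comes from the description above.

```python
import struct

# Hypothetical layout: 1-byte frame type, 1-byte flags, 4-byte sequence number.
# Big-endian with no padding, so the header is exactly 6 bytes.
HEADER = struct.Struct(">BBI")
FRAME_TYPE_PCM = 0x01


def encode_frame(seq: int, pcm: bytes, flags: int = 0) -> bytes:
    """Prefix raw PCM samples with the compact binary header."""
    return HEADER.pack(FRAME_TYPE_PCM, flags, seq) + pcm


def decode_frame(frame: bytes):
    """Split a frame back into (frame_type, flags, seq, pcm)."""
    frame_type, flags, seq = HEADER.unpack_from(frame)
    return frame_type, flags, seq, frame[HEADER.size:]


# 20 ms of silence at an assumed 16 kHz, 16-bit mono: 320 samples * 2 bytes.
pcm = bytes(640)
frame = encode_frame(seq=7, pcm=pcm)
print(HEADER.size, len(frame))  # 6 646
```

Each frame would be sent as a single binary WebSocket message. The WebSocket layer already delimits messages, so this layout carries no payload length field.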
Results
| Stage | v1.0 | v2.0 |
|---|---|---|
| ASR (first word) | 150ms | 85ms |
| Intent classification | 180ms | 90ms |
| LLM response (first token) | 320ms | 200ms |
| TTS (first audio chunk) | 250ms | 105ms |
| **E2E (perceived)** | **~900ms** | **~580ms** |