Product | May 2026

Introducing Voice Agents 2.0 with sub-600ms E2E latency

How we rebuilt the entire voice pipeline, from WebSocket streaming to CRM webhook orchestration, to bring end-to-end latency under 600ms.

Verbalyze Engineering | 6 min read

The Latency Challenge

For a voice AI agent to feel natural, end-to-end (E2E) latency must remain under 600ms. Beyond that threshold, callers perceive a noticeable delay that breaks conversational flow and increases hang-up rates.

Voice Agents 1.0 ran at roughly 900ms E2E. Version 2.0 consistently measures about 580ms under production load.

What Changed

1. Streaming ASR with Partial Hypotheses

Instead of waiting for full utterance detection, we now emit partial hypotheses every 80ms. The LLM intent classifier begins processing the partial transcript before the speaker finishes, shaving ~150ms from the pipeline.
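To make the idea concrete, here is a minimal Python sketch of classifying on partial transcripts. It is not the production pipeline: the word timings, the keyword classifier, and every function name are hypothetical stand-ins. Only the 80ms emission interval comes from the description above.

```python
from typing import Iterable, Iterator, Optional

PARTIAL_INTERVAL_MS = 80  # emission interval quoted in the post


def partial_hypotheses(words_with_times: Iterable[tuple[int, str]],
                       utterance_end_ms: int) -> Iterator[tuple[int, str]]:
    """Yield (timestamp_ms, transcript_so_far) every PARTIAL_INTERVAL_MS.

    `words_with_times` is a hypothetical stand-in for a real ASR decoder:
    each item is (time in ms at which the word became available, word).
    """
    pending = sorted(words_with_times)
    heard = []
    idx = 0
    ticks = range(PARTIAL_INTERVAL_MS,
                  utterance_end_ms + PARTIAL_INTERVAL_MS,
                  PARTIAL_INTERVAL_MS)
    for now in ticks:
        # Fold in every word the decoder has produced up to this tick.
        while idx < len(pending) and pending[idx][0] <= now:
            heard.append(pending[idx][1])
            idx += 1
        yield now, " ".join(heard)


def classify_intent(transcript: str) -> Optional[str]:
    """Toy keyword classifier (an assumption); returns None when unsure."""
    text = transcript.lower()
    if "cancel" in text:
        return "cancel_subscription"
    if "invoice" in text or "bill" in text:
        return "billing_question"
    return None


def first_confident_intent(partials):
    """Return (timestamp_ms, intent) for the earliest partial that classifies."""
    for ts, transcript in partials:
        intent = classify_intent(transcript)
        if intent is not None:
            return ts, intent
    return None


# Illustrative utterance: the speaker finishes at 600ms.
words = [(60, "I"), (140, "want"), (210, "to"),
         (300, "cancel"), (420, "my"), (560, "plan")]
result = first_confident_intent(partial_hypotheses(words, utterance_end_ms=600))
print(result)
```

In this toy run the intent is available at the 320ms tick, well before the utterance ends. The actual saving in a real system depends on how early the partial transcript becomes unambiguous.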

2. Stateful Context Buffers

We eliminated round-trip context retrieval by pre-loading the caller's account state into an in-memory buffer at call start. Because account data is now read from memory, CRM lookups add no network round trip during inference.
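A per-call context buffer of this kind might look like the following Python sketch. The class names, the `fetch` callable, and the account fields are assumptions made for illustration and do not describe the actual Verbalyze data model.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CallContext:
    """Per-call in-memory buffer holding the caller's account state."""
    caller_id: str
    account: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def preload(cls, caller_id, fetch_account):
        # The only CRM round trip happens here, once, at call start.
        return cls(caller_id=caller_id, account=dict(fetch_account(caller_id)))

    def get(self, key, default=None):
        # Plain dictionary read: no network I/O on the inference path.
        return self.account.get(key, default)


class FakeCRM:
    """Hypothetical CRM client that counts how often it is queried."""

    def __init__(self):
        self.calls = 0

    def fetch(self, caller_id):
        self.calls += 1
        return {"plan": "pro", "open_tickets": 2}


crm = FakeCRM()
ctx = CallContext.preload("caller-123", crm.fetch)

# Several lookups during the call, all served from memory.
plan = ctx.get("plan")
tickets = ctx.get("open_tickets")
print(plan, tickets, crm.calls)
```

The trade-off in a design like this is staleness: state changed in the CRM mid-call is not visible unless the buffer is refreshed.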

3. Parallel TTS Streaming

The TTS engine now streams audio chunks in parallel with LLM token generation. The first audio chunk reaches the caller before the LLM finishes generating the response.
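The overlap between generation and synthesis can be sketched with two `asyncio` tasks joined by a queue. This is an illustration under assumptions, not the production design: the token stream is simulated, `synthesize` is a placeholder that returns fake audio bytes, and splitting phrases at punctuation is one plausible chunking rule among several.

```python
import asyncio


async def llm_tokens(text, delay_s=0.001):
    """Hypothetical LLM stream: yields one word-sized token at a time."""
    for token in text.split():
        await asyncio.sleep(delay_s)
        yield token


def synthesize(phrase):
    """Placeholder for a TTS call; returns fake audio bytes."""
    return phrase.encode("utf-8")


async def produce_phrases(tokens, queue, events):
    """Group tokens into phrases at punctuation and hand them to TTS early."""
    buffer = []
    async for token in tokens:
        buffer.append(token)
        if token[-1] in ",.?!":
            await queue.put(" ".join(buffer))
            buffer = []
    if buffer:
        await queue.put(" ".join(buffer))
    events.append("llm_done")
    await queue.put(None)  # sentinel: no more phrases


async def stream_audio(queue, events, audio_out):
    """Synthesize each phrase as soon as it arrives on the queue."""
    while True:
        phrase = await queue.get()
        if phrase is None:
            break
        audio_out.append(synthesize(phrase))
        events.append("audio_chunk")


async def respond(text):
    queue = asyncio.Queue()
    events, audio_out = [], []
    # Both coroutines run concurrently on the same event loop.
    await asyncio.gather(
        produce_phrases(llm_tokens(text), queue, events),
        stream_audio(queue, events, audio_out),
    )
    return events, audio_out


events, audio = asyncio.run(
    respond("Sure, I can help with that. Your plan renews on Friday.")
)
print(events)
```

In the event log, the first `audio_chunk` appears before `llm_done`, which is the property the paragraph above describes.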

4. WebSocket Frame Optimization

We switched from JSON-encoded binary payloads over WebSocket to raw PCM frames with a compact binary header, cutting per-frame overhead from ~40 bytes to 6 bytes.
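As an illustration of how a 6-byte header can be laid out, the Python sketch below packs a sequence number and a payload length in front of raw PCM. The field choice (32-bit sequence number plus 16-bit length, little-endian) and the audio parameters are assumptions. The post states only the header size, not its contents.

```python
import struct

# Hypothetical 6-byte header: uint32 sequence number + uint16 payload length,
# little-endian with no padding ("<" disables alignment).
HEADER = struct.Struct("<IH")


def encode_frame(seq: int, pcm: bytes) -> bytes:
    """Prefix raw PCM bytes with the compact binary header."""
    return HEADER.pack(seq, len(pcm)) + pcm


def decode_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a frame back into (sequence number, PCM payload)."""
    seq, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return seq, payload


# Assumed audio format: 20ms of 16kHz, 16-bit mono PCM = 320 samples = 640 bytes.
pcm_chunk = bytes(640)
frame = encode_frame(7, pcm_chunk)
seq, payload = decode_frame(frame)
print(HEADER.size, len(frame), seq)
```

With these assumed parameters the header accounts for 6 of the 646 bytes on the wire, and the frame can be sent as a single binary WebSocket message without any text encoding step.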

Results

Stage	v1.0	v2.0
ASR (first word)	150ms	85ms
Intent classification	180ms	90ms
LLM response (first token)	320ms	200ms
TTS (first audio chunk)	250ms	105ms
E2E (perceived)	~900ms	~580ms
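For readers who want the relative improvement per stage, the short Python script below derives it from the figures in the table. It treats each row independently and makes no assumption about how the stage figures combine into the E2E number. The function name is illustrative.

```python
# (v1.0 ms, v2.0 ms) taken directly from the results table.
stages = {
    "ASR (first word)": (150, 85),
    "Intent classification": (180, 90),
    "LLM response (first token)": (320, 200),
    "TTS (first audio chunk)": (250, 105),
    "E2E (perceived)": (900, 580),
}


def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage reduction from before_ms to after_ms, to one decimal place."""
    return round(100 * (before_ms - after_ms) / before_ms, 1)


reductions = {name: reduction_pct(v1, v2) for name, (v1, v2) in stages.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower")
```

By this calculation the largest relative gain is in TTS time to first audio chunk (58.0%), and perceived E2E latency drops by about 35.6%.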
