Latency Budgeting: Speed Goals for Realtime Voice

Voice agent users perceive delays acutely. A human on the phone expects a response within 1–1.5 seconds; delays beyond 2 seconds feel broken. However, each component of a voice agent (STT, LLM, TTS) introduces latency. Latency budgeting is the engineering discipline of allocating maximum acceptable delays to each component so the total stays within human perception thresholds.

For example, if your budget is 1.5 seconds and STT takes 500 ms, you have 1 second left for LLM + TTS. If the LLM alone takes longer than that, you miss the budget no matter how fast TTS is. This article shows how to measure latency, build a budget, and optimize components to fit.

Understanding End-to-End Latency

End-to-end latency in a voice agent spans from when the user finishes speaking to when they hear the agent respond:

User speaks → VAD silence (20 ms)
→ STT processing (300–500 ms)
→ STT finalization (100 ms)
→ LLM inference (800–1500 ms)
→ TTS synthesis (200–400 ms)
→ Audio playback begins (0 ms if streamed)
────────────────────────────────────
Total: ~1.4–2.5 seconds

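This is the worst-case scenario: fully sequential processing, where each stage waits for the previous one to finish. Modern agents overlap stages (streaming STT into the LLM, and LLM tokens into TTS) to reduce this.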
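The sequential total is just the sum of the per-stage ranges, so it is easy to recompute whenever a stage estimate changes. The following is a minimal sketch using the figures from the breakdown above; the names `STAGES_MS` and `sequential_total_ms` are illustrative and not part of any library.

```python
# Per-stage latency ranges in milliseconds (min, max), copied from the breakdown above.
STAGES_MS = {
    "vad_silence": (20, 20),
    "stt_processing": (300, 500),
    "stt_finalization": (100, 100),
    "llm_inference": (800, 1500),
    "tts_synthesis": (200, 400),
    "playback_start": (0, 0),  # assumes audio playback is streamed
}


def sequential_total_ms(stages):
    """Sum best-case and worst-case latency when every stage runs back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = sequential_total_ms(STAGES_MS)
print(f"Sequential total: {best_ms / 1000:.2f} to {worst_ms / 1000:.2f} s")
```

Swapping in your own measured ranges shows quickly which stage dominates the total and is the best candidate for overlap.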

Building a Latency Budget

Start with your target: 1.5 seconds for voice agents, 2+ seconds for less time-critical applications. Allocate latency to each component:

Component                1.5s Budget   2.0s Budget   Notes
VAD silence threshold    200 ms        200 ms        User finished speaking
STT latency              300 ms        400 ms        Streaming STT finalization
LLM inference            600 ms        900 ms        Time to first token (TTFT)
TTS synthesis            300 ms        400 ms        Time to first audio frame
Network/buffer jitter    100 ms        100 ms        Safety margin
Total                    1.5 s         2.0 s         Perception threshold
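One way to keep a budget like this honest is to encode it as data and check measurements against it. The sketch below is an illustration under assumptions, not a prescribed design: the `LatencyBudget` class, its field names, and the `overruns` helper are hypothetical, and the numbers are the two columns of the table above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Per-component latency allocation in milliseconds (hypothetical helper)."""
    vad_silence_ms: int
    stt_ms: int
    llm_ttft_ms: int
    tts_first_frame_ms: int
    jitter_margin_ms: int

    @property
    def total_ms(self):
        """Sum of all allocations; should equal the end-to-end target."""
        return sum(asdict(self).values())

    def overruns(self, measured_ms):
        """Return {component: ms over budget} for measured components that exceed their allocation."""
        return {
            name: measured_ms[name] - limit
            for name, limit in asdict(self).items()
            if measured_ms.get(name, 0) > limit
        }


# The two budgets from the table above.
BUDGET_1_5S = LatencyBudget(200, 300, 600, 300, 100)
BUDGET_2_0S = LatencyBudget(200, 400, 900, 400, 100)

print(BUDGET_1_5S.total_ms, BUDGET_2_0S.total_ms)
print(BUDGET_1_5S.overruns({"stt_ms": 350, "llm_ttft_ms": 500}))
```

Reporting overruns per component, rather than only the total, makes it clear which stage to optimize first.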

Measuring Latency Components

Instrument your agent to measure each component:

import time
from dataclasses import dataclass


@dataclass
class LatencyMetrics:
    """Records a timestamp (in ms) at each pipeline stage."""
    vad_silence_ms: float = 0       # When VAD declared end of speech
    stt_start_ms: float = 0
    stt_end_ms: float = 0           # Also when the LLM request is sent
    llm_ttft_ms: float = 0          # When the first LLM token arrived
    llm_end_ms: float = 0           # When the last LLM token arrived
    tts_start_ms: float = 0
    tts_first_frame_ms: float = 0
    tts_last_frame_ms: float = 0
    playback_start_ms: float = 0

    @property
    def stt_latency(self):
        return self.stt_end_ms - self.stt_start_ms

    @property
    def llm_latency(self):
        """Total LLM time: request sent (end of STT) → last token."""
        return self.llm_end_ms - self.stt_end_ms

    @property
    def ttft(self):
        """Time to first token (streaming LLM latency)."""
        return self.llm_ttft_ms - self.stt_end_ms

    @property
    def tts_first_frame_latency(self):
        """Time from TTS request to first audio frame."""
        return self.tts_first_frame_ms - self.tts_start_ms

    @property
    def tts_latency(self):
        return self.tts_last_frame_ms - self.tts_start_ms

    @property
    def end_to_end(self):
        """Total latency: user finishes → audio playback."""
        return self.playback_start_ms - self.vad_silence_ms

    def report(self):
        """Print latency report (durations only; raw timestamps are omitted)."""
        print(f"""
        Latency Report:
        ─────────────────────────
        STT latency:           {self.stt_latency:6.1f} ms
        LLM TTFT (streaming):  {self.ttft:6.1f} ms
        LLM total inference:   {self.llm_latency:6.1f} ms
        TTS first frame:       {self.tts_first_frame_latency:6.1f} ms
        TTS latency:           {self.tts_latency:6.1f} ms
        ─────────────────────────
        End-to-end latency:    {self.end_to_end:6.1f} ms (target: 1500 ms)
        """)

class InstrumentedVoiceAgent:
    """Voice agent that measures latency at each stage."""

    def __init__(self, stt_client, llm_client, tts_client, audio_output, budget_ms=1500):
        # The clients are injected; their method names below are assumed interfaces.
        self.stt_client = stt_client
        self.llm_client = llm_client
        self.tts_client = tts_client
        self.audio_output = audio_output
        self.budget_ms = budget_ms
        self.metrics = LatencyMetrics()
        self.timeline = {}  # For detailed timeline analysis

    @staticmethod
    def _now_ms():
        """Monotonic clock in ms; unlike time.time(), it never jumps backwards."""
        return time.perf_counter() * 1000

    async def process_user_utterance(self, audio_bytes):
        """Process user speech and measure latency at each step."""
        self.metrics = LatencyMetrics()  # Fresh metrics for every utterance

        # 1. VAD: this method runs once VAD has declared end of speech, so the
        #    silence threshold itself is not included in end_to_end.
        self.metrics.vad_silence_ms = self._now_ms()

        # 2. STT: Transcribe audio
        self.metrics.stt_start_ms = self._now_ms()
        transcript = await self.stt_client.transcribe_streaming(audio_bytes)
        self.metrics.stt_end_ms = self._now_ms()

        print(f"STT latency: {self.metrics.stt_latency:.0f} ms (budget: 300–400 ms)")

        # 3. LLM: Generate response (streaming)
        response_text = ""
        first_token_received = False

        async for chunk in self.llm_client.stream_completion(
            prompt=transcript,
            max_tokens=100
        ):
            if not first_token_received:
                # TTFT can only be measured once the first token has arrived
                self.metrics.llm_ttft_ms = self._now_ms()
                first_token_received = True
                if self.metrics.ttft > 600:
                    print(f"WARNING: LLM TTFT {self.metrics.ttft:.0f} ms exceeds budget (600 ms)")
            response_text += chunk

        self.metrics.llm_end_ms = self._now_ms()

        # 4. TTS: Synthesize speech
        self.metrics.tts_start_ms = self._now_ms()

        first_frame_received = False
        async for audio_frame in self.tts_client.stream_audio(response_text):
            if not first_frame_received:
                self.metrics.tts_first_frame_ms = self._now_ms()
                first_frame_received = True

            # Queue audio for playback
            await self.audio_output.queue_output_audio(audio_frame)

            # 5. Playback begins once the first frame has been queued
            if self.metrics.playback_start_ms == 0:
                self.metrics.playback_start_ms = self._now_ms()

        self.metrics.tts_last_frame_ms = self._now_ms()

        # Report
        self.metrics.report()

        # Check if we're within budget
        if self.metrics.end_to_end > self.budget_ms:
            print(f"OVER BUDGET: {self.metrics.end_to_end:.0f} ms vs {self.budget_ms} ms")
        else:
            print(f"Within budget: {self.metrics.end_to_end:.0f} ms vs {self.budget_ms} ms")

Optimization Strategies

1. Parallelize STT and LLM

Don't wait for STT to finish before starting LLM inference. Use streaming STT to send partial transcripts to the LLM early:

import asyncio


async def parallel_stt_llm(self, audio_bytes):
    """
    Stream audio to STT while simultaneously passing partial results toward the LLM.
    Reduces LLM latency by preparing the request before the transcript is final.
    """
    transcript_queue = asyncio.Queue()

    async def stt_task():
        """Run STT in background, queue transcript chunks."""
        try:
            async for chunk in self.stt_client.stream(audio_bytes):
                await transcript_queue.put(chunk)
        finally:
            await transcript_queue.put(None)  # EOF signal, sent even if STT fails

    async def llm_task():
        """Accumulate partial transcripts, then generate from the final one."""
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": ""}  # Filled in as chunks arrive
        ]
        partial_transcript = ""

        # Start STT in the background so chunks arrive while we prepare the request
        stt_task_handle = asyncio.create_task(stt_task())

        while True:
            # Await the next chunk. Polling get_nowait() in a tight loop would
            # never yield to the event loop, so the STT task could not run.
            chunk = await transcript_queue.get()
            if chunk is None:
                break
            partial_transcript += chunk
            messages[1]["content"] = partial_transcript

            # Periodically send partial transcript to LLM
            # In practice, you'd buffer and send every N words

        # Wait for STT to complete (re-raises any STT error)
        await stt_task_handle
        # Final LLM generation with complete transcript goes here
        return messages

    return await llm_task()

2. Use Streaming LLM

Streaming LLMs (Claude, GPT-4-turbo) emit tokens one at a time rather than returning a complete response. This lets you start TTS synthesis as soon as the first tokens arrive, reducing perceived latency by 200–400 ms:

import asyncio


async def streaming_llm_to_tts(self, transcript):
    """
    Stream LLM tokens to TTS phrase by phrase without waiting for the full response.
    """
    tts_queue = asyncio.Queue(maxsize=10)  # Bounded buffer between LLM and TTS

    async def llm_to_tts():
        """Forward LLM tokens to the TTS buffer, then signal completion."""
        async for token in self.llm_client.stream_tokens(transcript):
            await tts_queue.put(token)
        await tts_queue.put(None)  # EOF signal so the consumer can exit

    async def speak(text):
        """Synthesize one phrase and queue its audio frames for playback."""
        async for audio_frame in self.tts_client.stream_audio(text):
            await self.audio_output.queue_output_audio(audio_frame)

    async def tts_from_queue():
        """TTS consumes tokens as they arrive."""
        partial_text = ""
        while True:
            token = await tts_queue.get()  # Waits without busy-polling
            if token is None:
                break
            partial_text += token

            # Send to TTS at a phrase boundary (sentence-ending punctuation).
            # Splitting on every space would synthesize one word at a time.
            if partial_text.rstrip().endswith(('.', '!', '?')):
                await speak(partial_text)
                partial_text = ""

        # Flush any trailing text that did not end with punctuation
        if partial_text.strip():
            await speak(partial_text)

    # Run LLM and TTS concurrently
    await asyncio.gather(llm_to_tts(), tts_from_queue())

3. Play TTS Frames Without Pre-Buffering

Begin playing TTS audio as soon as the first frame arrives, rather than waiting for a full second of audio:

async def stream_tts_with_immediate_playback(self, text):
    """
    Play TTS audio frame-by-frame with zero buffering.
    Reduces TTS-to-ear latency by 100–200 ms.
    """
    async for audio_frame in self.tts_client.stream_audio(text):
        # Play immediately; don't accumulate in a buffer
        await self.audio_output.play_immediately(audio_frame)

Key Takeaways

  • Latency budgeting allocates maximum acceptable delays to each component (STT, LLM, TTS) so the total stays under 1.5 seconds (user perception threshold).
  • Measure end-to-end latency from user speech end to audio playback start. Instrument each stage: VAD, STT, LLM TTFT, TTS first frame.
  • Parallelize components: stream partial STT results to the LLM, stream LLM tokens to TTS, and play TTS frames immediately without pre-buffering.
  • Use streaming APIs (STT, LLM, TTS) wherever possible; non-streaming round-trips add roughly 100–500 ms of latency per component.
  • Target 1–1.5 seconds for responsive voice agents; 2+ seconds feels noticeably slow.

Frequently Asked Questions

What if my LLM inference exceeds the latency budget?

Three options: (1) use a faster, smaller model, (2) shorten the prompt and conversation context sent to the LLM, (3) relax the budget and accept 2+ seconds. For instance, GPT-3.5 is faster than GPT-4; Haiku is faster than Opus. Trade accuracy for speed if needed.

How much does streaming reduce latency compared to batch?

For STT, 200–300 ms. For LLM, 300–500 ms (you start TTS synthesis sooner). For TTS, 100–200 ms. Total savings: 600–1000 ms, which often brings you from 2+ seconds to under 1.5 seconds.

Can I cache LLM responses to reduce latency?

Yes, for common user inputs. Implement a simple cache: if the user asks the same question twice, return the cached TTS audio instantly. However, be cautious: cached responses may become outdated if they reference real-time data (weather, prices).
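A minimal sketch of such a cache follows, assuming responses are keyed by a normalized transcript and expire after a time-to-live so that stale answers are not replayed. The `ResponseCache` class, its `ttl_seconds` default, and the normalization rule are illustrative assumptions, not part of any particular framework.

```python
import time


class ResponseCache:
    """Caches synthesized audio keyed by normalized transcript, with a time-to-live."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # Injectable so tests can control time
        self._entries = {}       # key -> (expires_at, audio_bytes)

    @staticmethod
    def _key(transcript):
        # Normalize case and whitespace so trivially different inputs share an entry.
        return " ".join(transcript.lower().split())

    def get(self, transcript):
        """Return cached audio, or None if missing or expired."""
        key = self._key(transcript)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, audio = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # Drop stale entry
            return None
        return audio

    def put(self, transcript, audio):
        """Store audio for this transcript until the TTL elapses."""
        self._entries[self._key(transcript)] = (self._clock() + self._ttl, audio)
```

A short TTL suits answers that reference real-time data; responses that never change (greetings, opening hours) can use a much longer one.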

What if network latency to the cloud is high?

Use on-device models. Open-source STT (Whisper), LLM (Llama), and TTS (Piper) can run locally and eliminate network round trips. Trade cloud accuracy for local speed. Hybrid models (local inference + cloud refinement) are becoming common in 2026.

How do I measure latency in production?

Log timestamps at each stage and compute deltas. Send metrics to a monitoring service (CloudWatch, Datadog). Set up alerts if latency exceeds 2 seconds on 10%+ of requests. Use percentile metrics (p95, p99) rather than averages, since tail latency determines perceived quality.
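As a rough illustration of the percentile and alerting logic, the sketch below computes nearest-rank percentiles over logged end-to-end latencies and applies the "2 seconds on 10%+ of requests" rule. The function names and defaults are assumptions for this example; a monitoring service would normally compute these for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(latencies_ms, threshold_ms=2000, max_slow_fraction=0.10):
    """True if the share of requests slower than threshold_ms reaches max_slow_fraction."""
    if not latencies_ms:
        return False
    slow = sum(1 for ms in latencies_ms if ms > threshold_ms)
    return slow / len(latencies_ms) >= max_slow_fraction


# Example: 20 logged end-to-end latencies, 100 ms apart.
samples = [100 * i for i in range(1, 21)]
print(percentile(samples, 95), percentile(samples, 99))
print(should_alert([1000] * 9 + [2500]))
```

Averages would hide the slow requests here, which is why the alert is based on the fraction of requests over the threshold.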

Further Reading