Realtime Voice Agents: Speech API Basics

A realtime voice agent is a conversational system that captures audio from a user, converts it to text via a speech-to-text (STT) API, runs it through an LLM, and speaks the response back using text-to-speech (TTS)—all with latencies under 500 ms per turn. The key difference from traditional voice systems is streaming: both audio input and output flow continuously, overlapping in time, mimicking a natural phone conversation rather than the slower request-response cycle of a chatbot.

In this article, you'll learn the architecture of realtime speech APIs, the main providers offering low-latency streaming, and how to build a minimal Python agent that listens, understands, and responds.

What Are Streaming Speech APIs?

Speech APIs in 2026 come in two flavors: batch and streaming. Batch APIs (like Google Cloud Speech-to-Text's BatchRecognize) wait for a complete audio file, process it offline, and return results seconds later. That suits post-call transcription but not conversation. Streaming APIs open a persistent connection, ingest audio in small chunks (typically 20–40 ms of audio at a 16 kHz sample rate), and return partial transcription results in real time.
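To make the chunk sizes concrete, here is a minimal sketch that splits raw 16-bit mono PCM into fixed-duration frames. The function name `split_pcm_frames` and its defaults are illustrative assumptions, not part of any provider's SDK.

```python
def split_pcm_frames(pcm: bytes, sample_rate: int = 16000,
                     frame_ms: int = 20, bytes_per_sample: int = 2) -> list[bytes]:
    """Split raw mono PCM into frames of frame_ms milliseconds each.

    The last frame may be shorter if the input is not a whole number of frames.
    """
    samples_per_frame = sample_rate * frame_ms // 1000   # 320 at 16 kHz / 20 ms
    frame_bytes = samples_per_frame * bytes_per_sample   # 640 bytes for 16-bit audio
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


one_second = bytes(16000 * 2)          # 1 s of 16-bit silence at 16 kHz
frames = split_pcm_frames(one_second)
print(len(frames), len(frames[0]))     # 50 frames of 640 bytes each
```

A streaming client would send each of these frames as it is captured rather than waiting for the whole buffer.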

Streaming STT works by sending compressed audio (often Opus or µ-law) over WebSocket or gRPC to the provider's endpoint. The service returns interim hypotheses (candidate transcriptions updating every 100–300 ms) and final transcriptions when the system detects silence. This is essential for voice agents because it lets you start preparing a response before the user finishes speaking—reducing perceived latency by 200–400 ms compared to waiting for the final transcript.
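The difference between interim and final results is easiest to see in code. The sketch below assumes each STT message has already been reduced to a `(text, is_final)` pair; real providers use different field names and message shapes, so `TranscriptTracker` is a hypothetical helper, not a provider API.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptTracker:
    """Keeps committed (final) segments plus the latest interim hypothesis."""
    finals: list[str] = field(default_factory=list)
    interim: str = ""

    def update(self, text: str, is_final: bool) -> None:
        if is_final:
            # A final result commits the segment and clears the hypothesis.
            self.finals.append(text)
            self.interim = ""
        else:
            # Each interim result replaces the previous one for this segment.
            self.interim = text

    def current_text(self) -> str:
        """Best current guess at everything the user has said so far."""
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)


tracker = TranscriptTracker()
tracker.update("what's the", is_final=False)
tracker.update("what's the weather", is_final=False)
tracker.update("what's the weather today", is_final=True)
tracker.update("in par", is_final=False)
print(tracker.current_text())  # what's the weather today in par
```

An agent can start preparing a response from `current_text()` while the user is still speaking, then confirm or discard that work once the final result arrives.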

Why Streaming Matters for Voice Agents

Traditional turn-by-turn conversation (user speaks → wait for complete transcript → generate response → speak back) accumulates delay at every stage: roughly 500 ms for STT, 800 ms for the LLM response, and 300 ms for TTS buffering, or about 1.6 seconds before the user hears anything. With streaming, the LLM can begin working from interim transcripts and TTS can begin synthesizing as soon as the first response tokens arrive, so the stages overlap instead of running back to back. This overlap can cut perceived latency by roughly 60%. That matters psychologically: users tend to perceive delays over 1 second as "laggy" and delays of 2 seconds or more as "broken."
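The arithmetic can be checked in a few lines. This sketch uses the stage figures and the 60% estimate quoted above; the helper names are illustrative, and the numbers are the article's examples rather than measurements.

```python
# Per-stage delays for the sequential pipeline, in milliseconds.
SEQUENTIAL_STAGES_MS = {"stt": 500, "llm": 800, "tts": 300}


def total_latency_ms(stages: dict[str, int]) -> int:
    """Sequential stages simply add up."""
    return sum(stages.values())


def after_reduction_ms(total_ms: int, reduction_percent: int) -> int:
    """Latency remaining after cutting total_ms by reduction_percent."""
    return total_ms * (100 - reduction_percent) // 100


sequential = total_latency_ms(SEQUENTIAL_STAGES_MS)
streaming = after_reduction_ms(sequential, 60)
print(sequential, streaming)  # 1600 640
```

Under these assumptions the streaming pipeline lands below the 1-second mark at which users start to notice lag.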

Key Providers and Their Characteristics

| Provider | STT Latency | TTS Latency | Cost Model | Streaming | Multi-Language |
|---|---|---|---|---|---|
| OpenAI Realtime API | 200–300 ms | 100–200 ms | Per-token pricing | Yes (WebSocket) | 26 languages |
| Google Cloud Speech (Streaming) | 300–500 ms | 300–400 ms | Per 15-second block | Yes (gRPC/REST) | 125+ languages |
| Azure Cognitive Services (Speech) | 250–400 ms | 200–350 ms | Per hour quota | Yes (WebSocket) | 90+ languages |
| ElevenLabs (Streaming) | N/A (TTS only) | 100–250 ms | Per character | Yes (WebSocket) | 32 languages |

The OpenAI Realtime API (launched late 2024) is purpose-built for voice agents and offers the lowest round-trip latency of the four. Google Cloud Speech offers commodity pricing and the widest language coverage; Azure sits between the two on latency. ElevenLabs specializes in high-quality TTS and integrates well with third-party STT.

Building a Minimal Streaming Voice Agent

Here's an annotated Python sketch that uses the OpenAI Realtime API to stream audio and respond conversationally. It sends a placeholder audio chunk instead of live microphone input:

import asyncio
import base64
import os
from openai import AsyncOpenAI

# Initialize the OpenAI client used to open the realtime connection
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

async def stream_voice_agent():
    """
    Minimal voice agent: opens a realtime session, sends one audio chunk,
    and prints events as they stream back. Microphone capture is left out
    here; this version sends a placeholder chunk.
    """
    # System prompt: guides the agent's personality and behavior
    system_prompt = (
        "You are a helpful voice assistant. Keep responses concise (under 20 words). "
        "Be conversational and friendly."
    )

    # Open a realtime connection. The model name and the `beta` namespace
    # depend on your SDK version; check the current OpenAI documentation.
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as connection:
        # Session options are set with a session.update event after connecting
        await connection.session.update(session={
            "modalities": ["text", "audio"],  # Enable text and audio
            "instructions": system_prompt,
            "voice": "alloy",  # Preset synthetic voice; available voices vary by model
        })
        print("Voice agent started. Speak now...")

        # In a real implementation, you'd capture audio from the microphone.
        # For this demo, we send placeholder bytes.
        audio_chunk = b"some_raw_audio_bytes_here"  # Placeholder

        # Audio is sent as a base64-encoded string, not raw bytes
        await connection.input_audio_buffer.append(
            audio=base64.b64encode(audio_chunk).decode("ascii")
        )

        # Process server events as they stream back
        async for event in connection:
            if event.type == "response.audio_transcript.delta":
                # Incremental transcript of the agent's spoken reply
                print(f"Agent says (interim): {event.delta}")
            elif event.type == "response.text.delta":
                # Streaming text response (useful for logging or dual-mode output)
                print(f"Agent: {event.delta}", end="", flush=True)
            elif event.type == "input_audio_buffer.speech_started":
                # User began speaking
                print("[User started speaking]")
            elif event.type == "input_audio_buffer.speech_stopped":
                # User paused; STT will finalize the transcript
                print("[User paused/finished]")

asyncio.run(stream_voice_agent())

How This Agent Works

  1. Session initialization: The connect() call opens a persistent WebSocket to OpenAI's realtime endpoint, and the session is configured with the system prompt (instructions), desired voice, and modalities (audio input/output or text-only).
  2. Audio input: Audio captured from the microphone is appended to the input_audio_buffer. The API streams it to the STT backend.
  3. Event loop: As responses arrive asynchronously, your code reacts to events like speech_started (user began talking), transcript deltas (interim text), and text responses (LLM output).
  4. Streaming responses: The agent's reply streams back as audio frames, which you'd play through a speaker in real time.
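Step 4 can be sketched as follows. The assumption here is that reply audio arrives as events of type `response.audio.delta` whose `delta` field holds base64-encoded PCM, which is how the beta Realtime API is understood to deliver it; the event name may differ in other API versions. Events are modeled as plain dictionaries, and `collect_audio` is a hypothetical helper.

```python
import base64


def collect_audio(events: list[dict]) -> bytes:
    """Decode and concatenate the audio carried by audio-delta events."""
    pcm = bytearray()
    for event in events:
        if event.get("type") == "response.audio.delta":
            pcm.extend(base64.b64decode(event["delta"]))
    return bytes(pcm)


# Simulated event stream: two audio deltas with a text event in between.
simulated_events = [
    {"type": "response.audio.delta", "delta": base64.b64encode(b"\x01\x02").decode("ascii")},
    {"type": "response.text.delta", "delta": "Hi"},
    {"type": "response.audio.delta", "delta": base64.b64encode(b"\x03\x04").decode("ascii")},
]
audio = collect_audio(simulated_events)
print(len(audio))  # 4
```

A live agent would write each decoded chunk to the speaker as it arrives instead of collecting the whole reply first.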

Integrating Microphone Input and Speaker Output

The snippet above is conceptual. To make it work with real audio hardware, use pyaudio:

import pyaudio
import threading
from collections import deque

class MicrophoneAudioStream:
    """
    Captures audio from the system microphone in real time.
    Runs in a background thread to avoid blocking the main event loop.
    """
    def __init__(self, sample_rate=16000, chunk_duration_ms=20):
        self.sample_rate = sample_rate
        self.chunk_size = int(sample_rate * chunk_duration_ms / 1000)  # 320 samples at 16 kHz
        # Hold up to 200 ms of audio; the oldest chunk is dropped when full
        self.buffer = deque(maxlen=10)
        self.p = pyaudio.PyAudio()
        self.stream = None
        self.thread = None
        self.running = False

    def start(self):
        """Open the microphone and begin capturing."""
        self.stream = self.p.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        self.running = True
        # Start background thread to read audio continuously
        self.thread = threading.Thread(target=self._read_audio, daemon=True)
        self.thread.start()

    def _read_audio(self):
        """Background thread: reads audio chunks and buffers them."""
        while self.running:
            try:
                audio_chunk = self.stream.read(self.chunk_size, exception_on_overflow=False)
                self.buffer.append(audio_chunk)
            except Exception as e:
                print(f"Microphone read error: {e}")

    def get_audio(self):
        """Retrieve and clear the buffered audio."""
        chunks = []
        # popleft() removes each chunk so the same audio is never returned twice
        while self.buffer:
            chunks.append(self.buffer.popleft())
        return b"".join(chunks)

    def stop(self):
        """Stop capturing and clean up."""
        self.running = False
        # Wait for the reader thread to exit before closing the stream
        if self.thread:
            self.thread.join(timeout=1.0)
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.p.terminate()

# Usage in your voice agent
mic = MicrophoneAudioStream()
mic.start()
# ... feed mic.get_audio() to your realtime API in a loop ...
mic.stop()
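The "feed it in a loop" step might look like the sketch below. It assumes a source callable that returns each captured chunk at most once (and empty bytes when nothing is waiting) and an async `send` coroutine that forwards audio to the API. Both are stand-ins: `pump_audio`, `fake_get`, and `fake_send` are hypothetical names, and the demo uses fake data so it runs without a microphone or network.

```python
import asyncio


async def pump_audio(get_audio, send, interval_s=0.02, max_iterations=None):
    """Poll get_audio() every interval_s seconds and forward non-empty chunks.

    Returns the total number of bytes sent. max_iterations bounds the loop
    for testing; leave it as None to run until cancelled.
    """
    sent = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        chunk = get_audio()
        if chunk:
            await send(chunk)
            sent += len(chunk)
        # Yield to the event loop so incoming events can be processed
        await asyncio.sleep(interval_s)
        iterations += 1
    return sent


async def _demo():
    pending = [b"\x00" * 640, b"", b"\x01" * 640]  # two real chunks, one empty poll
    received = []

    def fake_get():
        return pending.pop(0) if pending else b""

    async def fake_send(chunk):
        received.append(chunk)

    total = await pump_audio(fake_get, fake_send, interval_s=0, max_iterations=5)
    return total, received


total_bytes, received_chunks = asyncio.run(_demo())
print(total_bytes, len(received_chunks))  # 1280 2
```

In a real agent, `get_audio` would be the microphone's read method and `send` would base64-encode the chunk and append it to the input audio buffer.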

Key Takeaways

  • Realtime voice agents use streaming speech APIs to achieve latencies under 500 ms per turn by processing audio in real time rather than batch mode.
  • Streaming STT returns interim transcripts every 100–300 ms, allowing you to start synthesis or response generation before the user finishes speaking.
  • OpenAI's Realtime API, Google Cloud Speech (Streaming), and Azure Cognitive Services are leading options; choose based on latency requirements, cost, and language coverage.
  • A minimal agent requires a system prompt, persistent WebSocket connection, microphone/speaker I/O, and an event loop to handle asynchronous responses.
  • Audio is captured in 20–40 ms chunks at 16 kHz sample rate, compressed (e.g., Opus or µ-law), and streamed continuously to the STT backend.

Frequently Asked Questions

What sample rate should I use for realtime voice?

16 kHz (16,000 samples per second) is the industry standard for voice conversations and offers a good balance between quality and bandwidth. Some APIs support 8 kHz for bandwidth savings on low-bandwidth networks, but 16 kHz is recommended for natural speech. Higher rates like 48 kHz add minimal perceptual benefit for voice agents.

Can I mix streaming STT with a non-streaming LLM?

Yes. You can collect interim STT results, finalize them when the user pauses, send a complete sentence to a traditional (non-streaming) LLM endpoint, and then stream the TTS response back to the user. This trades some latency for simpler LLM integration. However, streaming LLMs (like Claude or GPT-4-turbo with streaming enabled) perform better because you can start speaking as soon as the first tokens arrive.
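One common way to start speaking early with a streaming LLM is to cut the token stream into sentences and hand each one to TTS as soon as it is complete. The sketch below shows that idea with a simple punctuation rule; `sentences_from_tokens` is an illustrative helper, and the rule will mis-split abbreviations such as "e.g.".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Sure", ". ", "It is ", "sunny", " today. ", "Anything", " else?"]
spoken = list(sentences_from_tokens(tokens))
print(spoken)  # ['Sure.', 'It is sunny today.', 'Anything else?']
```

Each yielded sentence can be sent to the TTS service while the LLM is still generating the next one.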

How much bandwidth does realtime voice use?

At 16 kHz, mono, 16-bit audio, uncompressed raw audio consumes about 256 kbps. With Opus compression (the standard for VoIP), you'll use 20–40 kbps. Over a typical mobile connection, this is negligible. However, sending audio every 20 ms means your API must be fast—any latency in network round trips directly impacts conversation quality.
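The raw bitrate figure follows directly from the audio format, as this short sketch shows. The function name is illustrative, and the Opus range is the one quoted above.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


raw_kbps = pcm_bitrate_kbps(16000, 16)   # 16 kHz, 16-bit, mono
opus_low_kbps, opus_high_kbps = 20, 40   # typical Opus range for speech

print(raw_kbps)                    # 256.0
print(raw_kbps / opus_high_kbps)   # 6.4  (savings factor at 40 kbps)
print(raw_kbps / opus_low_kbps)    # 12.8 (savings factor at 20 kbps)
```

So Opus reduces bandwidth by roughly a factor of 6 to 13 compared with raw PCM at this format.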

What if my STT provider doesn't offer streaming?

If you are forced to use a batch-only API, collect audio in 1–2 second windows, send each window for transcription, and respond once the result comes back. This adds 1–2 seconds of latency per turn, which makes the conversation feel slow. It is better to switch to a streaming provider, or to keep the batch API only as a fallback and accept that the experience will be worse than with a dedicated streaming API.

How do I handle network interruptions?

Voice agents over the internet are vulnerable to packet loss and jitter. Implement audio buffering (typically 100–200 ms of audio held locally) so brief network glitches don't cause audio dropouts. If the connection drops, gracefully reconnect and ask the user to repeat their last message. For critical applications (emergency services, medical calls), use dedicated carrier networks (PSTN) rather than the internet.
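Graceful reconnection is often implemented with exponential backoff, so that a struggling network is not hammered with immediate retries. The sketch below is one way to do it; `connect_with_backoff` and the flaky `connect` stand-in are hypothetical, and the delays are illustrative defaults rather than recommended values.

```python
import asyncio


async def connect_with_backoff(connect, max_attempts=5, base_delay_s=0.5, max_delay_s=8.0):
    """Call connect() until it succeeds, doubling the wait after each failure.

    Re-raises the last ConnectionError if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # 0.5 s, 1 s, 2 s, ... capped at max_delay_s
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            await asyncio.sleep(delay)


async def _demo():
    attempts = 0

    async def flaky_connect():
        # Stand-in for opening the realtime connection: fails twice, then works.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("network glitch")
        return "connected"

    result = await connect_with_backoff(flaky_connect, base_delay_s=0)
    return result, attempts


result, attempts_used = asyncio.run(_demo())
print(result, attempts_used)  # connected 3
```

After a successful reconnect, the agent can prompt the user to repeat their last message, since audio sent during the outage is lost.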

Further Reading