
Real-Time Speech to Text: Streaming ASR

Real-time speech-to-text transcription requires a fundamentally different architecture than batch processing. While batch Whisper API calls accept a complete audio file and return a full transcript, streaming ASR systems emit partial hypotheses as audio arrives, keeping latency under 500 ms for live captions, voice commands, and interactive voice bots. Streaming systems trade some accuracy for latency: they often use smaller models, process audio in fixed-size chunks, and must decode with limited future context, leaning on a language model to fill in what the acoustic model has not yet heard. Production streaming ASR typically uses Google Cloud Speech-to-Text, Azure Speech Services, or self-hosted Conformer models with WebSocket backends to keep latency predictable and costs manageable.

Streaming Architecture: Chunks, Buffers, and Partial Results

Streaming ASR works by buffering short audio segments (100–200 ms, typically 1,600–3,200 samples at 16 kHz) and feeding them to a model that runs continuously. Unlike batch models that wait for complete utterances, streaming models use restricted context windows (the current frame plus a small lookback of previous frames) to produce intermediate results. These partial hypotheses are revised as more audio arrives; some systems add a second-pass rescoring step before committing a final result.

# Conceptual streaming architecture
class StreamingASRPipeline:
    SAMPLE_RATE = 16000    # Hz, mono
    BYTES_PER_SAMPLE = 2   # 16-bit PCM
    FRAME_SAMPLES = 160    # 10 ms at 16 kHz

    def __init__(self, model_name, context_ms=200):
        """
        model_name: 'google', 'azure', or 'conformer'
        context_ms: How many milliseconds of audio the model can look back
        """
        self.model_name = model_name
        self.context_samples = int(self.SAMPLE_RATE * context_ms / 1000)  # Convert ms to samples
        self.audio_buffer = bytearray()  # Raw PCM bytes, not samples
        self.final_transcript = ""
        self.partial_transcript = ""

    def add_audio_chunk(self, audio_chunk):
        """
        audio_chunk: bytes of raw 16-bit PCM audio (mono, 16 kHz)
        Returns one result per complete 10 ms frame (an empty list if none is ready).
        """
        self.audio_buffer.extend(audio_chunk)
        frame_bytes = self.FRAME_SAMPLES * self.BYTES_PER_SAMPLE  # 160 samples = 320 bytes
        results = []

        # Process every complete frame in the buffer (160 samples = 10 ms)
        while len(self.audio_buffer) >= frame_bytes:
            frame = bytes(self.audio_buffer[:frame_bytes])
            del self.audio_buffer[:frame_bytes]

            # Send frame to streaming ASR model
            result = self._send_to_model(frame)

            # Results include partial and final hypotheses
            if result.get("is_final"):
                self.final_transcript += result["text"] + " "
                self.partial_transcript = ""
            else:
                self.partial_transcript = result["text"]

            results.append(result)

        return results

    def _send_to_model(self, frame):
        """Stub: In production, send to Google/Azure/local model via gRPC or WebSocket."""
        return {"text": "partial result", "is_final": False}

The key insight: streaming models process 10–20 ms frames faster than real time. If 10 ms of audio takes about 5 ms to process, the model keeps up with the microphone and still has headroom; actual figures depend on the model and hardware. A microphone captures audio at 16,000 samples/second, so a 10 ms frame is 160 samples (320 bytes of 16-bit PCM). The model produces partial results every 20–100 ms and commits to final results only after silence or explicit boundary markers.
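The frame arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names (`samples_per_frame`, `bytes_per_frame`, `real_time_factor`) are illustrative and not part of any ASR library, and the 5 ms processing time is the example figure from the text, not a measurement.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz mono, as assumed throughout this article
BYTES_PER_SAMPLE = 2     # 16-bit PCM


def samples_per_frame(frame_ms: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in a frame of the given duration."""
    return int(sample_rate_hz * frame_ms / 1000)


def bytes_per_frame(frame_ms: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Size in bytes of one frame of 16-bit PCM audio."""
    return samples_per_frame(frame_ms, sample_rate_hz) * BYTES_PER_SAMPLE


def real_time_factor(processing_ms: float, audio_ms: float) -> float:
    """Processing time divided by audio duration; below 1.0 the model keeps up."""
    return processing_ms / audio_ms


print(samples_per_frame(10))    # 160
print(bytes_per_frame(10))      # 320
print(real_time_factor(5, 10))  # 0.5
```

A real-time factor of 0.5 means half of each frame interval is left over for network and serialization overhead.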

Google Cloud Speech-to-Text Streaming

Google's streaming API is a widely used option for production systems. It supports 125+ languages, and with proper configuration it offers low latency (100–300 ms). Word error rates of roughly 5–10% are typical on accented or noisy audio, though results vary with domain and recording quality:

from google.cloud import speech_v1

def google_streaming_transcribe(audio_stream_generator):
    """
    Transcribe audio from a generator (e.g., microphone or WebSocket).
    Yields partial and final transcripts in real time.

    audio_stream_generator: yields chunks of 16-bit PCM audio (16 kHz, mono)
    """
    client = speech_v1.SpeechClient()

    # Configure the streaming request
    config = speech_v1.RecognitionConfig(
        encoding=speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        max_alternatives=1,
        enable_automatic_punctuation=True,
        model="latest_long",  # Or 'default' for lower latency
        use_enhanced=False,   # Set True for enhanced accuracy (higher cost)
    )

    streaming_config = speech_v1.StreamingRecognitionConfig(
        config=config,
        interim_results=True,   # Emit partial results
        single_utterance=False, # Don't stop on silence
    )

    # Create a request generator (audio only; the helper below sends the config first)
    def request_generator():
        for content in audio_stream_generator:
            yield speech_v1.StreamingRecognizeRequest(audio_content=content)

    # Call the streaming API
    responses = client.streaming_recognize(streaming_config, request_generator())

    final_transcript = ""

    for response in responses:
        if not response.results:
            continue

        # Get the first (best) result
        result = response.results[0]
        if not result.alternatives:
            continue
        transcript = result.alternatives[0].transcript
        is_final = result.is_final

        if is_final:
            final_transcript += transcript + " "
            print(f"FINAL: {transcript}")
        else:
            print(f"INTERIM: {transcript}")

        yield {
            "transcript": transcript,
            "is_final": is_final,
            # Confidence is only populated for final results; interim values are 0.0
            "confidence": result.alternatives[0].confidence,
        }

    # In a generator, the return value is exposed as StopIteration.value
    return final_transcript

# Usage with microphone input (requires pyaudio)
# import pyaudio
#
# CHUNK_FRAMES = 3200  # 200 ms at 16 kHz = 6400 bytes of 16-bit audio
#
# def microphone_stream():
#     mic = pyaudio.PyAudio()
#     stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000,
#                       input=True, frames_per_buffer=CHUNK_FRAMES)
#     try:
#         while True:
#             # read() takes a frame count and blocks until that many frames arrive
#             yield stream.read(CHUNK_FRAMES)
#     finally:
#         stream.stop_stream()
#         stream.close()
#         mic.terminate()
#
# for result in google_streaming_transcribe(microphone_stream()):
#     print(result)

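Google bills streaming recognition in 15-second increments. At $0.004 per increment (240 increments per hour), streaming costs $0.96 per hour, or $7.68 for an 8-hour call. Rates depend on the model and data-logging settings and change over time, so confirm them against the current pricing page. The API handles silence detection and automatic punctuation; switching languages within a call is not automatic by default and depends on listing alternative language codes in the recognition config.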
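Increment-based billing is easy to get wrong by hand, so here is a small sketch of the arithmetic. It assumes each stream is rounded up to whole 15-second increments, and `EXAMPLE_RATE` is a placeholder chosen so the hourly total matches the $0.96 figure above; it is not a quoted price. The function names are illustrative.

```python
import math

INCREMENT_S = 15  # billing granularity described in the text


def billed_increments(duration_s: float) -> int:
    """Round a stream's duration up to whole billing increments (assumed rule)."""
    return math.ceil(duration_s / INCREMENT_S)


def streaming_cost(duration_s: float, rate_per_increment: float) -> float:
    """Cost of one stream at a given price per increment."""
    return billed_increments(duration_s) * rate_per_increment


EXAMPLE_RATE = 0.004  # placeholder USD per increment; substitute the current price

print(f"1 hour:  ${streaming_cost(3600, EXAMPLE_RATE):.2f}")
print(f"8 hours: ${streaming_cost(8 * 3600, EXAMPLE_RATE):.2f}")
print(f"4-second utterance bills as {billed_increments(4)} increment")
```

Under the rounding assumption, a 4-second voice command is billed like a 15-second one, which is why many short utterances cost more per second of speech than one long stream.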

WebSocket-Based Real-Time Transcription

For web applications, WebSocket provides bidirectional audio streaming:

import asyncio
import json
import struct
import websockets
from google.cloud import speech_v1

async def websocket_asr_server(websocket, path=None):
    """
    WebSocket server: receives audio chunks from client, streams transcripts back.
    (Older websockets releases pass `path`; newer ones pass only the connection.)
    """
    # The async client lets audio upload and transcript download run concurrently
    client = speech_v1.SpeechAsyncClient()

    config = speech_v1.RecognitionConfig(
        encoding=speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        max_alternatives=1,
        enable_automatic_punctuation=True,
    )

    streaming_config = speech_v1.StreamingRecognitionConfig(
        config=config,
        interim_results=True,
    )

    audio_chunks = asyncio.Queue()

    async def request_stream():
        """Send audio chunks to the API."""
        # First request must contain the config
        yield speech_v1.StreamingRecognizeRequest(streaming_config=streaming_config)

        # Subsequent requests contain audio
        while True:
            chunk = await audio_chunks.get()
            if chunk is None:  # Sentinel: end of stream
                break
            yield speech_v1.StreamingRecognizeRequest(audio_content=chunk)

    async def receive_audio():
        """Parse incoming audio chunks from the client and queue them."""
        try:
            async for message in websocket:
                data = json.loads(message)
                if data.get("type") == "audio":
                    # The client sends Float32 samples in [-1, 1]; convert to 16-bit PCM
                    samples = [int(max(-1.0, min(1.0, s)) * 32767) for s in data["audio"]]
                    await audio_chunks.put(struct.pack(f"<{len(samples)}h", *samples))
                elif data.get("type") == "end":
                    break
        finally:
            await audio_chunks.put(None)  # Always close the request stream

    async def send_transcripts():
        """Stream responses back to the client as they arrive."""
        responses = await client.streaming_recognize(requests=request_stream())
        async for response in responses:
            if response.results and response.results[0].alternatives:
                result = response.results[0]
                await websocket.send(json.dumps({
                    "type": "transcript",
                    "text": result.alternatives[0].transcript,
                    "is_final": result.is_final,
                }))

    try:
        await asyncio.gather(receive_audio(), send_transcripts())
    except websockets.exceptions.ConnectionClosed:
        print("Client disconnected")

async def main():
    async with websockets.serve(websocket_asr_server, "0.0.0.0", 8765):
        print("WebSocket server listening on ws://0.0.0.0:8765")
        await asyncio.Future()  # Run until cancelled

# Run the server
if __name__ == "__main__":
    asyncio.run(main())

Client-side JavaScript:

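async function startStreaming() {
  const socket = new WebSocket('ws://localhost:8765');
  // Request a 16 kHz context so the audio matches the server's sample_rate_hertz
  // (browser support for custom sample rates varies)
  const AudioContextClass = window.AudioContext || window.webkitAudioContext;
  const audioContext = new AudioContextClass({ sampleRate: 16000 });
  const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioContext.createMediaStreamSource(mediaStream);
  // ScriptProcessorNode is deprecated in favor of AudioWorklet but keeps the example short.
  // 2048 samples at 16 kHz is 128 ms of audio per callback.
  const processor = audioContext.createScriptProcessor(2048, 1, 1);

  source.connect(processor);
  processor.connect(audioContext.destination);

  socket.onerror = (event) => {
    console.error('WebSocket error', event);
  };

  processor.onaudioprocess = (event) => {
    if (socket.readyState !== WebSocket.OPEN) return; // Drop audio until connected
    const audioData = event.inputBuffer.getChannelData(0); // Float32 samples in [-1, 1]
    socket.send(JSON.stringify({
      type: 'audio',
      audio: Array.from(audioData)
    }));
  };

  socket.onmessage = (event) => {
    const result = JSON.parse(event.data);
    if (result.type === 'transcript') {
      const element = result.is_final ? 'final' : 'interim';
      document.getElementById(element).textContent = result.text;
    }
  };
}

startStreaming().catch((err) => console.error('Could not start audio capture', err));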

This pattern typically achieves 200–400 ms end-to-end latency (audio capture + transmission + ASR + response), dominated by audio buffering (100–200 ms) and ASR processing (100–200 ms).

Latency Optimization Strategies

Latency in streaming ASR comes from four sources: audio buffering, network transmission, ASR inference, and response serialization. To minimize:

| Source | Typical | Optimization |
| --- | --- | --- |
| Audio buffering | 100–200 ms | Use small chunks (10–20 ms); accept some jitter. |
| Network transmission | 10–50 ms | Use local servers or low-latency networks (5G, fiber). |
| ASR inference | 50–200 ms | Use smaller models; quantize weights; batch multiple frames. |
| Response serialization | 10–20 ms | Stream protobuf instead of JSON; use binary formats. |

For ultra-low-latency targets (around 100 ms), deploy a streaming-capable model locally, such as a Conformer or Wav2Vec2 with chunked inference, and serve it over gRPC streaming. For web-only clients, WebSocket with Google/Azure is the most practical option.
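To decide where to optimize first, it helps to add up the budget and rank the sources. The sketch below uses the ranges from the table above; the `LatencySource` class and the choice to rank by range midpoint are illustrative assumptions, and real systems should be measured rather than estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySource:
    name: str
    low_ms: float
    high_ms: float

    @property
    def midpoint_ms(self) -> float:
        return (self.low_ms + self.high_ms) / 2


# Ranges copied from the table in this section
BUDGET = [
    LatencySource("audio buffering", 100, 200),
    LatencySource("network transmission", 10, 50),
    LatencySource("ASR inference", 50, 200),
    LatencySource("response serialization", 10, 20),
]


def total_range(budget):
    """Best-case and worst-case end-to-end latency in milliseconds."""
    return sum(s.low_ms for s in budget), sum(s.high_ms for s in budget)


def largest_contributor(budget):
    """Source with the highest midpoint: the first candidate for optimization."""
    return max(budget, key=lambda s: s.midpoint_ms)


low, high = total_range(BUDGET)
print(f"End-to-end: {low}-{high} ms")
print(f"Optimize first: {largest_contributor(BUDGET).name}")
```

With these ranges, audio buffering is the largest contributor, which is why shrinking chunk size is usually the first change to try.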

Key Takeaways

  • Streaming ASR processes 10–20 ms audio frames and emits partial results in real time, achieving 100–500 ms latency.
  • Google Cloud Speech-to-Text and Azure Speech Services are production-ready for streaming with multi-language support and automatic punctuation.
  • WebSocket enables bidirectional audio/transcript streaming for web applications; end-to-end latency is typically 200–400 ms.
  • Latency comes from buffering (100–200 ms), network (10–50 ms), inference (50–200 ms), and serialization (10–20 ms); optimize the largest contributor first.
  • For <100 ms latency, deploy lightweight models (Wav2Vec2, Conformer) on-device with gRPC and quantization.

Frequently Asked Questions

How much audio must I buffer before the model processes it?

Most streaming models work on 10–20 ms frames (160–320 samples at 16 kHz). You can process every 10 ms, but practical buffering is 20–100 ms to reduce overhead. Google's API processes frames continuously as they arrive, not in batches.
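Grouping small frames into larger chunks before sending can be sketched as a generator. This is one possible approach, assuming 16-bit mono PCM at 16 kHz; `rechunk` is a hypothetical helper name, and the 100 ms default is just the upper end of the range mentioned above.

```python
from typing import Iterable, Iterator


def rechunk(stream: Iterable[bytes], chunk_ms: int = 100,
            sample_rate_hz: int = 16_000, bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Regroup arbitrarily sized PCM pieces into fixed-duration chunks."""
    chunk_bytes = sample_rate_hz * chunk_ms // 1000 * bytes_per_sample
    buffer = bytearray()
    for piece in stream:
        buffer.extend(piece)
        # Emit as many full chunks as the buffer currently holds
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]
    if buffer:
        yield bytes(buffer)  # Flush the trailing partial chunk at end of stream


# 25 frames of 10 ms (320 bytes each) regrouped into 100 ms chunks
frames = [b"\x00" * 320 for _ in range(25)]
print([len(chunk) for chunk in rechunk(frames)])  # [3200, 3200, 1600]
```

The generator can wrap any audio source that yields bytes, so it could sit between a microphone reader and the request generator of a streaming client.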

What is the difference between interim and final results in streaming ASR?

Interim results are partial transcripts that may change as more audio arrives; final results are committed and locked. A final result typically appears 100–500 ms after you stop speaking (silence detection). Use interim for live displays, final for logging.
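The interim/final distinction maps onto a small piece of state on the receiving side: interim text overwrites the previous interim text, while final text is appended and never changed. A minimal sketch, with `TranscriptAccumulator` as an illustrative class name rather than part of any SDK:

```python
class TranscriptAccumulator:
    """Keeps committed (final) segments separate from the volatile interim text."""

    def __init__(self):
        self.final_segments = []
        self.interim = ""

    def update(self, text: str, is_final: bool) -> None:
        if is_final:
            # Final results are locked in, and the interim text they replace is cleared
            self.final_segments.append(text.strip())
            self.interim = ""
        else:
            # Each interim result supersedes the previous one
            self.interim = text.strip()

    def display_text(self) -> str:
        """Text for a live display: committed segments plus the current interim guess."""
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)

    def log_text(self) -> str:
        """Text for logging: committed segments only."""
        return " ".join(self.final_segments)


acc = TranscriptAccumulator()
acc.update("turn on", is_final=False)
acc.update("turn on the", is_final=False)
acc.update("Turn on the lights.", is_final=True)
acc.update("and", is_final=False)
print(acc.display_text())  # Turn on the lights. and
print(acc.log_text())      # Turn on the lights.
```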

Can I reduce streaming ASR costs?

Google bills in 15-second increments; Azure prices by audio duration. If you transcribe many short utterances, rounding up to the billing increment makes costs add up. For cost-sensitive applications, batch audio into longer files and use offline ASR, or self-host a lightweight model like Wav2Vec2.

Does streaming ASR handle multiple speakers?

Basic streaming ASR treats all speakers as a single stream. To identify speakers in real-time, add a speaker diarization model (e.g., pyannote, WhisperSpeaker) or post-process final transcripts with speaker labels.

How do I handle audio dropout or network latency in WebSocket?

Buffer incoming audio on the client (50–100 ms) to absorb jitter. If the buffer underflows, insert silence. On the server, implement request timeouts (5–10 seconds per audio chunk) and reconnect logic on the client.
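The buffering advice above can be sketched as a small jitter buffer that pre-fills before playback and substitutes silence on underflow. This is one possible design under stated assumptions: 10 ms frames of 16-bit PCM (320 bytes), a 50 ms target depth, and re-buffering after each underflow. The `JitterBuffer` class is hypothetical, not a library API.

```python
from collections import deque


class JitterBuffer:
    """Holds a few frames before releasing audio; emits silence when it runs dry."""

    def __init__(self, frame_bytes: int = 320, target_frames: int = 5):
        self.frame_bytes = frame_bytes      # 10 ms of 16-bit PCM at 16 kHz
        self.target_frames = target_frames  # 5 x 10 ms = 50 ms of cushion
        self._frames = deque()
        self._primed = False
        self.underflows = 0

    def push(self, frame: bytes) -> None:
        """Store an arriving frame; start releasing once the cushion is full."""
        self._frames.append(frame)
        if len(self._frames) >= self.target_frames:
            self._primed = True

    def pop(self) -> bytes:
        """Return the next frame, or a frame of silence while buffering."""
        if self._primed and self._frames:
            return self._frames.popleft()
        if self._primed:
            # Ran dry mid-stream: count it and rebuild the cushion before resuming
            self.underflows += 1
            self._primed = False
        return b"\x00" * self.frame_bytes


buf = JitterBuffer(frame_bytes=4, target_frames=2)
buf.push(b"aaaa")
print(buf.pop())  # silence: the cushion is not full yet
buf.push(b"bbbb")
print(buf.pop(), buf.pop())  # the two buffered frames, in order
print(buf.pop(), buf.underflows)  # silence again, with one underflow recorded
```

A larger `target_frames` absorbs more jitter at the cost of added latency, so the 50–100 ms range trades directly against the buffering line of the latency budget.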

Further Reading