Voice Activity Detection (VAD): Detecting When Users Speak
Voice activity detection (VAD) is the process of determining whether a given audio frame contains human speech or just background noise/silence. For voice agents, VAD is crucial: it tells the STT engine when the user has finished speaking (silence detection), signals when turn-taking should occur, and reduces unnecessary processing of non-speech audio.
Unlike STT, which transcribes speech to text, VAD makes a binary decision per audio frame: speech or non-speech. It runs locally in real time (no cloud API needed) and adds minimal latency. This article explains how VAD works, compares algorithms, and shows how to integrate production-grade detection into your voice agent.
How VAD Works
A VAD algorithm analyzes short audio frames (typically 10–30 ms windows) and extracts acoustic features—patterns that distinguish speech from noise. The most common features are:
- Spectral energy: Speech concentrates energy in the 300–3400 Hz range (human vocal formants), while noise is often broadband. VAD compares energy in these bands to background levels.
- Zero-crossing rate: How often the waveform changes sign. Voiced speech (vowels) has a low rate, while unvoiced consonants and broadband noise have a high one; speech alternates between the two in a way that steady noise does not.
- Mel-frequency cepstral coefficients (MFCCs): A perceptually-inspired representation of audio that emphasizes frequencies humans hear best. Trained classifiers use MFCCs to distinguish speech from other sounds.
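To make the first two features concrete, here is a minimal sketch that computes frame energy and zero-crossing rate for a 16-bit mono PCM frame using only the standard library. The helper names (`frame_features`, `make_tone`, `make_noise`) and the synthetic test signals are illustrative assumptions, not part of any VAD library.

```python
import math
import random
import struct


def frame_features(pcm_bytes):
    """Return (energy_db, zero_crossing_rate) for one 16-bit mono PCM frame."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack(f"<{n}h", pcm_bytes[: n * 2])
    # Mean-square energy relative to full scale (32768), in dB
    mean_sq = sum(s * s for s in samples) / n
    energy_db = 10 * math.log10(mean_sq / (32768.0 ** 2) + 1e-12)
    # Fraction of adjacent sample pairs whose signs differ
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return energy_db, crossings / (n - 1)


def make_tone(freq_hz, n=320, rate=16000, amp=8000):
    """Synthetic stand-in for a voiced sound: a pure tone."""
    values = (int(amp * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n))
    return struct.pack(f"<{n}h", *values)


def make_noise(n=320, amp=300, seed=0):
    """Synthetic stand-in for low-level broadband background noise."""
    rng = random.Random(seed)
    return struct.pack(f"<{n}h", *(rng.randint(-amp, amp) for _ in range(n)))


tone_db, tone_zcr = frame_features(make_tone(200))
noise_db, noise_zcr = frame_features(make_noise())
print(f"tone:  {tone_db:.1f} dB, zcr={tone_zcr:.3f}")
print(f"noise: {noise_db:.1f} dB, zcr={noise_zcr:.3f}")
```

The low-pitched tone shows high energy with few zero crossings, while the quiet noise shows the opposite pattern. Real speech is messier than either, which is why practical detectors combine several features.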
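Most modern VAD systems use machine learning: a classifier is trained on large amounts of labeled audio (speech vs. noise). The WebRTC project's open-source VAD, which ships with the WebRTC stack used in Chrome and other real-time communication software, is a popular lightweight choice: it applies a Gaussian mixture model to sub-band energies, is optimized for low latency, runs on CPU, and is language-independent because it works on acoustic features rather than words.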
Three Common VAD Approaches
| Approach | Latency | Accuracy | CPU | Noise Robustness |
|---|---|---|---|---|
| Energy Threshold | 10–20 ms | 70–80% | Very low | Poor (false positives in noise) |
| Spectral Analysis | 20–50 ms | 80–90% | Low | Good |
| Machine Learning (WebRTC VAD) | 30–50 ms | 92–98% | Medium | Excellent |
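As an illustration of the first row, an energy-threshold detector can be sketched in a few lines. The `-35 dBFS` default and the helper names are assumptions chosen for the demo, not recommended values.

```python
import math
import struct


def energy_vad(pcm_bytes, threshold_db=-35.0):
    """Flag a 16-bit mono PCM frame as speech if its level exceeds threshold_db (dBFS)."""
    n = len(pcm_bytes) // 2
    if n == 0:
        return False
    samples = struct.unpack(f"<{n}h", pcm_bytes[: n * 2])
    mean_sq = sum(s * s for s in samples) / n
    level_db = 10 * math.log10(mean_sq / (32768.0 ** 2) + 1e-12)
    return level_db > threshold_db


def constant_level_frame(amplitude, n=320):
    """Test signal that alternates between -amplitude and +amplitude."""
    return struct.pack(f"<{n}h", *((amplitude if i % 2 else -amplitude) for i in range(n)))


quiet = constant_level_frame(100)   # roughly -50 dBFS
loud = constant_level_frame(5000)   # roughly -16 dBFS
print(energy_vad(quiet), energy_vad(loud))
```

The detector cannot tell a loud voice from a slammed door or a fan at the same level, which is the weakness the table's "Noise Robustness" column points at.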
Integrating WebRTC VAD
The WebRTC VAD is available as a Python package (webrtcvad) and requires minimal setup:
```python
import asyncio

import webrtcvad


class VADDetector:
    """
    Wraps WebRTC VAD for production speech detection.
    Handles frame validation and silence timeout logic.
    """

    def __init__(self, sample_rate=16000, aggressiveness=2):
        """
        Initialize VAD detector.

        Args:
            sample_rate: 8000, 16000, 32000, or 48000 Hz (the only rates WebRTC VAD accepts)
            aggressiveness: 0-3. Higher = more aggressive about filtering out non-speech.
                0: most lenient (catches faint speech, more false positives)
                3: most aggressive (misses faint speech, fewer false positives)
        """
        if sample_rate not in (8000, 16000, 32000, 48000):
            raise ValueError(
                f"Sample rate must be 8000, 16000, 32000, or 48000 Hz, got {sample_rate}"
            )
        # The class is webrtcvad.Vad (not VAD); the mode is its only argument
        self.vad = webrtcvad.Vad(aggressiveness)
        self.sample_rate = sample_rate
        self.frame_duration_ms = 20  # WebRTC VAD accepts only 10, 20, or 30 ms frames
        self.frame_size = int(sample_rate * self.frame_duration_ms / 1000)  # Samples per frame
        self.is_speech = False
        self.silence_count = 0
        self.silence_threshold = 10  # Frames of silence before declaring end-of-speech (200 ms)

    def process_frame(self, audio_bytes):
        """
        Analyze a single audio frame.

        Args:
            audio_bytes: 16-bit PCM audio, mono, exactly frame_size samples.
        Returns:
            is_speech: Boolean, True if frame contains speech.
        """
        # Validate frame size
        expected_bytes = self.frame_size * 2  # 2 bytes per 16-bit sample
        if len(audio_bytes) != expected_bytes:
            raise ValueError(f"Expected {expected_bytes} bytes, got {len(audio_bytes)}")
        # WebRTC VAD returns True if frame contains speech
        return self.vad.is_speech(audio_bytes, self.sample_rate)

    def detect_speech_boundaries(self, audio_bytes):
        """
        Process audio frame and track speech/silence regions.
        Maintains state: are we currently in speech? How long has it been silent?

        Returns:
            (is_currently_speaking, should_finalize_utterance)
        """
        if self.process_frame(audio_bytes):
            # Frame contains speech
            self.is_speech = True
            self.silence_count = 0
            return True, False
        if not self.is_speech:
            # Silence continuing
            return False, False
        # We were speaking, now silent
        self.silence_count += 1
        if self.silence_count >= self.silence_threshold:
            # Enough silence to consider speech ended
            self.is_speech = False
            self.silence_count = 0
            return False, True  # Signal: finalize the utterance
        return True, False  # Still in post-speech silence


# Example: streaming voice agent with VAD
async def voice_agent_with_vad():
    """
    Demonstrates VAD integration: listen to microphone,
    detect when user starts/stops speaking, and process only speech.
    """
    vad = VADDetector(sample_rate=16000, aggressiveness=2)
    audio_stream = FullDuplexAudioStream()  # From article 2 (must be defined or imported)
    audio_stream.start_input()
    audio_stream.start_output()
    frame_bytes = vad.frame_size * 2  # 640 bytes = 320 samples * 2 bytes
    utterance_audio = bytearray()  # Audio accumulated for the current utterance
    try:
        for _ in range(10000):  # Run for ~200 seconds (10000 * 20 ms)
            # Grab the last 20 ms of audio from input buffer
            audio_frame = audio_stream.get_input_audio(duration_ms=20)
            if len(audio_frame) < frame_bytes:
                await asyncio.sleep(0.02)
                continue
            # Analyze frame with VAD
            is_speaking, should_finalize = vad.detect_speech_boundaries(audio_frame)
            if is_speaking:
                utterance_audio.extend(audio_frame)
                print(".", end="", flush=True)  # Visual feedback: user is speaking
            else:
                print(" ", end="", flush=True)  # Silence
            if should_finalize:
                # User finished speaking: send audio to STT
                print("\n[VAD] User finished speaking. Sending to STT...")
                # Here you'd call your STT API with utterance_audio
                # and then feed the result to the LLM for response generation
                utterance_audio.clear()  # Reset for next utterance
            await asyncio.sleep(0.02)  # 20 ms per frame
    finally:
        audio_stream.stop()


asyncio.run(voice_agent_with_vad())
```
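Because the VAD rejects any frame that is not exactly the expected size, audio that arrives in arbitrary chunk sizes (for example from a network socket) has to be re-framed first. The following is a minimal sketch of such a buffer; `FrameChunker` is a hypothetical helper, not part of `webrtcvad`.

```python
class FrameChunker:
    """Buffers arbitrary-sized PCM chunks and yields exact fixed-size frames."""

    def __init__(self, sample_rate=16000, frame_ms=20):
        # 16-bit mono: 2 bytes per sample
        self.frame_bytes = sample_rate * frame_ms // 1000 * 2
        self._buffer = bytearray()

    def feed(self, chunk):
        """Add a chunk and yield every complete frame now available."""
        self._buffer.extend(chunk)
        while len(self._buffer) >= self.frame_bytes:
            frame = bytes(self._buffer[: self.frame_bytes])
            del self._buffer[: self.frame_bytes]
            yield frame


chunker = FrameChunker()
frames = list(chunker.feed(b"\x00" * 1000)) + list(chunker.feed(b"\x00" * 500))
print(len(frames), [len(f) for f in frames])
```

Leftover bytes stay in the buffer until the next chunk completes a frame, so no audio is dropped or padded.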
Fine-Tuning VAD Sensitivity
The aggressiveness parameter controls the tradeoff between false positives and false negatives:
- Aggressiveness 0: Most lenient. Catches faint whispers and soft speech, at the cost of more false positives on background noise. Use in quiet environments, or for accessibility features where missing speech is costly.
- Aggressiveness 2: Balanced (recommended for voice agents). Tolerates mild background noise while avoiding false positives on silence.
- Aggressiveness 3: Ignores anything quieter than clear speech. Use in very noisy environments (call centers, factories).
Test your VAD in the target environment before deploying. If users find the agent cuts them off too quickly, lower the aggressiveness or raise the silence threshold. If it keeps listening to background noise, raise the aggressiveness.
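One way to run such a test is to hand-label a short recording frame by frame and score each aggressiveness setting against the labels. The sketch below shows only the scoring step; `score_vad` and the toy decision lists are illustrative assumptions, and in practice the predictions would come from running the VAD over real frames.

```python
def score_vad(predicted, labels):
    """Return (false_positive_rate, false_negative_rate) for per-frame decisions."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    false_pos = sum(1 for p, t in zip(predicted, labels) if p and not t)
    false_neg = sum(1 for p, t in zip(predicted, labels) if not p and t)
    non_speech = sum(1 for t in labels if not t)
    speech = len(labels) - non_speech
    fpr = false_pos / non_speech if non_speech else 0.0
    fnr = false_neg / speech if speech else 0.0
    return fpr, fnr


# Toy data: ground truth plus the output of a lenient and a strict setting
labels = [False, False, True, True, True, True, False, False]
lenient = [False, True, True, True, True, True, True, False]
strict = [False, False, False, True, True, True, False, False]

print("lenient:", score_vad(lenient, labels))
print("strict: ", score_vad(strict, labels))
```

The lenient setting trades false positives for complete speech coverage, while the strict one does the reverse, which mirrors the tradeoff the aggressiveness parameter controls.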
Adaptive VAD with Noise Estimation
For maximum robustness, estimate the background noise level and adapt the VAD threshold:
```python
import numpy as np
import webrtcvad


class AdaptiveVAD:
    """
    Adjusts VAD decisions based on background noise level.
    """

    def __init__(self, sample_rate=16000):
        self.vad = webrtcvad.Vad(2)
        self.sample_rate = sample_rate
        self.noise_estimate = None
        self.frame_count = 0

    def estimate_noise_level(self, audio_bytes):
        """
        Estimate noise floor from the first 2 seconds of audio.
        Assumes the initial audio contains no speech.
        """
        if self.frame_count < 100:  # First 2 seconds (100 frames * 20 ms)
            signal = np.frombuffer(audio_bytes, dtype=np.int16).astype(float)
            energy = np.mean(signal ** 2)
            if self.noise_estimate is None:
                self.noise_estimate = energy
            else:
                # Exponential moving average: 90% old estimate, 10% new measurement
                self.noise_estimate = 0.9 * self.noise_estimate + 0.1 * energy
            self.frame_count += 1

    def is_speech_adaptive(self, audio_bytes):
        """
        Make VAD decision, adjusted for estimated noise level.
        """
        self.estimate_noise_level(audio_bytes)
        # Base VAD decision
        is_speech = self.vad.is_speech(audio_bytes, self.sample_rate)
        if is_speech or self.noise_estimate is None:
            return is_speech
        # VAD said "no speech": override only if the frame energy is well above
        # the noise floor (the original early return made this check unreachable)
        signal = np.frombuffer(audio_bytes, dtype=np.int16).astype(float)
        frame_energy = np.mean(signal ** 2)
        # Small epsilons avoid log(0) and division by zero on digital silence
        snr_db = 10 * np.log10((frame_energy + 1e-8) / (self.noise_estimate + 1e-8))
        return bool(snr_db > 15)  # 15 dB above noise = likely speech
```
Handling Edge Cases
Music and Speech Confusion
VAD sometimes misclassifies music or singing as speech, because both share energy in the same frequency bands. If your agent needs to handle music (e.g., the caller is on hold), pre-filter with a music detector or ask users not to play audio during conversations.
Cross-Talk (Multiple Speakers)
When two people talk simultaneously, WebRTC VAD detects speech but can't distinguish speakers. For agents, this usually means you wait for silence. For call centers, use speaker diarization (a separate model that identifies which speaker is talking) if you need per-speaker tracking.
Accents and Non-Native Speech
WebRTC VAD works on acoustic features rather than words, so it generally performs well across languages and accents. However, at higher aggressiveness settings it can drop very soft speech or unfamiliar speaking styles. Test with your target user population and adjust aggressiveness accordingly.
Key Takeaways
- Voice activity detection (VAD) classifies audio frames as speech or silence, enabling turn-taking and reducing wasted STT processing.
- WebRTC VAD is production-grade, low-latency, and open source; it is widely deployed as part of the WebRTC stack in browsers and communication apps.
- Aggressiveness parameter ranges 0–3; use 2 (balanced) for most voice agents, increase for noise, decrease for quiet or accented speech.
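- Implement silence timeout logic: when VAD reports silence for a sustained run of frames, finalize the user's utterance and send it to STT. The example above uses 10 frames (200 ms); 400–800 ms (20–40 frames) is more typical for conversational turn-taking.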
- Adaptive VAD estimates background noise and adjusts decisions based on signal-to-noise ratio, improving robustness in varying environments.
Frequently Asked Questions
Can I use VAD on compressed audio (MP3, Opus)?
No. VAD requires raw PCM audio (typically 16-bit signed integers). If your audio is compressed, decompress it first. Most realtime APIs deliver compressed Opus but decompress it server-side for VAD. If you're processing locally, decompress before VAD.
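For audio stored as an uncompressed WAV file, a format check before running VAD can be sketched with the standard-library `wave` module. The helper name `read_pcm_for_vad` is hypothetical, and the accepted sample rates are an assumption based on the rates WebRTC VAD supports; other formats (MP3, Opus) would need a separate decoder first.

```python
import os
import tempfile
import wave

SUPPORTED_RATES = (8000, 16000, 32000, 48000)  # Assumed: rates accepted by WebRTC VAD


def read_pcm_for_vad(path):
    """Read a WAV file and return (pcm_bytes, sample_rate) if it is 16-bit mono PCM."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        rate = wav.getframerate()
        if rate not in SUPPORTED_RATES:
            raise ValueError(f"unsupported sample rate: {rate}")
        return wav.readframes(wav.getnframes()), rate


# Demo: write 20 ms of silence to a temporary WAV file, then read it back
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 320)

pcm, rate = read_pcm_for_vad(path)
print(len(pcm), rate)
```

The returned bytes can then be split into 10, 20, or 30 ms frames and passed to the VAD one frame at a time.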
Why does my VAD trigger on background noise?
VAD aggressiveness may be too low (too lenient). Increase from 2 to 3. Alternatively, estimate the noise floor in the first 2 seconds of silence (before the user speaks) and raise the detection threshold. Machine learning-based VAD is more robust than energy thresholds alone.
How long of silence indicates the user is done speaking?
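Typically 400–800 ms (20–40 frames of silence at 20 ms per frame). Shorter timeouts cut users off during natural mid-sentence pauses; longer timeouts add latency before the agent responds. A timeout near 800 ms is a common starting point that you can tune down once you have measured real conversations.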
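The effect of the timeout can be seen by replaying a sequence of per-frame VAD decisions through a simple end-pointing loop. This is a sketch under the assumption of 20 ms frames; `frames_for_timeout` and `find_endpoints` are hypothetical helpers written for this illustration.

```python
def frames_for_timeout(timeout_ms, frame_ms=20):
    """Number of consecutive non-speech frames covering timeout_ms (rounded up)."""
    return -(-timeout_ms // frame_ms)


def find_endpoints(decisions, silence_frames):
    """Return the frame indices at which an utterance would be finalized."""
    endpoints = []
    in_speech = False
    silent_run = 0
    for index, is_speech in enumerate(decisions):
        if is_speech:
            in_speech = True
            silent_run = 0
        elif in_speech:
            silent_run += 1
            if silent_run >= silence_frames:
                endpoints.append(index)
                in_speech = False
                silent_run = 0
    return endpoints


# 5 speech frames, a 200 ms pause, 5 more speech frames, then 800 ms of silence
decisions = [True] * 5 + [False] * 10 + [True] * 5 + [False] * 40

short = find_endpoints(decisions, frames_for_timeout(200))
long = find_endpoints(decisions, frames_for_timeout(800))
print("200 ms timeout:", short)
print("800 ms timeout:", long)
```

With the 200 ms timeout the mid-sentence pause splits one utterance into two, while the 800 ms timeout keeps it whole but waits longer before finalizing.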
Can VAD work on telephony audio (8 kHz)?
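Yes. WebRTC VAD supports an 8 kHz sample rate. Narrowband audio carries only half the bandwidth of 16 kHz audio (up to 4 kHz instead of 8 kHz), so some high-frequency speech detail is lost. Upsampling telephony audio to 16 kHz adds no information, and WebRTC VAD resamples its input to 8 kHz internally anyway, so feed the 8 kHz audio directly. If accuracy on phone calls is not good enough, consider a VAD model trained specifically for telephony.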
What's the difference between VAD and speech detection?
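The two terms are often used interchangeably. When a distinction is drawn, VAD answers only the narrow question "is a human voice present in this frame?", while speech detection is used more broadly and may include identifying the spoken language or the speaker. In that sense, VAD is a building block for speech detection.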