Barge-In Interruptions: Let Users Cut Off Your Agent
Barge-in is the ability for a user to interrupt an agent mid-sentence and have the agent immediately stop speaking and start listening. Without barge-in, conversations feel one-sided and robotic: the user must wait for the agent to finish before they can speak. This is unacceptable for natural voice interactions. Barge-in detection, however, is subtle: you must distinguish between momentary overlap (the user clears their throat while you're speaking) and genuine interruption (the user wants to cut in).
This article covers barge-in detection algorithms, integration with your voice agent's state machine, and strategies to make interruptions feel natural rather than jarring.
Detecting Barge-In
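Barge-in detection relies on analyzing two audio streams at once: the agent's TTS output that you are playing and the microphone input that you are capturing. If the microphone shows sustained speech energy while the agent is mid-response, and that energy is not just the agent's own voice leaking back into the microphone, the user is probably trying to interrupt.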
Simple Energy-Based Barge-In
The simplest approach compares microphone signal energy to a threshold:
import numpy as np

class SimpleBargeInDetector:
    """
    Detects user speech overlapping with agent speech.
    Uses energy thresholding: fast, but prone to false positives.
    """
    def __init__(self, sample_rate=16000, energy_threshold_db=-30):
        """
        Args:
            sample_rate: Audio sample rate (Hz)
            energy_threshold_db: Speech detection threshold in dB relative to full scale.
                -30 dB: sensitive, catches whispers and background noise
                -20 dB: moderate, standard voice
                -10 dB: strict, only loud speech
        """
        self.sample_rate = sample_rate
        self.energy_threshold_db = energy_threshold_db
        self.noise_floor = -50  # Estimated baseline noise (dB); a fixed guess, not measured

    def calculate_energy_db(self, audio_frame):
        """
        Calculate RMS energy of audio frame in dB.
        Args:
            audio_frame: 16-bit PCM audio bytes
        Returns:
            Energy in dB (where 0 dB = full scale)
        """
        signal = np.frombuffer(audio_frame, dtype=np.int16).astype(float)
        if signal.size == 0:
            return -200.0  # Empty frame: report the same value as digital silence
        rms = np.sqrt(np.mean(signal ** 2))
        # RMS to dB: dB = 20 * log10(RMS / max_value)
        # max value for 16-bit signed int = 32767
        # The 1e-10 offset avoids log10(0) on digital silence
        energy_db = 20 * np.log10(rms / 32767 + 1e-10)
        return energy_db

    def is_user_interrupting(self, audio_frame, agent_is_speaking=True):
        """
        Determine if user is interrupting.
        Args:
            audio_frame: Current microphone audio
            agent_is_speaking: Boolean, is agent currently playing TTS?
        Returns:
            True if barge-in detected, False otherwise
        """
        if not agent_is_speaking:
            return False  # Only detect barge-in when agent is speaking
        energy_db = self.calculate_energy_db(audio_frame)
        # Barge-in is detected if energy exceeds threshold
        # and exceeds background noise by a margin (SNR > 10 dB).
        # With the fixed -50 dB floor, the SNR check only matters for thresholds
        # below -40 dB; it becomes useful once noise_floor is measured.
        snr_db = energy_db - self.noise_floor
        return bool(energy_db > self.energy_threshold_db and snr_db > 10)

# Usage
detector = SimpleBargeInDetector(energy_threshold_db=-20)
# In your audio loop:
# if detector.is_user_interrupting(audio_frame, agent_is_speaking=True):
#     stop_tts()
#     transition_to_listening()
ML-Based Barge-In with VAD
A more robust approach combines voice activity detection with timing:
import asyncio
from collections import deque

class MLBargeInDetector:
    """
    Uses VAD + timing to detect barge-in.
    More accurate than energy alone; requires VAD model.
    """
    def __init__(self, vad_detector, barge_in_delay_ms=300):
        """
        Args:
            vad_detector: WebRTC VAD instance (from article 3)
            barge_in_delay_ms: Grace period before recognizing barge-in.
                Prevents false positives on mouth clicks, breath noise.
        """
        self.vad = vad_detector
        self.barge_in_delay_ms = barge_in_delay_ms
        self.barge_in_frame_count = 0
        # Convert to frames (20 ms each); require at least one frame so a very
        # short delay cannot cause a zero threshold or a division by zero
        self.frames_until_barge_in = max(1, int(barge_in_delay_ms / 20))
        self.speech_window = deque(maxlen=50)  # Recent VAD decisions (last 1 s of history)

    def process_frame(self, audio_frame):
        """
        Update barge-in detection based on new audio frame.
        Returns:
            (is_speech, barge_in_detected)
        """
        is_speech = self.vad.process_frame(audio_frame)
        self.speech_window.append(is_speech)
        if is_speech:
            self.barge_in_frame_count += 1
        else:
            self.barge_in_frame_count = 0
        # Barge-in is detected if VAD reports speech for N consecutive frames
        barge_in = self.barge_in_frame_count >= self.frames_until_barge_in
        return is_speech, barge_in

    def get_barge_in_confidence(self):
        """
        Return 0–1 confidence score for barge-in.
        Useful for gradual response (e.g., fade out TTS instead of abrupt stop).
        """
        return min(1.0, self.barge_in_frame_count / self.frames_until_barge_in)
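To see why the consecutive-frame rule filters out coughs and clicks, it can help to run it on scripted VAD decisions. The sketch below is a standalone illustration, not part of the detector: `first_barge_in_frame` is a hypothetical helper that mirrors the counter logic above, and the frame sequences are made-up inputs.

```python
from typing import Iterable


def first_barge_in_frame(vad_decisions: Iterable[bool], frames_required: int) -> int:
    """Return the index of the frame at which barge-in would fire, or -1.

    Mirrors the consecutive-frame rule: the counter grows on speech frames
    and resets to zero on any non-speech frame.
    """
    consecutive = 0
    for index, is_speech in enumerate(vad_decisions):
        consecutive = consecutive + 1 if is_speech else 0
        if consecutive >= frames_required:
            return index
    return -1


FRAME_MS = 20
frames_required = 300 // FRAME_MS  # 15 frames for a 300 ms grace period

# A 100 ms cough (5 speech frames) followed by silence never fires.
cough = [True] * 5 + [False] * 20
# A short noise, a gap, then sustained speech starting at frame 8.
interruption = [True] * 5 + [False] * 3 + [True] * 30

print(first_barge_in_frame(cough, frames_required))         # -1
print(first_barge_in_frame(interruption, frames_required))  # 22
```

The second case fires at frame 22, which is 15 frames (300 ms) after the sustained speech begins at frame 8. The grace period is therefore also the minimum latency before the agent stops talking.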
Integrating Barge-In into Your Voice Agent
Update your turn-taking manager to handle barge-in:
class TurnTakingManagerWithBargeIn(TurnTakingManager):
    """Extends TurnTakingManager from article 4 with barge-in handling."""
    def __init__(self, *args, barge_in_detector, **kwargs):
        super().__init__(*args, **kwargs)
        self.barge_in_detector = barge_in_detector
        self.agent_is_speaking = False
        self.barge_in_detected = False

    async def on_audio_frame(self, audio_frame):
        """Process audio frame, checking for barge-in."""
        # Always check for barge-in when agent is speaking
        if self.agent_is_speaking:
            is_speech, barge_in = self.barge_in_detector.process_frame(audio_frame)
            if barge_in:
                print("[Barge-In] User interrupted agent")
                self.barge_in_detected = True
                await self._handle_barge_in_interrupt()
                return
        # Continue normal turn-taking logic
        await super().on_audio_frame(audio_frame)

    async def _handle_barge_in_interrupt(self):
        """Stop agent speech and prepare to listen."""
        self.agent_is_speaking = False
        self.state = ConversationState.LISTENING
        self.barge_in_detected = False
        # Reset the detector's consecutive-speech counter so a stale count
        # cannot trigger barge-in the moment the agent next speaks
        self.barge_in_detector.barge_in_frame_count = 0
        # Signal TTS stream to stop (implementation depends on TTS library)
        await self.tts.cancel_streaming()
        # Clear pending user audio to start fresh.
        # If this buffer already holds the frames that triggered barge-in,
        # clearing it drops the start of the user's utterance; keep it instead
        # if your ASR needs the onset.
        self.pending_user_audio = bytearray()
        self.utterance_start = None
        print("[Barge-In] Agent stopped. Ready to listen.")

    async def _transition_to_speaking(self):
        """Override: set agent_is_speaking flag for barge-in detection."""
        self.barge_in_detector.barge_in_frame_count = 0
        self.agent_is_speaking = True
        try:
            await super()._transition_to_speaking()
        finally:
            # Clear the flag even if speaking is cancelled or raises
            self.agent_is_speaking = False
Strategies for Graceful Interruption
Abruptly stopping TTS can feel jarring. Consider these strategies:
Strategy 1: Crossfade
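Fade the agent's voice out over a short window (200 ms here) rather than cutting it off. Despite the name, only the TTS output is faded; the microphone keeps capturing at full level throughout: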
async def crossfade_to_user(self, fade_duration_ms=200):
    """
    Fade out agent TTS, then stop it.
    Creates a smooth transition rather than abrupt cutoff.
    Written as a method of the turn-taking manager; requires `import asyncio`.
    """
    frames_to_fade = int(fade_duration_ms / 20)  # 20 ms per frame
    try:
        for i in range(frames_to_fade):
            # Reduce TTS volume linearly from 1.0 toward 0
            fade_factor = (frames_to_fade - i) / frames_to_fade
            await self.tts.set_volume(fade_factor)
            # Optionally, amplify captured user audio
            # (useful if user speaks quietly during interruption)
            await asyncio.sleep(0.02)  # 20 ms
        # Complete stop
        await self.tts.cancel_streaming()
    finally:
        # Reset for next response, even if the fade itself is cancelled
        await self.tts.set_volume(1.0)
Strategy 2: Finish Current Sentence
Don't interrupt in the middle of a word. Detect sentence boundaries and finish the current sentence before listening:
def get_sentence_boundary(self, text, current_char_pos):
    """
    Find the end of the current sentence in TTS text.
    Allows agent to finish a complete thought before listening.
    """
    # Look for sentence-ending punctuation.
    # Caveat: a period inside "3.14" or "Dr." also counts as a boundary here.
    end_markers = ['.', '!', '?']
    for i in range(current_char_pos, len(text)):
        if text[i] in end_markers:
            # Found end of sentence
            remaining_text = text[current_char_pos:i+1]
            return remaining_text
    # No sentence boundary found; return current word boundary
    space_pos = text.find(' ', current_char_pos)
    if space_pos != -1:
        return text[current_char_pos:space_pos]
    return text[current_char_pos:]

async def finish_sentence_before_listening(self):
    """
    Agent finishes the current sentence, then listens to user.
    """
    if not self.response_text or not self.agent_is_speaking:
        return
    # Calculate current playback position in TTS.
    # estimate_chars_played() is a helper you must supply, for example from
    # TTS word timings or from elapsed playback time.
    elapsed_chars = self.estimate_chars_played()
    remaining = self.get_sentence_boundary(self.response_text, elapsed_chars)
    if remaining:
        print(f"[Barge-In] Finishing sentence: {remaining}")
        # Continue TTS for the remaining text.
        # Assumes stream_async() replaces the in-flight stream; if your TTS
        # library queues instead, the remaining text would be spoken twice.
        await self.tts.stream_async(remaining)
    # Now ready to listen
    await self._handle_barge_in_interrupt()
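The `estimate_chars_played()` call above is not defined in this article. One rough way to approximate it, assuming the TTS voice speaks at a roughly constant rate, is to scale elapsed playback time by a characters-per-second figure. Everything in this sketch is hypothetical: the class name, the 15 characters-per-second default, and the decision to pass the text in as an argument. If your TTS library reports word or mark timings, those would likely be more accurate.

```python
import time


class PlaybackPositionEstimator:
    """Rough estimate of how many characters of a response have been spoken.

    Assumes a constant speaking rate, which real TTS voices only approximate.
    """

    def __init__(self, chars_per_second: float = 15.0, clock=time.monotonic):
        # Hypothetical default rate; measure your own voice and language.
        self.chars_per_second = chars_per_second
        self._clock = clock  # Injectable so the estimator can be tested
        self._started_at = None

    def start(self) -> None:
        """Call when TTS playback of a response begins."""
        self._started_at = self._clock()

    def estimate_chars_played(self, text: str) -> int:
        """Return an index into `text`, clamped to its length."""
        if self._started_at is None:
            return 0  # Playback has not started
        elapsed = self._clock() - self._started_at
        return min(len(text), int(elapsed * self.chars_per_second))
```

The estimate drifts on long responses, on numbers and abbreviations that expand when spoken, and whenever audio buffering delays playback, so treat the returned index as approximate.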
Tuning Barge-In Sensitivity
Test your barge-in detection with varying scenarios:
- User speaking loudly: Should always trigger (high SNR).
- User whispering: May not trigger if threshold is too high. Lower the threshold or use adaptive detection that accounts for ambient noise.
- Background noise: Should not trigger. Ensure noise floor estimation is accurate.
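- Agent TTS: The agent's own voice should not trigger barge-in. Apply acoustic echo cancellation to the microphone signal, or keep playback and capture on separate channels (stereo) and time-align them, so that TTS leaking into the microphone is not mistaken for the user.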
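The background-noise case depends on a noise floor that follows the environment rather than a fixed constant. A minimal sketch of one common approach, an exponential moving average over non-speech frames, is shown below. The class name, the smoothing factor `alpha`, and the 10 dB margin are illustrative assumptions, not values taken from the detectors above.

```python
class AdaptiveNoiseFloor:
    """Tracks ambient noise level (dB) with an exponential moving average."""

    def __init__(self, initial_db: float = -50.0, alpha: float = 0.05,
                 margin_db: float = 10.0):
        self.noise_floor_db = initial_db
        self.alpha = alpha          # Higher values adapt faster but are jumpier
        self.margin_db = margin_db  # Required gap between speech and noise

    def update(self, energy_db: float, is_speech: bool) -> None:
        """Fold one frame into the estimate, skipping frames flagged as speech."""
        if is_speech:
            return  # Speech would drag the floor upward
        self.noise_floor_db += self.alpha * (energy_db - self.noise_floor_db)

    def exceeds_floor(self, energy_db: float) -> bool:
        """True if the frame is louder than the floor by more than the margin."""
        return energy_db - self.noise_floor_db > self.margin_db


floor = AdaptiveNoiseFloor()
for _ in range(200):  # About 4 s of steady -35 dB room noise at 20 ms per frame
    floor.update(-35.0, is_speech=False)

print(round(floor.noise_floor_db, 1))
print(floor.exceeds_floor(-30.0), floor.exceeds_floor(-20.0))
```

It is probably safest to update the floor only while the agent is silent, since TTS leaking into the microphone would otherwise inflate the estimate.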
Key Takeaways
- Barge-in detection identifies when a user interrupts agent speech. Two approaches: energy thresholding (simple, fast) and VAD with timing (more accurate).
- Integrate barge-in detection into your turn-taking state machine: when speaking, monitor microphone for interruption signals.
- Graceful interruption strategies include crossfading and finishing the current sentence; replaying a shortened summary of the response is a further option that this article does not detail.
- Tune barge-in sensitivity per deployment environment: loud call centers use high thresholds; quiet home offices use low thresholds.
- Test with diverse speakers: non-native accents, soft voices, and fast talkers all affect VAD-based barge-in detection.
Frequently Asked Questions
Can I detect barge-in on phone calls (PSTN)?
Yes, but with caveats. Phone audio is narrowband (sampled at 8 kHz) and compressed, so it is lower quality than typical microphone input. Combine VAD with speech energy detection, and increase the grace period (barge_in_delay_ms) to 400–500 ms to account for network latency and compression artifacts.
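Moving from 16 kHz to 8 kHz changes the size of each 20 ms frame, and a longer grace period changes how many consecutive speech frames are required. The arithmetic can be sketched as follows; both helper names are hypothetical, and the byte counts assume 16-bit linear PCM. Telephony audio often arrives as 8-bit G.711 and would need decoding to 16-bit PCM first.

```python
def frame_bytes(sample_rate_hz: int, frame_ms: int = 20, bytes_per_sample: int = 2) -> int:
    """Size in bytes of one mono PCM frame."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample


def frames_for_delay(delay_ms: int, frame_ms: int = 20) -> int:
    """Consecutive speech frames needed to cover the grace period (at least 1)."""
    return max(1, delay_ms // frame_ms)


print(frame_bytes(16000))     # 640 bytes per 20 ms frame at 16 kHz
print(frame_bytes(8000))      # 320 bytes per 20 ms frame at 8 kHz
print(frames_for_delay(300))  # 15 frames
print(frames_for_delay(500))  # 25 frames
```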
What if the user has a strong accent?
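WebRTC VAD (used in ML-based barge-in) classifies frames from sub-band energy features with Gaussian mixture models rather than recognizing words, so accent by itself usually matters less than volume, microphone quality, and background noise. Still, test with speakers from your target market. If the VAD under-triggers, try a less aggressive VAD mode or reduce the number of consecutive speech frames required (a shorter barge_in_delay_ms). Increasing the frame count makes barge-in harder to trigger, not easier.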
Should I let the user barge in at any time?
Mostly yes, but there are exceptions. During critical information playback (e.g., a phone number or confirmation), you may want to suppress barge-in briefly. Implement state-aware barge-in: allow interruptions for general conversation, suppress during critical sequences.
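One way to sketch state-aware suppression is a small gate object that the audio loop consults before acting on a detection. The `BargeInGate` class and its method names below are hypothetical; the nesting counter is a design choice that lets overlapping critical sections compose without re-enabling barge-in too early.

```python
from contextlib import contextmanager


class BargeInGate:
    """Tracks whether interruptions are currently allowed."""

    def __init__(self):
        self._suppress_depth = 0

    @property
    def allowed(self) -> bool:
        return self._suppress_depth == 0

    @contextmanager
    def suppressed(self):
        """Suppress barge-in for the duration of a critical playback."""
        self._suppress_depth += 1
        try:
            yield
        finally:
            # Always restore, even if playback raises or is cancelled
            self._suppress_depth -= 1


gate = BargeInGate()
print(gate.allowed)          # True: general conversation
with gate.suppressed():
    print(gate.allowed)      # False: e.g. while reading out a phone number
print(gate.allowed)          # True again
```

In the audio loop, a detection would then be acted on only when `gate.allowed` is true. Keeping suppressed windows short matters, because a user who cannot interrupt for several seconds may assume the agent is not listening.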
How do I handle rapid interruptions (user interrupts, agent responds, user interrupts again)?
Your turn-taking state machine should handle this naturally. Each time barge-in is detected, you transition back to LISTENING, capture the user's input, and generate a new response. This loops smoothly as long as state transitions are atomic and non-blocking.
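As a rough illustration of repeated interruptions, the sketch below runs playback as a cancellable asyncio task and waits for the cancellation to finish before changing state. It is a toy model, not the article's TurnTakingManager: `InterruptibleSpeaker`, `State`, and the use of `asyncio.sleep` as a stand-in for TTS playback are all assumptions made for the example.

```python
import asyncio
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class InterruptibleSpeaker:
    """Toy model: playback runs as a task that barge-in can cancel."""

    def __init__(self):
        self.state = State.LISTENING
        self.interruptions = 0
        self._speak_task = None

    async def speak(self, duration_s: float) -> None:
        await self.interrupt()  # Never let two responses overlap
        self.state = State.SPEAKING
        self._speak_task = asyncio.create_task(self._play(duration_s))

    async def _play(self, duration_s: float) -> None:
        try:
            await asyncio.sleep(duration_s)  # Stand-in for streaming TTS audio
        finally:
            self.state = State.LISTENING

    async def interrupt(self) -> None:
        task, self._speak_task = self._speak_task, None
        if task is not None and not task.done():
            task.cancel()
            try:
                await task  # Wait until playback has actually stopped
            except asyncio.CancelledError:
                pass
            self.interruptions += 1
        self.state = State.LISTENING


async def demo() -> int:
    speaker = InterruptibleSpeaker()
    for _ in range(3):             # User interrupts three responses in a row
        await speaker.speak(10.0)
        await asyncio.sleep(0)     # Let playback start
        await speaker.interrupt()
    return speaker.interruptions


print(asyncio.run(demo()))  # 3
```

Awaiting the cancelled task before starting the next response is what keeps each transition effectively atomic here: no new playback begins until the previous one has fully unwound.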
Can barge-in work with multiple agents or group conversations?
Not without speaker diarization. Single-agent barge-in only needs to distinguish the user from the agent. Multi-agent or group scenarios require identifying which participant is speaking, which is significantly more complex. Use a speaker diarization toolkit (e.g., pyannote.audio) for this.