Full-Duplex Audio Streaming: Bidirectional Conversations

Full-duplex audio streaming means both the user and the agent can speak at the same time, with audio flowing in both directions simultaneously over a single connection. This is fundamentally different from half-duplex (walkie-talkie style, where only one speaker can transmit at a time) or simplex (one-way only). Full-duplex is what makes voice agents feel natural—just like a phone call, where you can interrupt and be interrupted.

Technically, full-duplex requires careful buffer management, echo cancellation, and the ability to handle asynchronous audio frames arriving out of order or with variable delays. This article shows you how to implement full-duplex streaming, optimize buffer sizes for low latency, and avoid common pitfalls like audio overlap and feedback loops.

Understanding Audio Buffering and Synchronization

When both user and agent are producing audio simultaneously, you must manage two independent streams: input (user's microphone) and output (agent's speaker). A naive approach alternates between them: it buffers all input, then plays the agent's output and ignores the microphone until playback finishes. This serializes the conversation and kills the natural feel.

Instead, run input capture and output playback in separate threads or async tasks. Input continuously reads from the microphone and appends to a circular buffer; output independently reads from a queue of TTS frames and plays them. The LLM processes the input stream asynchronously, producing text output that gets fed to TTS.

The Triple-Buffer Pattern

A robust full-duplex system uses three buffers:

  1. Input Audio Buffer (microphone → STT): A ring buffer holding 200–500 ms of captured audio. New audio overwrites old data when the buffer fills, ensuring you never lose the most recent frames.
  2. Output Audio Buffer (TTS → speaker): A queue of TTS-generated audio frames. As TTS produces frames, they're queued; the playback thread drains the queue at real-time speed (16 kHz).
  3. Transcription Buffer (STT results): Holds interim and final transcripts from the STT API, allowing the LLM to process complete sentences without waiting for every character.

Here's an implementation in Python. Capture and playback run in background threads, and asyncio drives the demo loop that feeds the output queue:

import asyncio
import queue
import threading
from collections import deque

import numpy as np
import pyaudio


class FullDuplexAudioStream:
    """
    Manages full-duplex audio I/O: simultaneously captures microphone input
    and plays speaker output without blocking either stream.
    """

    def __init__(self, sample_rate=16000, chunk_duration_ms=20):
        self.sample_rate = sample_rate
        self.chunk_duration_ms = chunk_duration_ms
        self.chunk_size = int(sample_rate * chunk_duration_ms / 1000)  # 320 samples at 16 kHz, 20 ms

        # Input buffer: circular, holds up to 500 ms of audio (25 chunks of 20 ms each)
        self.input_buffer = deque(maxlen=500 // chunk_duration_ms)
        self.input_lock = threading.Lock()  # guards input_buffer across threads
        self.chunks_captured = 0

        # Output queue: FIFO for TTS audio frames waiting to be played.
        # queue.Queue is thread-safe; asyncio.Queue is not, and playback runs in a thread.
        self.output_queue = queue.Queue(maxsize=50)

        # PyAudio configuration
        self.p = pyaudio.PyAudio()
        self.input_stream = None
        self.output_stream = None
        self.running = False
        self.threads = []

    def start_input(self):
        """Open microphone and start background capture thread."""
        self.input_stream = self.p.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size,
        )
        self.running = True

        # Start input in background thread (non-blocking)
        input_thread = threading.Thread(target=self._capture_audio, daemon=True)
        input_thread.start()
        self.threads.append(input_thread)
        print("Input stream started (microphone capture)")

    def start_output(self):
        """Open speaker and start background playback thread."""
        self.output_stream = self.p.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            output=True,
            frames_per_buffer=self.chunk_size,
        )
        # Set the flag here too, so playback works even if start_input() was not called first
        self.running = True

        # Start output in background thread
        output_thread = threading.Thread(target=self._playback_audio, daemon=True)
        output_thread.start()
        self.threads.append(output_thread)
        print("Output stream started (speaker playback)")

    def _capture_audio(self):
        """Background thread: continuously capture audio from microphone."""
        while self.running:
            try:
                # Read one chunk (320 samples = 20 ms at 16 kHz)
                audio_data = self.input_stream.read(self.chunk_size, exception_on_overflow=False)

                # Append to circular buffer (oldest data is auto-discarded when full)
                with self.input_lock:
                    self.input_buffer.append(audio_data)

                # Log every 50 chunks (~1 second) to avoid spam. Count chunks separately:
                # the buffer length never exceeds its maxlen of 25.
                self.chunks_captured += 1
                if self.chunks_captured % 50 == 0:
                    print(f"[Input] Captured {self.chunks_captured * self.chunk_duration_ms} ms of audio")
            except Exception as e:
                print(f"Microphone error: {e}")
                self.running = False

    def _playback_audio(self):
        """Background thread: continuously play audio from output queue to speaker."""
        while self.running:
            try:
                # Wait up to 100 ms for a frame so the loop can notice shutdown
                audio_frame = self.output_queue.get(timeout=0.1)
                self.output_stream.write(audio_frame)  # blocks for the frame's duration
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Speaker error: {e}")
                self.running = False

    def get_input_audio(self, duration_ms=100):
        """
        Retrieve audio from input buffer (non-destructive read).
        Useful for STT: grab the last 100 ms of captured audio.
        """
        num_chunks = duration_ms // self.chunk_duration_ms  # Convert ms to chunks
        with self.input_lock:
            chunks = list(self.input_buffer)
        if num_chunks <= 0 or len(chunks) < num_chunks:
            return b""
        # Concatenate the last N chunks
        return b"".join(chunks[-num_chunks:])

    async def queue_output_audio(self, audio_frame):
        """
        Queue an audio frame for playback. Called from the asyncio side (e.g. a TTS task).
        Waits if the queue is full (backpressure protection) without blocking the event loop.
        Requires Python 3.9+ for asyncio.to_thread.
        """
        await asyncio.to_thread(self.output_queue.put, audio_frame)

    def stop(self):
        """Shut down both streams cleanly."""
        self.running = False
        # Let the worker loops exit before closing the streams they use
        for thread in self.threads:
            thread.join(timeout=1.0)
        if self.input_stream:
            self.input_stream.stop_stream()
            self.input_stream.close()
        if self.output_stream:
            self.output_stream.stop_stream()
            self.output_stream.close()
        self.p.terminate()
        print("Audio streams closed")


# Example usage: run input and output concurrently
async def full_duplex_demo():
    """Demonstrate full-duplex: capture input, queue synthetic output, both running simultaneously."""
    audio = FullDuplexAudioStream()
    audio.start_input()
    audio.start_output()

    try:
        # Simulate TTS output: every 2 seconds, queue a synthetic tone
        for i in range(5):
            print(f"\n[{i+1}] Generating output audio...")
            # Create a synthetic 440 Hz tone
            duration_samples = audio.sample_rate // 2  # 0.5 second
            t = np.arange(duration_samples) / audio.sample_rate
            tone = np.sin(2 * np.pi * 440 * t) * 0.3  # 440 Hz, amplitude 0.3
            audio_frame = (tone * 32767).astype(np.int16).tobytes()

            # Queue it for playback (returns right away unless the queue is full)
            await audio.queue_output_audio(audio_frame)

            # Meanwhile, the input thread is still capturing
            input_audio = audio.get_input_audio(duration_ms=100)
            print(f"  Input buffer holds {len(input_audio)} bytes of audio")

            await asyncio.sleep(2)
    finally:
        audio.stop()


if __name__ == "__main__":
    asyncio.run(full_duplex_demo())
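
The class above covers the input and output buffers but not the third one. A minimal sketch of a transcription buffer might look like the following. The class name TranscriptBuffer and its methods are hypothetical, and the sketch assumes the STT service sends interim results that replace one another until a final result arrives:

```python
class TranscriptBuffer:
    """Holds the latest interim transcript plus finalized segments (hypothetical helper)."""

    def __init__(self):
        self.final_segments = []  # finalized text, in arrival order
        self.interim = ""         # most recent non-final hypothesis

    def add(self, text, is_final):
        """Record one STT result; interim results overwrite each other."""
        if is_final:
            self.final_segments.append(text.strip())
            self.interim = ""
        else:
            self.interim = text.strip()

    def pop_final(self):
        """Return all finalized text as one string and clear it, ready for the LLM."""
        text = " ".join(self.final_segments)
        self.final_segments.clear()
        return text


buf = TranscriptBuffer()
buf.add("what's the", is_final=False)
buf.add("what's the weather", is_final=False)
buf.add("What's the weather today?", is_final=True)
print(buf.pop_final())
```

Handing the LLM only finalized segments avoids re-processing text that the recognizer may still revise; the interim field remains available for early hints such as detecting that the user has started speaking.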

Handling Echo and Acoustic Feedback

When microphone and speaker are close (like a laptop), you risk picking up your own audio output in the microphone—creating a feedback loop. Full-duplex systems must implement echo cancellation: detecting speaker output in the microphone stream and removing it.

Echo Cancellation Strategies

  1. Simple Delay-and-Subtract: If you know the exact delay between speaker output and microphone pickup (typically 50–150 ms), subtract a scaled copy of the speaker signal from the microphone signal. This works if the delay is stable but fails when it drifts, for example when audio driver buffering or the room acoustics change.

  2. Adaptive Filtering: Use algorithms like NLMS (normalized least mean squares) to dynamically estimate and remove echo. The WebRTC project's open-source echo cancellation module, which ships in Chrome, is production-grade.

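  3. API-Level Support: Some realtime speech APIs and client SDKs (from providers such as OpenAI, Google, and Azure) offer echo cancellation or rely on the platform's. Because the sender knows what audio it is playing, that audio can serve as the reference signal to remove from the incoming stream. Confirm what your provider actually does before relying on it.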

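For critical applications, use a production echo canceller such as the one in WebRTC. The sketch below only illustrates the delay-and-subtract idea with NumPy and SciPy; the webrtcvad package it imports provides voice activity detection, not echo cancellation: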

# Sketch: delay-and-subtract echo removal using cross-correlation.
# webrtcvad provides voice activity detection only; it does not cancel echo.
from collections import deque

import numpy as np
import webrtcvad
from scipy import signal


class EchoCancellingAudioStream:
    """
    Detects echo in microphone signal and removes it.
    Assumes speaker output is available for comparison.
    """

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.vad = webrtcvad.Vad(2)  # Aggressiveness 0-3; 3 is the most aggressive
        self.speaker_buffer = deque(maxlen=2400)  # 150 ms of samples at 16 kHz

    def estimate_echo(self, microphone_frame, speaker_history):
        """
        Simple cross-correlation to estimate echo delay and magnitude.
        speaker_history should be longer than microphone_frame and end at the same instant.
        Returns (delay_samples, echo_magnitude); the magnitude is a normalized
        correlation between -1 and 1.
        """
        # Convert bytes to numpy arrays
        mic = np.frombuffer(microphone_frame, dtype=np.int16).astype(float)
        spk = np.frombuffer(speaker_history, dtype=np.int16).astype(float)
        if len(spk) < len(mic):
            return 0, 0.0  # not enough speaker history to compare against

        # correlation[k] compares the mic frame against spk[k:k + len(mic)]
        correlation = signal.correlate(spk, mic, mode='valid')
        k = int(np.argmax(correlation))
        segment = spk[k:k + len(mic)]
        magnitude = correlation[k] / (np.linalg.norm(mic) * np.linalg.norm(segment) + 1e-8)

        # Delay is measured back from the end of the speaker history
        delay = len(spk) - len(mic) - k
        return delay, magnitude

    def remove_echo(self, microphone_frame, speaker_history):
        """
        Remove estimated echo from microphone frame using speaker history.
        """
        # Estimate echo parameters
        delay, mag = self.estimate_echo(microphone_frame, speaker_history)

        # If echo magnitude is significant, subtract scaled speaker signal
        if mag > 0.3:  # Threshold: echo is present if correlation > 0.3
            mic = np.frombuffer(microphone_frame, dtype=np.int16).astype(float)
            spk = np.frombuffer(speaker_history, dtype=np.int16).astype(float)

            # Select the speaker segment aligned with the mic frame
            start = len(spk) - len(mic) - delay
            segment = spk[start:start + len(mic)]

            # Least-squares gain: how strongly that segment appears in the mic frame
            gain = np.dot(mic, segment) / (np.dot(segment, segment) + 1e-8)
            cleaned = mic - gain * segment
            return np.clip(cleaned, -32768, 32767).astype(np.int16).tobytes()

        return microphone_frame  # No echo detected
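
The adaptive-filtering strategy listed above can be sketched in a few lines of NumPy. This is a minimal illustration of NLMS under stated assumptions, not the WebRTC implementation: the function name nlms_echo_cancel, the 64-tap filter length, and the step size are all choices made for this example, and there is no double-talk detection.

```python
import numpy as np


def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-6):
    """
    NLMS adaptive echo canceller (illustrative sketch).
    mic: microphone samples as floats (near-end speech plus echo of ref)
    ref: speaker reference samples as floats, same length as mic
    Returns the error signal: mic minus the estimated echo.
    """
    w = np.zeros(taps)   # adaptive filter weights (estimated echo path)
    x = np.zeros(taps)   # most recent reference samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = ref[n]
        echo_estimate = w @ x
        e = mic[n] - echo_estimate
        # Normalized update: step size scales with reference signal power
        w += mu * e * x / (x @ x + eps)
        out[n] = e
    return out


# Synthetic check: the "echo" is the reference delayed by 10 samples and scaled by 0.6
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)
mic = 0.6 * np.concatenate([np.zeros(10), ref[:-10]])
residual = nlms_echo_cancel(mic, ref)
print("echo power:    ", np.mean(mic[-1000:] ** 2))
print("residual power:", np.mean(residual[-1000:] ** 2))
```

Unlike delay-and-subtract, the filter keeps adapting, so it can follow a delay or gain that changes over time. The filter must be long enough to span the echo delay: 64 taps covers only 4 ms at 16 kHz, so a real 50–150 ms echo path needs far more taps or a separate bulk-delay estimate.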

Optimizing Latency in Full-Duplex

Every buffer adds latency. To minimize total end-to-end delay:

  • Chunk size: Use 20 ms chunks (320 samples at 16 kHz). Smaller chunks add overhead; larger chunks add delay.
  • Buffer depth: Keep input and output buffers at 100–200 ms total depth. More buffering improves robustness to network jitter but increases latency.
  • Playback buffering: Begin playing output audio as soon as the first frame arrives (no pre-buffering). Compared with pre-buffering 100–200 ms of audio before playback starts, this cuts TTS-to-ear latency by that same 100–200 ms.
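
The arithmetic behind these numbers is simple enough to check in code. The helper names below are made up for this example, and 16-bit mono PCM is assumed, as in the rest of the article:

```python
def chunk_samples(sample_rate_hz, chunk_ms):
    """Number of samples in one chunk."""
    return sample_rate_hz * chunk_ms // 1000


def chunk_bytes(sample_rate_hz, chunk_ms, bytes_per_sample=2, channels=1):
    """Size of one chunk in bytes (16-bit mono by default)."""
    return chunk_samples(sample_rate_hz, chunk_ms) * bytes_per_sample * channels


def buffer_latency_ms(num_chunks, chunk_ms):
    """Worst-case delay added by a buffer that holds num_chunks chunks."""
    return num_chunks * chunk_ms


print(chunk_samples(16000, 20))   # 320
print(chunk_bytes(16000, 20))     # 640
print(buffer_latency_ms(5, 20))   # 100
```

A buffer of 5 to 10 chunks at 20 ms each therefore corresponds to the 100–200 ms depth recommended above.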

Key Takeaways

  • Full-duplex audio streaming requires independent input and output threads/tasks running concurrently, each managing its own buffer.
  • Use circular buffers for input (to always hold recent audio) and queues for output (to feed TTS frames to the speaker in order).
  • Echo cancellation is mandatory for speaker-near-microphone setups; use WebRTC's module or API-provided solutions.
  • Latency optimization focuses on chunk size (20 ms is standard), buffer depth (100–200 ms), and immediate playback without pre-buffering.
  • Test full-duplex with varied network conditions (high latency, packet loss) to ensure the system remains stable.

Frequently Asked Questions

What is the maximum latency I can tolerate in full-duplex audio?

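Two different latencies matter here. For one-way transmission delay (mouth to ear), most users start perceiving lag around 150 ms, and above 300 ms conversations become noticeably awkward. Response latency has a separate budget: for real-time voice agents, target under 500 ms end-to-end (user stops speaking → STT → LLM → TTS → user hears response). Anything above 1 second feels broken.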

Can I use full-duplex over the internet?

Yes, but network jitter becomes critical. Internet connections introduce variable delays (50–500 ms), which requires larger buffers to prevent audio dropouts. For mission-critical voice (emergency dispatch), use dedicated carrier networks (PSTN, SIP trunks) rather than the public internet.
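
A jitter buffer is the usual way to absorb those variable delays. The sketch below is one minimal way to write it, assuming each frame carries a sequence number assigned by the sender; the class name JitterBuffer and the depth of 3 frames are illustrative choices:

```python
import heapq


class JitterBuffer:
    """Reorders frames by sequence number and releases them once enough are buffered."""

    def __init__(self, depth=3):
        self.depth = depth   # frames held before release (depth * 20 ms of added delay)
        self.heap = []       # min-heap of (seq, frame)
        self.next_seq = 0    # lowest sequence number still acceptable

    def push(self, seq, frame):
        """Accept a frame unless it arrives after its slot has already been played."""
        if seq >= self.next_seq:
            heapq.heappush(self.heap, (seq, frame))

    def pop(self):
        """Return the next frame in order, or None while still buffering."""
        if len(self.heap) < self.depth:
            return None
        seq, frame = heapq.heappop(self.heap)
        self.next_seq = seq + 1
        return frame

    def drain(self):
        """Release everything that is left, in order (for end of stream)."""
        frames = []
        while self.heap:
            frames.append(heapq.heappop(self.heap)[1])
        return frames


jb = JitterBuffer(depth=3)
played = []
for seq in [0, 2, 1, 4, 3]:          # frames arrive out of order
    jb.push(seq, f"frame{seq}")
    frame = jb.pop()
    if frame is not None:
        played.append(frame)
played.extend(jb.drain())
print(played)
```

A deeper buffer tolerates more reordering and jitter but adds that much delay to every frame. This sketch skips over missing frames rather than concealing them, which a production system would need to handle.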

How much CPU does full-duplex audio processing use?

Microphone capture and speaker playback together use 5–15% CPU on a modern processor, depending on sample rate and system load. Echo cancellation adds 10–20%. If you're also running STT/TTS on-device (not cloud-based), expect 30–50% CPU.

What if the output queue backs up?

If the speaker playback thread can't keep up with TTS production (e.g., TTS is faster than real-time), the output queue fills. Implement backpressure: block the TTS producer until the queue has room. This prevents memory from growing unbounded.
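
Here is a small sketch of that backpressure using a bounded asyncio.Queue. It assumes the TTS producer and the playback feeder are both coroutines on the same event loop; the function names and the 1 ms sleep standing in for playback are illustrative. With threads, a bounded queue.Queue behaves the same way because its put() blocks when full.

```python
import asyncio


async def producer(q, n_frames):
    """Stand-in for TTS: produces frames as fast as it can."""
    for i in range(n_frames):
        await q.put(i)  # suspends here while the queue is full


async def consumer(q, n_frames, played, depths):
    """Stand-in for playback: drains one frame at a time, slower than the producer."""
    for _ in range(n_frames):
        depths.append(q.qsize())
        played.append(await q.get())
        await asyncio.sleep(0.001)  # simulated playback time for one frame


async def main():
    q = asyncio.Queue(maxsize=5)
    played, depths = [], []
    await asyncio.gather(producer(q, 20), consumer(q, 20, played, depths))
    return played, depths


played, depths = asyncio.run(main())
print(f"played {len(played)} frames, queue depth never exceeded {max(depths)}")
```

The producer finishes all 20 frames, but it is paced by the consumer, and the queue never holds more than its maxsize.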

How do I test echo cancellation quality?

Record a full-duplex conversation with echo cancellation enabled, then inspect the microphone stream while only the agent is speaking: echo should be attenuated by at least 20 dB (a tenfold reduction in amplitude, or 100x in power). Test with various speaker volumes and distances. The Speex echo cancellation library includes a diagnostic mode that measures effectiveness.
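
If you have recordings of the microphone signal before and after cancellation, the attenuation can be computed directly. This measure is commonly called echo return loss enhancement (ERLE). The function name below is made up for this example, and the measurement is only meaningful over segments where the user is silent, so that the microphone contains echo alone:

```python
import numpy as np


def attenuation_db(before, after, eps=1e-12):
    """Power ratio in dB between the mic signal before and after cancellation."""
    p_before = np.mean(np.square(np.asarray(before, dtype=np.float64)))
    p_after = np.mean(np.square(np.asarray(after, dtype=np.float64)))
    return 10 * np.log10((p_before + eps) / (p_after + eps))


# Synthetic check: a residual at one tenth of the echo amplitude is 20 dB down
echo = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
residual_echo = 0.1 * echo
print(round(attenuation_db(echo, residual_echo), 1))  # 20.0
```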

Further Reading