
Turn-Taking in Conversation: Managing Speaker Switching

Turn-taking is the dance of conversation: when one person speaks, the other listens, then roles reverse. In face-to-face dialogue, turn transitions happen nearly automatically via subtle cues—a speaker trails off, looks at the listener, pauses for breath. Humans navigate these cues subconsciously in ~300 ms. Voice agents must replicate this with explicit state machines and timing rules.

Poor turn-taking leads to unnatural conversations: the agent interrupts the user, or waits awkwardly silent after the user finishes. This article covers how to detect turn boundaries, manage conversation state, and keep transitions smooth and natural.

The Turn-Taking State Machine

A voice agent conversation has discrete states:

  1. Listening: The agent is capturing audio, waiting for user input.
  2. Processing: The user has finished speaking (the VAD detected silence). The agent runs STT, generates the LLM response, and prepares TTS.
  3. Speaking: The agent is playing TTS audio back to the user.
  4. Finishing: The agent's response has completed, and the system resets for the next user turn.

Explicit state transitions prevent confusion:

[Listening] → (VAD silence after speech) → [Processing] → (LLM output ready) → [Speaking]
[Speaking]  → (TTS completes)  → [Finishing] → [Listening]
[Speaking]  → (user interrupts) → [Listening]

Here's a state machine in Python:

import asyncio
from datetime import datetime
from enum import Enum


class ConversationState(Enum):
    """Enum for conversation states."""
    LISTENING = "listening"
    PROCESSING = "processing"
    SPEAKING = "speaking"
    FINISHING = "finishing"


class TurnTakingManager:
    """
    Manages turn-taking state and transitions.
    Coordinates STT, LLM, and TTS to ensure natural speaker switching.
    """

    def __init__(self, vad_detector, stt_client, llm_client, tts_client, audio_output):
        """
        Initialize the turn-taking manager.

        Args:
            vad_detector: VAD instance (from article 3)
            stt_client: Speech-to-text async client
            llm_client: LLM async client (e.g., OpenAI)
            tts_client: Text-to-speech async client
            audio_output: Playback sink with an async queue_output_audio(frame) method
        """
        self.vad = vad_detector
        self.stt = stt_client
        self.llm = llm_client
        self.tts = tts_client
        self.audio_output = audio_output

        self.state = ConversationState.LISTENING
        self.utterance_start = None
        self.state_change_time = datetime.now()
        self.pending_user_audio = bytearray()  # Buffer for current turn
        self.response_text = ""  # Current LLM response
        self.tts_stream = None  # Active TTS stream while SPEAKING
        self._pipeline_task = None  # Background STT -> LLM -> TTS task

        # Metrics
        self.turn_count = 0
        self.total_user_duration = 0.0
        self.total_response_duration = 0.0

    async def on_audio_frame(self, audio_frame):
        """
        Process an incoming audio frame based on current state.

        Args:
            audio_frame: 16-bit PCM audio (20 ms chunk)
        """
        is_speech, should_finalize = self.vad.detect_speech_boundaries(audio_frame)

        if self.state == ConversationState.LISTENING:
            if is_speech:
                # User started speaking
                if self.utterance_start is None:
                    self.utterance_start = datetime.now()
                    print("[Turn] User started speaking")

                # Buffer audio for eventual STT
                self.pending_user_audio.extend(audio_frame)

            elif should_finalize and self.utterance_start is not None:
                # User finished speaking; transition to processing
                await self._transition_to_processing()

        elif self.state == ConversationState.SPEAKING:
            # User interrupted while agent is speaking
            if is_speech:
                print("[Turn] User interrupted agent. Stopping TTS.")
                await self._handle_barge_in()
                if self.state == ConversationState.LISTENING:
                    # Keep the interrupting frame so the new utterance is not clipped
                    self.utterance_start = datetime.now()
                    self.pending_user_audio.extend(audio_frame)

        elif self.state == ConversationState.PROCESSING:
            # Buffer any audio during processing (for future use or logging)
            if is_speech:
                self.pending_user_audio.extend(audio_frame)

    async def _transition_to_processing(self):
        """Transition from LISTENING to PROCESSING and start the response pipeline."""
        if self.state != ConversationState.LISTENING:
            return

        self.state = ConversationState.PROCESSING
        self.state_change_time = datetime.now()
        user_duration_sec = (self.state_change_time - self.utterance_start).total_seconds()
        self.total_user_duration += user_duration_sec
        print(f"[Turn] Transitioning to PROCESSING (user spoke for {user_duration_sec:.1f}s)")

        # Snapshot the utterance, then run STT -> LLM -> TTS as a background task.
        # Awaiting the pipeline here would block on_audio_frame(), and barge-in
        # could never be detected while the agent is speaking.
        user_audio = bytes(self.pending_user_audio)
        self._pipeline_task = asyncio.create_task(self._run_pipeline(user_audio))

    async def _run_pipeline(self, user_audio):
        """Run STT, then the LLM, then hand off to SPEAKING."""
        try:
            # 1. Transcribe user audio. The steps run sequentially here to keep
            # the example simple; overlapping them is covered under Timing.
            transcript = await self.stt.transcribe_async(user_audio)
            print(f"[Turn] User said: {transcript}")

            # 2. Generate LLM response
            llm_response = await self.llm.chat_completion_async(
                messages=[
                    {"role": "system", "content": "You are a helpful voice assistant. Keep responses under 20 words."},
                    {"role": "user", "content": transcript},
                ]
            )
            self.response_text = llm_response.choices[0].message.content
            print(f"[Turn] LLM response: {self.response_text}")

            # 3. Transition to SPEAKING
            await self._transition_to_speaking()

        except Exception as e:
            print(f"[Turn] Error in processing: {e}")
            self._reset_turn()  # Drop stale audio so the next turn starts clean
            self.state = ConversationState.LISTENING

    async def _transition_to_speaking(self):
        """Transition from PROCESSING to SPEAKING state."""
        if self.state != ConversationState.PROCESSING:
            return

        self.state = ConversationState.SPEAKING
        self.state_change_time = datetime.now()
        print("[Turn] Transitioning to SPEAKING")

        # Stream TTS audio
        try:
            self.tts_stream = self.tts.stream_async(self.response_text)
            async for audio_frame in self.tts_stream:
                # on_audio_frame() changes the state on barge-in; stop before
                # queueing any further audio
                if self.state != ConversationState.SPEAKING:
                    break
                await self.audio_output.queue_output_audio(audio_frame)

            # TTS finished naturally (this is a no-op if the user interrupted)
            await self._transition_to_finishing()

        except Exception as e:
            print(f"[Turn] Error in TTS: {e}")
            self._reset_turn()
            self.state = ConversationState.LISTENING
        finally:
            self.tts_stream = None

    async def _transition_to_finishing(self):
        """Transition from SPEAKING to FINISHING (ready for next turn)."""
        if self.state != ConversationState.SPEAKING:
            return

        self.state = ConversationState.FINISHING
        now = datetime.now()
        # state_change_time still holds the moment SPEAKING began
        response_duration = (now - self.state_change_time).total_seconds()
        self.state_change_time = now
        self.total_response_duration += response_duration

        self.turn_count += 1
        print(f"[Turn] Completed turn #{self.turn_count}")

        # Reset for next turn
        self._reset_turn()

        # Transition back to LISTENING
        self.state = ConversationState.LISTENING
        print("[Turn] Ready for next user input")

    async def _handle_barge_in(self):
        """User interrupted agent; stop speaking and go back to listening."""
        print("[Turn] Handling barge-in (user interrupted)")
        # The TTS streaming loop sees the state change and stops queueing audio
        self._reset_turn()
        self.state = ConversationState.LISTENING

    def _reset_turn(self):
        """Clear per-turn buffers."""
        self.pending_user_audio = bytearray()
        self.utterance_start = None
        self.response_text = ""

    def get_metrics(self):
        """Return conversation metrics."""
        return {
            "turns": self.turn_count,
            "avg_user_duration_sec": self.total_user_duration / max(self.turn_count, 1),
            "avg_response_duration_sec": self.total_response_duration / max(self.turn_count, 1),
        }

Handling Overlap and Simultaneous Speech

In natural conversation, brief overlaps happen: you start speaking while the other person is still finishing a thought. Your agent should handle this gracefully:

  • Gentle overlap tolerance (200 ms or less of the response remaining): If the user starts speaking while the agent is in the last 200 ms of a response, let the agent finish instead of cutting it off. This mimics the brief overlaps of natural conversation.
  • Hard interrupt (more than 200 ms remaining): If the user speaks during the bulk of the agent's response, stop immediately and listen.

Implement this by checking how much of the response is left to play:

async def _handle_barge_in(self):
    """
    Tolerate overlap in the tail of the response; stop on a mid-response interrupt.
    """
    # Assumes self.tts_stream is the active TTS stream and that it exposes an
    # estimated_remaining_ms() helper (hypothetical; not every TTS client has one)
    remaining_audio = self.tts_stream.estimated_remaining_ms()

    if remaining_audio > 200:
        # User interrupted in the middle; stop immediately
        print("[Turn] Hard interrupt detected; stopping TTS")
        self.state = ConversationState.LISTENING
        self.pending_user_audio = bytearray()
        self.utterance_start = None
    else:
        # Gentle overlap; let agent finish naturally
        print("[Turn] Gentle overlap in tail audio; continuing")
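A check like this reacts to the very first frame the VAD marks as speech, so a cough or a short "mm-hm" could cut the agent off. One possible refinement, sketched here under assumptions, is to require speech to persist for a minimum time before treating it as an interruption. The class name `BargeInDebouncer` and the 200 ms default are illustrative choices, not part of the code above; the 20 ms frame size matches the frames used earlier.

```python
class BargeInDebouncer:
    """Report a barge-in only after speech has persisted for min_speech_ms."""

    def __init__(self, min_speech_ms: int = 200, frame_ms: int = 20):
        # Number of consecutive speech frames needed to confirm an interruption
        self.frames_needed = max(1, min_speech_ms // frame_ms)
        self.speech_frames = 0

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD decision per frame; return True once speech is sustained."""
        if is_speech:
            self.speech_frames += 1
        else:
            self.speech_frames = 0  # any silent frame resets the run
        return self.speech_frames >= self.frames_needed

    def reset(self) -> None:
        """Call when the agent starts a new response."""
        self.speech_frames = 0


debouncer = BargeInDebouncer(min_speech_ms=200, frame_ms=20)

# A 60 ms blip (3 frames) followed by silence does not trigger.
blip = [debouncer.update(s) for s in [True, True, True, False]]

# Ten consecutive speech frames (200 ms) do.
sustained = [debouncer.update(True) for _ in range(10)]

print(any(blip), sustained[-1])  # False True
```

In the SPEAKING branch of the frame handler, the barge-in handler would then be called only when `update()` returns True. The cost is that a genuine interruption is recognized `min_speech_ms` later.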

Timing: The Critical Path

The critical path of turn-taking is the time from when the user finishes speaking until the agent plays audio back:

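User finishes speaking
  → VAD detects silence            20 ms
  → STT processes                 300 ms
  → LLM generates response        800 ms
  → TTS begins                    100 ms
  → Agent audio starts playing    100 ms
  ─────────────────────────────────────
  Total                        ~1.3 seconds

The 20 ms VAD figure can only cover a single frame's decision. The silence threshold the VAD waits out before declaring the turn over (200–600 ms; see the FAQ below) comes on top of this total.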
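To compare a real pipeline against this budget, each stage can be wrapped in a timer. The following is a minimal sketch using only the standard library; `StageTimer` and the stage names are illustrative, and the `time.sleep` calls merely stand in for real STT and LLM calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations (in ms) for named pipeline stages."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Accumulate, so a stage entered several times is summed
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.durations_ms.values())

    def slowest(self):
        """Name of the stage that took the longest."""
        return max(self.durations_ms, key=self.durations_ms.get)


timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.01)  # stand-in for the real STT call
with timer.stage("llm"):
    time.sleep(0.05)  # stand-in for the real LLM call

print(f"slowest stage: {timer.slowest()}, total: {timer.total_ms():.0f} ms")
```

The same `with timer.stage(...)` block also works around `await` expressions inside a coroutine, since it measures wall-clock time between entering and leaving the block.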

To optimize:

  1. Overlap STT and LLM: With a streaming STT, start sending partial transcripts to the LLM while the final transcript is still being produced.
  2. Streaming TTS: Don't wait for the full TTS audio to be generated. Stream audio frames as they're produced.
  3. Token buffering: For streaming LLMs, send tokens to TTS as soon as they arrive, before the full response is complete.
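In practice, TTS engines usually sound better when given whole sentences rather than single tokens, so a common compromise for step 3 is to buffer tokens only until a sentence boundary appears. The function below is a sketch of that idea under assumptions: `chunk_for_tts` and `min_chars` are hypothetical names, the boundary rule (punctuation followed by whitespace) is deliberately naive, and very short sentences are merged into the next one.

```python
import re
from typing import Iterable, Iterator

# A boundary is whitespace that directly follows ., !, or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable chunks as soon as a sentence boundary is seen."""
    buffer = ""
    for token in tokens:
        buffer += token
        boundaries = list(_SENTENCE_END.finditer(buffer))
        if not boundaries:
            continue
        last = boundaries[-1]
        ready = buffer[:last.start()]  # text up to and including the punctuation
        if len(ready) >= min_chars:
            yield ready
            buffer = buffer[last.end():]  # keep the start of the next sentence
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


llm_tokens = ["Sure", ".", " The", " store", " opens", " at", " nine", ".",
              " Anything", " else", "?"]
chunks = list(chunk_for_tts(llm_tokens))
print(chunks)  # ['Sure. The store opens at nine.', 'Anything else?']
```

Each yielded chunk would be passed to the TTS client immediately, so the first sentence can start playing while the LLM is still generating the rest. A boundary is only recognized once the token after the punctuation arrives, which delays each chunk by one token.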

Key Takeaways

  • Turn-taking is a state machine with four states: LISTENING, PROCESSING, SPEAKING, FINISHING. Explicit transitions prevent race conditions and unnatural overlaps.
  • Coordinate STT, LLM, and TTS to minimize the critical path (user finishes → agent starts speaking). Target under 1.5 seconds.
  • VAD signals turn boundaries; handle false positives (user pauses mid-sentence) with longer silence thresholds.
  • Implement barge-in handling: allow brief overlaps but stop immediately if the user insists on speaking.
  • Log timing metrics to detect bottlenecks: if total latency exceeds 2 seconds, investigate STT or LLM latency.

Frequently Asked Questions

How do I prevent the agent from interrupting natural pauses?

Users often pause mid-sentence (to think or breathe). Increase the VAD silence threshold from 200 ms to 400–600 ms; the trade-off is that the extra wait is added to every response's latency. Additionally, listen for filled pauses like "uh" or "um" (non-terminal vocalizations) that suggest the user isn't done speaking. More sophisticated systems train models to predict end-of-turn rather than relying on silence alone.
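As a rough sketch of the first two suggestions combined, the detector below waits for a fixed amount of silence and waits longer when the last recognized word was a filler. It assumes the caller can supply the last word from a streaming STT's partial transcript; `EndOfTurnDetector`, the filler list, and the millisecond values are illustrative assumptions rather than recommended settings.

```python
FILLERS = {"uh", "um", "er", "hmm"}


class EndOfTurnDetector:
    """Silence-based end-of-turn detection with extra patience after fillers."""

    def __init__(self, silence_ms=500, filler_extra_ms=400, frame_ms=20):
        self.silence_ms = silence_ms
        self.filler_extra_ms = filler_extra_ms
        self.frame_ms = frame_ms
        self.silent_ms = 0
        self.heard_speech = False

    def update(self, is_speech: bool, last_word: str = "") -> bool:
        """Feed one frame; return True when the turn should be finalized."""
        if is_speech:
            self.heard_speech = True
            self.silent_ms = 0
            return False
        if not self.heard_speech:
            return False  # leading silence never ends a turn
        self.silent_ms += self.frame_ms
        threshold = self.silence_ms
        if last_word.lower().strip(".,!?") in FILLERS:
            threshold += self.filler_extra_ms  # user is probably still thinking
        return self.silent_ms >= threshold


def silent_frames_until_end(last_word: str) -> int:
    """Count the silent frames needed to end a turn after one speech frame."""
    detector = EndOfTurnDetector()
    detector.update(True)
    frames = 0
    while True:
        frames += 1
        if detector.update(False, last_word):
            return frames


print(silent_frames_until_end("tomorrow"), silent_frames_until_end("um"))  # 25 45
```

With 20 ms frames, the turn ends after 500 ms of silence normally and after 900 ms when the user trailed off on "um".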

What if STT takes longer than expected?

Implement a timeout: if STT hasn't returned a result after 3–5 seconds, give up and ask the user to repeat. This prevents the agent from getting stuck in the PROCESSING state forever. Additionally, if your STT streams results, send partial transcripts to the LLM while STT is still running, allowing early response generation.
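One way to sketch the timeout is with `asyncio.wait_for` from the standard library. Here `transcribe_with_timeout` is a hypothetical wrapper, `transcribe` is any coroutine function that takes audio bytes, and the two stub STT functions exist only to make the example runnable.

```python
import asyncio


async def transcribe_with_timeout(transcribe, audio: bytes, timeout_s: float = 4.0):
    """Return the transcript, or None if STT does not answer in time."""
    try:
        return await asyncio.wait_for(transcribe(audio), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # caller should ask the user to repeat


async def _demo():
    async def slow_stt(audio):  # stand-in for a stalled STT service
        await asyncio.sleep(1.0)
        return "never returned"

    async def fast_stt(audio):  # stand-in for a healthy STT service
        return "hello"

    slow = await transcribe_with_timeout(slow_stt, b"", timeout_s=0.05)
    fast = await transcribe_with_timeout(fast_stt, b"", timeout_s=0.05)
    return slow, fast


result = asyncio.run(_demo())
print(result)  # (None, 'hello')
```

When the result is None, the manager would return to LISTENING and play a short "Sorry, could you repeat that?" prompt instead of waiting indefinitely.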

Should the agent speak faster to reduce perceived latency?

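Slightly, yes. Increasing TTS speed from 1.0x to 1.1–1.2x still feels natural (slightly energetic) and shortens playback by roughly 90–170 ms per second of audio without sounding rushed. Note that this shortens the response as a whole; it does not make the first audio arrive any sooner. Beyond 1.2x, most users perceive the speech as unnatural.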

How do I test turn-taking without a full voice setup?

Create a test harness that simulates VAD events and LLM responses. Mock the audio I/O and step through the state machine manually. Verify that state transitions follow the expected order and no transitions are skipped.
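A minimal version of such a harness might look like the sketch below. It does not exercise the TurnTakingManager itself; it uses a stripped-down transition table with made-up event names (`end_of_speech`, `response_ready`, `tts_done`, `barge_in`, `reset`) to show how scripted events can be replayed and checked without audio hardware.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    PROCESSING = "processing"
    SPEAKING = "speaking"
    FINISHING = "finishing"


# Allowed transitions, keyed by (current state, event name)
TRANSITIONS = {
    (State.LISTENING, "end_of_speech"): State.PROCESSING,
    (State.PROCESSING, "response_ready"): State.SPEAKING,
    (State.SPEAKING, "tts_done"): State.FINISHING,
    (State.SPEAKING, "barge_in"): State.LISTENING,
    (State.FINISHING, "reset"): State.LISTENING,
}


def run_script(events, start=State.LISTENING):
    """Replay scripted events and return the sequence of states visited."""
    state, visited = start, [start]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal event {event!r} in state {state.name}")
        state = TRANSITIONS[key]
        visited.append(state)
    return visited


full_turn = run_script(["end_of_speech", "response_ready", "tts_done", "reset"])
interrupted = run_script(["end_of_speech", "response_ready", "barge_in"])

print([s.name for s in full_turn])
print([s.name for s in interrupted])
```

An event that is not allowed in the current state, such as `tts_done` while LISTENING, raises a `ValueError`, which is how a skipped or out-of-order transition shows up in a test run.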

Can I have multiple agents listening simultaneously (e.g., a panel discussion)?

Yes, but you need speaker diarization (a model that identifies who is speaking) and a turn arbiter (logic to decide whose turn it is). This is significantly more complex than single-agent turn-taking and is beyond the scope of this article.

Further Reading