Speaker Diarization: Who Spoke When
Speaker diarization answers the question "who spoke when?" by partitioning audio into segments and labeling each segment with the speaker it belongs to (Speaker 1, Speaker 2, etc.). Unlike speaker recognition, which identifies a specific known speaker, diarization is at its core an unsupervised clustering problem: the system has no prior knowledge of the speakers and discovers the groups automatically. Diarization is essential for meeting transcripts, interview analysis, and dialogue applications, where it improves readability and enables downstream speaker-specific processing. Current toolkits such as pyannote and NVIDIA NeMo are commonly cited at roughly 95–99% accuracy on clean two-speaker audio and 85–95% on multi-speaker or noisy recordings such as Zoom calls, though results vary with the dataset and the metric used.
How Speaker Diarization Works: Embedding and Clustering
Speaker diarization pipelines follow a consistent architecture: (1) detect speech activity (remove silence), (2) extract speaker embeddings (neural features capturing speaker identity), (3) cluster embeddings by similarity, and (4) map clusters back to time. Speaker embeddings are fixed-length vectors (typically 256–512 dimensions) that capture speaker-specific characteristics like pitch, formants, and vocal tract resonance. Two embeddings from the same speaker should be close in vector space; embeddings from different speakers should be far apart. Clustering algorithms like spectral clustering or agglomerative hierarchical clustering then group embeddings by distance.
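To make "close in vector space" concrete, the sketch below compares embeddings using cosine distance. The vectors are synthetic stand-ins (a random "voice" direction plus a little noise), not the output of a real embedding model, and the helper name `cosine_distance` is only illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 0 for the same direction, up to 2 for opposite directions."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 256

# Synthetic stand-ins: each speaker is a random direction; each window adds small noise.
speaker_a = rng.normal(size=dim)
speaker_b = rng.normal(size=dim)
a_window_1 = speaker_a + 0.1 * rng.normal(size=dim)
a_window_2 = speaker_a + 0.1 * rng.normal(size=dim)
b_window_1 = speaker_b + 0.1 * rng.normal(size=dim)

same_speaker = cosine_distance(a_window_1, a_window_2)
different_speaker = cosine_distance(a_window_1, b_window_1)
print(f"same speaker: {same_speaker:.3f}, different speakers: {different_speaker:.3f}")
```

With real embeddings the gap is usually less clean than in this toy setup, which is why the choice of clustering threshold or cluster count matters.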
The simplest approach: divide audio into overlapping 500–2,000 ms windows, extract one embedding per window, cluster embeddings, and interpolate cluster assignments back to the original time axis. More sophisticated methods use speaker change detection to place cluster boundaries at natural speaker transitions, reducing false splits.
# Conceptual speaker diarization pipeline
import librosa
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def diarize_audio(audio_path, num_speakers=None):
    """
    Diarize audio: partition into speaker segments.
    Returns: list of (start_time_ms, end_time_ms, speaker_label)
    """
    # Step 1: Load and preprocess audio (mono, resampled to 16 kHz)
    y, sr = librosa.load(audio_path, sr=16000)
    duration_ms = int(len(y) / sr * 1000)

    # Step 2: Extract one speaker embedding per 500 ms window
    window_ms = 500
    hop_ms = 500  # equal to window_ms, so these windows do not overlap
    embeddings = []
    times = []
    for start_ms in range(0, duration_ms - window_ms + 1, hop_ms):
        start_sample = int(start_ms * sr / 1000)
        end_sample = int((start_ms + window_ms) * sr / 1000)
        window = y[start_sample:end_sample]
        # In practice, use a pretrained speaker embedding model
        # (e.g., pyannote, Nvidia NeMo, or SpeechBrain)
        embedding = extract_embedding(window, sr)  # Returns 256-dim vector
        embeddings.append(embedding)
        times.append(start_ms)
    if len(embeddings) < 2:
        # Too little audio to cluster; treat it as a single speaker
        return [(0, duration_ms, "Speaker_1")]
    embeddings = np.array(embeddings)

    # Step 3: Cluster embeddings
    if num_speakers is None:
        # Auto-detect number of speakers using silhouette score
        num_speakers = estimate_num_speakers(embeddings)
    clusterer = AgglomerativeClustering(
        n_clusters=num_speakers,
        linkage='complete',
        metric='cosine'  # scikit-learn >= 1.2; older versions call this `affinity`
    )
    labels = clusterer.fit_predict(embeddings)

    # Step 4: Merge consecutive windows with the same label into segments
    diarization = []
    current_label = labels[0]
    segment_start = 0
    for i in range(1, len(labels)):
        if labels[i] != current_label:
            # Speaker change detected
            diarization.append((
                segment_start,
                times[i],
                f"Speaker_{current_label + 1}"
            ))
            segment_start = times[i]
            current_label = labels[i]
    # Add final segment
    diarization.append((segment_start, duration_ms, f"Speaker_{current_label + 1}"))
    return diarization

def estimate_num_speakers(embeddings, max_speakers=6):
    """Simple heuristic: pick the cluster count (2 or more) with the best silhouette score."""
    best_k, best_score = 1, -1.0
    for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
        labels = AgglomerativeClustering(
            n_clusters=k, linkage='complete', metric='cosine'
        ).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels, metric='cosine')
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def extract_embedding(audio, sr):
    """Stub: Use pyannote or similar to extract speaker embeddings."""
    return np.random.rand(256)  # Placeholder
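The pipeline above steps through the audio in back-to-back 500 ms windows. The overlapping variant mentioned earlier can be sketched as a small helper that only computes window boundaries. The function name `sliding_windows` and the default window and hop sizes are assumptions chosen for illustration, not values prescribed by any library.

```python
def sliding_windows(duration_ms, window_ms=1500, hop_ms=750):
    """Return (start_ms, end_ms) pairs covering the audio with overlapping windows."""
    if duration_ms <= 0:
        return []
    if duration_ms <= window_ms:
        # Audio shorter than one window: use it whole
        return [(0, duration_ms)]
    windows = [
        (start, start + window_ms)
        for start in range(0, duration_ms - window_ms + 1, hop_ms)
    ]
    # Add a final window flush with the end if the regular hops stopped short of it
    if windows[-1][1] < duration_ms:
        windows.append((duration_ms - window_ms, duration_ms))
    return windows

print(sliding_windows(4000))
# [(0, 1500), (750, 2250), (1500, 3000), (2250, 3750), (2500, 4000)]
```

With a hop of half the window length, most of the audio is covered by two windows, which gives the clustering step more evidence near speaker boundaries at the cost of roughly twice as many embeddings.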
The quality of embeddings is critical: they must be robust to accent, emotion, background noise, and audio quality while remaining sensitive to speaker identity. Pretrained models trained on large multi-speaker corpora generally far outperform embeddings trained from scratch on small in-house datasets.
Pyannote: State-of-the-Art Diarization
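Pyannote is a widely used open-source toolkit for speaker diarization, with reported accuracy near 99% on clean speech and 85–95% on real-world recordings. After installing it (pip install pyannote.audio) and accepting the model's user conditions on Hugging Face, you can load and run the pretrained pipeline in a few lines:
from pyannote.audio import Pipeline
import torch

# Initialize (downloads pretrained model on first run).
# The model is gated on Hugging Face, so pass your own access token.
# (pyannote.audio 3.x; newer releases may name this argument `token`.)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)
# Run on GPU when one is available
pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

# Diarize audio
diarization = pipeline("meeting_audio.wav")

# Iterate over segments
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}s - {turn.end:.2f}s: {speaker}")
# Example output:
# 0.00s - 2.34s: SPEAKER_00
# 2.35s - 5.12s: SPEAKER_01
# 5.13s - 8.00s: SPEAKER_00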
The pipeline returns a pyannote.core Annotation object. Overlapping speech shows up as turns from different speakers that overlap in time. You can iterate over the speaker turns or merge them with a transcript:
from pyannote.audio import Pipeline
from openai import OpenAI

# Get diarization
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder for your Hugging Face token
)
diarization = pipeline("meeting.wav")

# Get transcript with word-level timestamps from Whisper
client = OpenAI()
with open("meeting.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"],  # needed for transcript.words
    )

# Align speakers with words
words = transcript.words  # Items expose .word, .start, .end
speaker_segments = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

# Assign a speaker to each word based on the word's start time
for word_info in words:
    word_start = word_info.start
    assigned_speaker = None  # stays None if the word falls outside every segment
    for seg_start, seg_end, speaker in speaker_segments:
        if seg_start <= word_start < seg_end:
            assigned_speaker = speaker
            break
    print(f"{assigned_speaker}: {word_info.word}")
# Example output:
# SPEAKER_00: Hello
# SPEAKER_01: Hi
# SPEAKER_00: How
# SPEAKER_00: are
# SPEAKER_01: you
This pattern produces a fully annotated transcript with speaker labels and timing, ideal for meeting minutes and dialogue analysis.
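For meeting minutes, one line per word is rarely what you want. As a minimal sketch, consecutive words from the same speaker can be collapsed into turns. The helper name `group_words_into_turns` and the sample labels are illustrative assumptions; the input is simply the (speaker, word) pairs produced by the alignment loop.

```python
def group_words_into_turns(labeled_words):
    """Collapse consecutive (speaker, word) pairs into (speaker, text) turns."""
    turns = []
    for speaker, word in labeled_words:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn
            turns[-1][1].append(word)
        else:
            # Speaker changed (or first word): start a new turn
            turns.append((speaker, [word]))
    return [(speaker, " ".join(words)) for speaker, words in turns]

labeled = [
    ("SPEAKER_00", "Hello"),
    ("SPEAKER_01", "Hi"),
    ("SPEAKER_00", "How"),
    ("SPEAKER_00", "are"),
    ("SPEAKER_00", "you"),
]
for speaker, text in group_words_into_turns(labeled):
    print(f"{speaker}: {text}")
# SPEAKER_00: Hello
# SPEAKER_01: Hi
# SPEAKER_00: How are you
```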
Handling Edge Cases: Overlapping Speech and Speaker Changes
Real-world audio often has overlapping speakers (simultaneous speech) and rapid speaker changes. Pyannote represents overlap as turns from different speakers that overlap in time, so overlap can be detected by comparing turns against each other:
# Overlapping speech handling
turns = [(turn, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]
for turn, speaker in turns:
    # A turn is in overlap if any turn from another speaker intersects it in time
    in_overlap = any(
        other_speaker != speaker and turn.start < other.end and other.start < turn.end
        for other, other_speaker in turns
    )
    if turn.duration < 0.1:
        print(f"SKIPPED (too short): {turn} - {speaker}")
    elif in_overlap:
        print(f"OVERLAP: {turn.start:.2f}s - {turn.end:.2f}s: {speaker}")
    else:
        print(f"SINGLE: {turn.start:.2f}s - {turn.end:.2f}s: {speaker}")

# To flatten overlaps into one speaker at a time, fold each overlapping
# segment into the segment that started first (the later label is dropped)
def merge_overlapping_speakers(diarization, min_overlap_ms=200):
    """Merge overlapping segments from different speakers."""
    tracks = sorted(
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    )
    merged = []
    for i, (start, end, speaker) in enumerate(tracks):
        if i == 0:
            merged.append((start, end, speaker))
            continue
        prev_start, prev_end, prev_speaker = merged[-1]
        overlap = min(end, prev_end) - max(start, prev_start)
        if overlap > min_overlap_ms / 1000:
            # Overlapping; merge into previous segment
            merged[-1] = (prev_start, max(end, prev_end), prev_speaker)
        else:
            merged.append((start, end, speaker))
    return merged
For rapid speaker changes (< 100 ms), consider merging short segments or manually reviewing boundaries. Some pyannote pipelines also expose hyperparameters such as min_duration_on and min_duration_off, which drop very short speech segments and fill very short gaps; which of these are available depends on the pipeline version.
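Merging short segments can also be done as a post-processing step on plain tuples. The sketch below is one possible approach under stated assumptions: segments are chronologically sorted and non-overlapping, silences between segments are ignored, and the helper name `absorb_short_segments` and the 0.1 s threshold are illustrative.

```python
def absorb_short_segments(segments, min_duration_s=0.1):
    """Fold very short segments into the preceding one, then join same-speaker neighbors.

    segments: chronologically sorted, non-overlapping (start_s, end_s, speaker) tuples.
    """
    smoothed = []
    for start, end, speaker in segments:
        if smoothed and (end - start) < min_duration_s:
            # Too short to trust: extend the previous segment over it
            prev_start, _, prev_speaker = smoothed[-1]
            smoothed[-1] = (prev_start, end, prev_speaker)
        elif smoothed and smoothed[-1][2] == speaker:
            # Same speaker as the previous segment: join them
            prev_start, _, _ = smoothed[-1]
            smoothed[-1] = (prev_start, end, speaker)
        else:
            smoothed.append((start, end, speaker))
    return smoothed

segments = [(0.0, 2.0, "A"), (2.0, 2.05, "B"), (2.05, 4.0, "A"), (4.0, 6.0, "B")]
print(absorb_short_segments(segments))
# [(0.0, 4.0, 'A'), (4.0, 6.0, 'B')]
```

The 50 ms "B" blip is absorbed, and the two "A" segments around it become one turn. A short segment at the very start is kept as-is because there is nothing before it to absorb it.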
Multi-Speaker Diarization: When Pyannote Fails
Pyannote performs best on 2–4 speakers. On highly multi-speaker audio (e.g., conference calls with 10+ participants), accuracy degrades. Strategies:
- Pre-cluster by acoustic characteristics (pitch, energy) to reduce confusion.
- Use speaker enrollment: If you have short audio samples from known speakers, compute their embeddings and assign transcript segments to nearest neighbors.
- Manual review: Difficult audio (overlapping speech, many speakers, accents) benefits from human correction.
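# Speaker enrollment: known speakers
import numpy as np
from pyannote.audio import Inference, Model, Pipeline
from scipy.spatial.distance import cosine

HF_TOKEN = "YOUR_HF_TOKEN"  # placeholder for your Hugging Face token
enrollment_audios = {
    "alice": "alice_sample.wav",
    "bob": "bob_sample.wav"
}
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN
)
# Standalone embedding model; "pyannote/embedding" is one option (also gated)
embedding_model = Model.from_pretrained("pyannote/embedding", use_auth_token=HF_TOKEN)
inference = Inference(embedding_model, window="whole")

# Compute speaker embeddings (one vector per enrolled speaker)
speaker_embeddings = {}
for name, audio_path in enrollment_audios.items():
    speaker_embeddings[name] = np.ravel(inference(audio_path))

# For each diarization segment, find nearest known speaker
call_audio = "multi_speaker_call.wav"
diarization = pipeline(call_audio)
for turn, _, cluster in diarization.itertracks(yield_label=True):
    if turn.duration < 0.5:
        continue  # very short turns give unreliable embeddings
    # Extract embedding for just this segment of the call
    segment_embedding = np.ravel(inference.crop(call_audio, turn))
    # Find nearest known speaker
    best_speaker = min(
        speaker_embeddings.items(),
        key=lambda x: cosine(segment_embedding, x[1])
    )
    print(f"{turn.start:.2f}s - {turn.end:.2f}s: {best_speaker[0]}")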
This pattern is useful for known-participant meetings; for truly unknown speakers, pure clustering remains the only option.
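Nearest-neighbor matching always returns some enrolled name, even for a participant who was never enrolled. One hedged refinement is to match per cluster rather than per segment and reject matches beyond a distance threshold. The sketch below uses synthetic vectors in place of real embeddings; the helper name `name_clusters` and the 0.5 threshold are assumptions that would need tuning on real data.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def name_clusters(cluster_embeddings, enrolled, max_distance=0.5):
    """Map each cluster label to the nearest enrolled name, or 'unknown'.

    cluster_embeddings: {cluster_label: [embedding, ...]}
    enrolled: {name: embedding}
    """
    names = {}
    for label, vectors in cluster_embeddings.items():
        centroid = np.mean(vectors, axis=0)  # average over the cluster's segments
        best_name, best_dist = min(
            ((name, cosine_distance(centroid, emb)) for name, emb in enrolled.items()),
            key=lambda pair: pair[1],
        )
        names[label] = best_name if best_dist <= max_distance else "unknown"
    return names

# Synthetic stand-ins for real embeddings: one random direction per voice
rng = np.random.default_rng(0)
dim = 256
alice, bob, carol = (rng.normal(size=dim) for _ in range(3))

def noisy(voice, n):
    return [voice + 0.1 * rng.normal(size=dim) for _ in range(n)]

enrolled = {"alice": noisy(alice, 1)[0], "bob": noisy(bob, 1)[0]}
clusters = {
    "SPEAKER_00": noisy(bob, 5),
    "SPEAKER_01": noisy(alice, 5),
    "SPEAKER_02": noisy(carol, 5),  # carol was never enrolled
}
names = name_clusters(clusters, enrolled)
print(names)
```

Averaging over a cluster smooths out noisy individual segments, and the threshold keeps unenrolled participants from being silently assigned to the closest known name.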
Key Takeaways
- Speaker diarization partitions audio into speaker segments using embedding extraction and clustering, with reported accuracy of 95–99% on clean speech and 85–95% on real-world audio.
- Pyannote is state-of-the-art and production-ready; use Pipeline.from_pretrained() to load and diarize in seconds.
- Merge diarization with Whisper transcripts to produce fully annotated speaker-labeled transcripts.
- Overlapping speech is represented as multiple speakers; merge short segments or use speaker enrollment for multi-speaker audio.
- Real-world audio (meetings, calls) requires manual review of diarization boundaries and speaker merging.
Frequently Asked Questions
How accurate is speaker diarization on real-world audio?
On clean broadcast speech, Pyannote's diarization error rate (DER) is reported at roughly 1–5%, which corresponds to the 95–99% accuracy quoted earlier. On real-world Zoom calls or conferences with background noise and overlapping speech, DER is typically 10–20% (roughly 80–90% correct). Accuracy degrades with more speakers (beyond 4) and lower audio quality.
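DER is conventionally defined as the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The sketch below works through that arithmetic; the durations are hypothetical numbers chosen for illustration, not measurements.

```python
def diarization_error_rate(false_alarm_s, missed_s, confusion_s, total_speech_s):
    """DER = (false alarm + missed speech + speaker confusion) / total reference speech."""
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (false_alarm_s + missed_s + confusion_s) / total_speech_s

# Hypothetical meeting with 480 s of reference speech
der = diarization_error_rate(
    false_alarm_s=12.0,   # non-speech labeled as speech
    missed_s=18.0,        # speech the system did not detect
    confusion_s=30.0,     # speech attributed to the wrong speaker
    total_speech_s=480.0,
)
print(f"DER = {der:.1%}")
# DER = 12.5%
```

Because false alarms count against the system, DER can exceed 100%, so "accuracy = 1 - DER" is only a rough reading.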
Can speaker diarization work with only two speakers?
Yes, and it works very well: two-speaker audio is typically the easiest case (95–99% accuracy), because clustering is simpler with fewer clusters. Many applications (interviews, podcasts) use two-speaker audio.
Do I need to specify the number of speakers in advance?
Not necessarily. Pyannote estimates the number of speakers automatically as part of its clustering step. If you know the count, pass it with the num_speakers parameter; if you only know a range, pass min_speakers and max_speakers instead. Auto-detection works for 1–6 speakers; beyond that, provide a hint.
How do I align diarization with word-level timestamps from Whisper?
Take the speaker segments from the diarization output and the word timestamps from Whisper, then assign each word to the speaker segment it overlaps with. The earlier code example does this in its simplest form: iterate over the words, find the speaker segment that contains each word's start time, and label the word accordingly.
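Matching on start time alone can mislabel a word that begins just before a speaker change. A minimal alternative, sketched here with an illustrative helper name `assign_speaker`, picks the segment with the largest time overlap instead:

```python
def assign_speaker(word_start, word_end, speaker_segments):
    """Return the speaker whose segment overlaps the word the most, or None."""
    best_speaker, best_overlap = None, 0.0
    for seg_start, seg_end, speaker in speaker_segments:
        # Length of the intersection; negative when the intervals are disjoint
        overlap = min(word_end, seg_end) - max(word_start, seg_start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    return best_speaker

segments = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
print(assign_speaker(1.9, 2.6, segments))  # SPEAKER_01 (most of the word is in the second segment)
print(assign_speaker(5.5, 5.8, segments))  # None (no segment covers this word)
```

The first word starts at 1.9 s, inside SPEAKER_00's segment, so the start-time rule would assign it there even though most of the word falls in SPEAKER_01's turn.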
Can speaker diarization identify who specifically is speaking (not just Speaker 1 vs Speaker 2)?
Diarization alone cannot. To identify speakers by name, use speaker identification (1-to-N matching against a set of known speakers) or speaker verification (1-to-1 confirmation of a claimed identity). Compute embeddings from enrollment audio for the known speakers, then match each diarization cluster to the nearest enrolled speaker.
Further Reading
- Pyannote Speaker Diarization Documentation — Official repository and tutorial.
- Speaker Diarization Index (SPK.Ninja) — Benchmarks and dataset links.
- End-to-End Speaker Diarization with Attention-Based Fully Convolutional Networks — Foundational diarization research.
- SpeechBrain Speaker Embeddings — Alternative embedding models.