Timestamp Alignment: Word-Level Timing
Forced alignment takes a known transcript and computes start and end times for every word in the audio. Unlike ASR, which has to work out what was said, alignment assumes the transcript is correct and only finds where each word occurs. It is essential for subtitles, interactive transcripts, and speaker-specific segment extraction. Tools like Montreal Forced Aligner (MFA) and Whisper's built-in word timestamps typically place word boundaries within 50–200 ms, while commercial tools (Google, Azure) and fine-tuned models can reach 20–50 ms error. Alignment is faster and more accurate than ASR alone because the known text acts as a strong constraint on the search.
How Forced Alignment Works: Dynamic Time Warping and Beam Search
Forced alignment maps acoustic features extracted from the audio onto the words of the transcript using dynamic programming. The core algorithm resembles Viterbi decoding in HMMs: for each time frame and each transcript unit (a word, or more commonly a phoneme), compute how likely it is that the frame belongs to that unit. Then find the path through the resulting frame-by-unit matrix with the best total score (equivalently, the lowest total cost) while preserving the transcript order. The path yields the time boundaries where each word begins and ends.
The simplest approach uses a phonetic lexicon: convert the transcript to phoneme sequences, extract acoustic features from the audio, and use a phoneme-to-audio alignment algorithm to place boundaries. Montreal Forced Aligner (MFA) implements this and is the community standard.
# Conceptual forced alignment skeleton (Viterbi-style dynamic programming).
# The helper functions at the bottom are stubs, so this returns an empty list as written.
import librosa
import numpy as np

def align_transcript_to_audio(audio_path, transcript_words, sr=16000, hop_length=512):
    """
    Align transcript words to audio timestamps.
    Returns: list of (word, start_time_seconds, end_time_seconds)
    """
    # Load audio and extract log-mel features
    y, sr = librosa.load(audio_path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length)
    S_db = librosa.power_to_db(S, ref=np.max)
    # Convert words to phoneme sequences (using a lexicon)
    phoneme_sequences = [word_to_phonemes(w) for w in transcript_words]
    all_phonemes = [p for seq in phoneme_sequences for p in seq]
    # For each time frame, score each phoneme (acoustic model)
    # This is a stub: in practice, use a pretrained phoneme classifier
    num_frames = S_db.shape[1]
    num_phonemes = len(set(all_phonemes))
    # Placeholder: random values standing in for per-frame phoneme posteriors
    acoustic_scores = np.random.rand(num_frames, num_phonemes)
    # Viterbi alignment: find best path through (phoneme, time) trellis
    path = viterbi_align(acoustic_scores, phoneme_sequences)
    # Convert frame indices to time and extract word boundaries
    alignments = []
    frame_duration = hop_length / sr  # seconds between consecutive frames
    for word, start_frame, end_frame in extract_word_boundaries(path):
        start_time = start_frame * frame_duration
        end_time = end_frame * frame_duration
        alignments.append((word, start_time, end_time))
    return alignments

def word_to_phonemes(word):
    """Convert word to phonemes (stub)."""
    # In practice, use a pronunciation lexicon like CMU dict
    return list(word.upper())  # Placeholder: each letter is a "phoneme"

def viterbi_align(acoustic_scores, word_phoneme_sequences):
    """Stub: Implement Viterbi alignment."""
    return []

def extract_word_boundaries(path):
    """Stub: Extract word-level boundaries from alignment path."""
    return []
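The stubs above leave the dynamic program itself unwritten. As a minimal sketch of what a function like viterbi_align might do, the following aligns per-frame log-probabilities to a fixed token sequence, letting each frame either stay in the current token or advance to the next one. The name forced_align, the synthetic scores, and the 512-sample hop at 16 kHz are illustrative assumptions, and this is not how MFA itself is implemented.

```python
import numpy as np

def forced_align(log_probs, token_ids):
    """Monotonic Viterbi alignment of frames to a fixed token sequence.

    log_probs: (num_frames, num_classes) array of per-frame log-probabilities.
    token_ids: class index of each token, in transcript order.
    Returns a list of (token_position, start_frame, end_frame_exclusive).
    """
    T, N = log_probs.shape[0], len(token_ids)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per token")
    emit = log_probs[:, token_ids]              # (T, N): score of token j at frame t
    score = np.full((T, N), -np.inf)
    advanced = np.zeros((T, N), dtype=bool)     # True if the best path entered j from j-1
    score[0, 0] = emit[0, 0]                    # the path must start in the first token
    for t in range(1, T):
        stay = score[t - 1]
        advance = np.concatenate(([-np.inf], score[t - 1, :-1]))
        advanced[t] = advance > stay
        score[t] = np.maximum(stay, advance) + emit[t]
    # Backtrace from the last token at the last frame
    frame_to_token = np.empty(T, dtype=int)
    j = N - 1
    for t in range(T - 1, -1, -1):
        frame_to_token[t] = j
        if advanced[t, j]:
            j -= 1
    # Collapse per-frame labels into contiguous segments
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or frame_to_token[t] != frame_to_token[start]:
            segments.append((int(frame_to_token[start]), start, t))
            start = t
    return segments

# Synthetic example: 6 frames, 3 classes, each frame strongly favors one class
frame_classes = [0, 0, 1, 1, 1, 2]
probs = np.full((6, 3), 0.05)
probs[np.arange(6), frame_classes] = 0.9
segments = forced_align(np.log(probs), token_ids=[0, 1, 2])
# segments == [(0, 0, 2), (1, 2, 5), (2, 5, 6)]

frame_duration = 512 / 16000  # assumed hop length / sample rate
for position, start_frame, end_frame in segments:
    print(position, start_frame * frame_duration, end_frame * frame_duration)
```

Word boundaries follow by grouping consecutive phoneme segments that belong to the same word. A production aligner would also model optional silence between words, which this sketch omits.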
Montreal Forced Aligner handles this automatically and achieves 50–100 ms accuracy on English speech with common accents.
Montreal Forced Aligner: Installation and Usage
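MFA is the standard tool for forced alignment. It depends on Kaldi binaries, so the MFA documentation recommends installing it from conda-forge rather than with pip:
conda install -c conda-forge montreal-forced-aligner
# Download a pretrained English acoustic model and pronunciation dictionary
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
# Optional: G2P model for generating pronunciations of out-of-vocabulary words
mfa model download g2p english_us_arpa
Then align audio and transcripts:
# Prepare input: a single corpus directory in which each audio file
# has a transcript with the same base name
# corpus/
#   meeting.wav
#   meeting.txt (plain-text transcript of meeting.wav)
# Arguments: corpus directory, dictionary, acoustic model, output directory
mfa align \
  corpus \
  english_us_arpa \
  english_us_arpa \
  output_dir
# Output: TextGrid files with word-level and phone-level alignments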
MFA produces TextGrid files (Praat format) containing alignment boundaries. Parse them in Python:
import textgrid

def read_mfa_output(textgrid_path):
    """Read MFA alignment and extract word-level timestamps."""
    tg = textgrid.TextGrid.fromFile(textgrid_path)
    # MFA produces a two-tier TextGrid: words and phones
    word_tier = tg.getFirst("words")
    alignments = []
    for interval in word_tier:
        if interval.mark:  # Skip empty intervals (silence)
            alignments.append({
                "word": interval.mark,
                "start": interval.minTime,
                "end": interval.maxTime
            })
    return alignments

# Read and display
alignments = read_mfa_output("output_dir/meeting.TextGrid")
for align in alignments[:10]:
    print(f"{align['start']:.2f}s - {align['end']:.2f}s: {align['word']}")
# Example output (illustrative):
# 0.45s - 0.68s: Hello
# 0.68s - 1.02s: everyone
# 1.02s - 1.35s: welcome
MFA requires phonetic resources (acoustic models and grapheme-to-phoneme rules) for each language. English is well-supported; other languages have varying quality. MFA is free, open-source, and can be run on-device for privacy.
Whisper's Built-In Alignment
OpenAI's Whisper API returns word-level timestamps when you request the verbose_json response format together with timestamp_granularities=["word"]; without that parameter only segment-level timestamps are returned. These timestamps are often accurate enough for captions without additional alignment:
import json
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]  # required for response.words
    )

# Extract word-level timestamps
for word_info in response.words:
    print(f"{word_info.start:.2f}s - {word_info.end:.2f}s: {word_info.word}")
# Example output (illustrative):
# 0.00s - 0.32s: Good
# 0.32s - 0.45s: morning
# 0.45s - 1.02s: everyone

# Save as JSON
with open("transcript_with_timestamps.json", "w") as f:
    json.dump([{
        "word": w.word,
        "start": w.start,
        "end": w.end
    } for w in response.words], f)
Whisper's timestamps are typically accurate within 100–300 ms. For higher precision, use MFA on the Whisper transcript.
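To hand a Whisper transcript to MFA, the words need to be written out as a transcript file next to the audio. MFA pairs each audio file with a transcript of the same base name inside one corpus directory. The sketch below shows one way to prepare that directory; write_mfa_corpus is a hypothetical helper name, and it assumes the audio is already in a format MFA can read, such as WAV.

```python
from pathlib import Path
import shutil

def write_mfa_corpus(words, audio_path, corpus_dir):
    """Copy the audio into corpus_dir and write a same-named .txt transcript.

    words: iterable of word strings (e.g. [w.word for w in response.words]).
    Returns the path of the transcript file that was written.
    """
    audio_path = Path(audio_path)
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    # MFA matches audio and transcript by base name within the corpus directory
    shutil.copy2(audio_path, corpus_dir / audio_path.name)
    text = " ".join(w.strip() for w in words if w.strip())
    transcript_path = corpus_dir / (audio_path.stem + ".txt")
    transcript_path.write_text(text + "\n", encoding="utf-8")
    return transcript_path
```

After this step, the corpus directory can be passed to mfa align as its first argument.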
Combining Alignment with Diarization for Speaker-Timestamped Transcripts
You can combine word-level alignment with speaker diarization to produce transcripts with both speaker labels and timestamps:
from pyannote.audio import Pipeline
from openai import OpenAI
import json

# Get diarization
# Note: this model is gated on Hugging Face; accept its terms and supply an
# access token as described in the pyannote.audio documentation
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = diarization_pipeline("meeting.wav")

# Get transcript with word-level timestamps
client = OpenAI()
with open("meeting.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"]  # required for transcript.words
    )

# Get speaker segments
speaker_segments = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    speaker_segments.append({
        "start": turn.start,
        "end": turn.end,
        "speaker": speaker
    })

# Align speakers with words
output = []
for word_info in transcript.words:
    word_start = word_info.start
    speaker = "Unknown"
    # Find the speaker segment containing the start of this word
    for seg in speaker_segments:
        if seg["start"] <= word_start < seg["end"]:
            speaker = seg["speaker"]
            break
    output.append({
        "word": word_info.word,
        "start": word_info.start,
        "end": word_info.end,
        "speaker": speaker
    })

# Save
with open("meeting_transcript.json", "w") as f:
    json.dump(output, f, indent=2)

# Print sample
for item in output[:20]:
    print(f"[{item['start']:.2f}s] {item['speaker']}: {item['word']}")
This produces a fully annotated transcript: each word has a speaker, start time, and end time, ideal for generating VTT subtitles or interactive transcripts.
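Matching on the word's start time can mislabel words that straddle a speaker change. One possible refinement, sketched here under the assumption that segments are dicts with "start", "end", and "speaker" keys as in the code above, is to pick the speaker whose segment overlaps the word the most. The helper name assign_speaker is hypothetical.

```python
def assign_speaker(word_start, word_end, speaker_segments, default="Unknown"):
    """Return the speaker whose segment has the largest time overlap with the word."""
    best_speaker, best_overlap = default, 0.0
    for seg in speaker_segments:
        # Overlap of [word_start, word_end] with [seg start, seg end]; negative if disjoint
        overlap = min(word_end, seg["end"]) - max(word_start, seg["start"])
        if overlap > best_overlap:
            best_speaker, best_overlap = seg["speaker"], overlap
    return best_speaker

segments = [
    {"start": 0.0, "end": 1.0, "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 2.0, "speaker": "SPEAKER_01"},
]
# A word spanning 0.9 s to 1.4 s starts in SPEAKER_00's turn
# but lies mostly inside SPEAKER_01's turn
print(assign_speaker(0.9, 1.4, segments))  # SPEAKER_01
```

Words that overlap no segment at all keep the default label, which mirrors the "Unknown" fallback used above.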
Converting Aligned Transcripts to Subtitles (VTT/SRT)
WebVTT (.vtt) and SubRip (.srt) are standard subtitle formats. Generate them from aligned timestamps:
import json

def fmt_time(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def generate_vtt(words_with_speakers, output_file, max_chunk_seconds=6):
    """Generate WebVTT subtitle file from aligned words."""
    # WebVTT files must be UTF-8 encoded
    with open(output_file, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        i = 0
        while i < len(words_with_speakers):
            # Group words into ~6 second chunks for readability,
            # and start a new cue whenever the speaker changes
            chunk_start = words_with_speakers[i]["start"]
            chunk_end = chunk_start + max_chunk_seconds
            speaker = words_with_speakers[i]["speaker"]
            chunk_words = []
            while (i < len(words_with_speakers)
                   and words_with_speakers[i]["start"] < chunk_end
                   and words_with_speakers[i]["speaker"] == speaker):
                chunk_words.append(words_with_speakers[i]["word"])
                i += 1
            chunk_text = " ".join(chunk_words)
            last_end = words_with_speakers[i - 1]["end"]
            f.write(f"{fmt_time(chunk_start)} --> {fmt_time(last_end)}\n")
            f.write(f"{speaker}: {chunk_text}\n\n")

# Usage
with open("meeting_transcript.json") as f:
    words = json.load(f)
generate_vtt(words, "meeting.vtt")
The resulting .vtt file can be embedded in HTML5 video or used with subtitle players.
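SubRip output differs from WebVTT in three ways: there is no header line, each cue is preceded by a sequence number, and the millisecond separator is a comma instead of a period. The following is a minimal sketch of an SRT writer; the names srt_timestamp and cues_to_srt are hypothetical, and cues are assumed to be already grouped as (start, end, text) tuples.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def cues_to_srt(cues):
    """Render (start_seconds, end_seconds, text) cues as an SRT document string."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line
    return "\n".join(blocks)

cues = [
    (0.0, 1.5, "SPEAKER_00: Good morning everyone"),
    (1.5, 3.0, "SPEAKER_01: Morning"),
]
with open("meeting.srt", "w", encoding="utf-8") as f:
    f.write(cues_to_srt(cues))
```

Working in integer milliseconds avoids rounding a value such as 59.9996 seconds up to an invalid "60" in the seconds field.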
Key Takeaways
- Forced alignment maps known transcript words to audio timestamps, achieving 50–200 ms accuracy with pretrained models and no training of your own.
- Montreal Forced Aligner (MFA) is the community standard; use mfa align to generate TextGrid files with word-level boundaries.
- Whisper API includes word-level timestamps when called with response_format="verbose_json" and timestamp_granularities=["word"]; accuracy is typically 100–300 ms, sufficient for captions.
- Combine alignment with diarization to produce fully annotated transcripts: word, speaker, start time, end time.
- Convert aligned transcripts to WebVTT or SubRip format for video players and subtitle tools.
Frequently Asked Questions
How accurate is forced alignment compared to manual annotation?
On clean English speech, forced alignment typically places word boundaries within 50–100 ms of their true position. On noisy audio or non-native accents, error increases to 200–500 ms. Manual human annotation achieves ~20 ms error but is labor-intensive. For most applications, such as captions and interactive transcripts, automatic alignment is accurate enough.
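Error figures like these are usually obtained by comparing an aligner's boundaries against a hand-annotated reference for the same words. As a rough sketch of that measurement, assuming both alignments are lists of dicts with "word", "start", and "end" keys in the same order, the mean absolute boundary error can be computed as follows. The function name mean_boundary_error_ms is hypothetical.

```python
def mean_boundary_error_ms(predicted, reference):
    """Mean absolute start/end boundary error, in milliseconds.

    Both inputs are lists of {"word", "start", "end"} dicts covering the
    same words in the same order, with times in seconds.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("alignments must be non-empty and the same length")
    errors = []
    for pred, ref in zip(predicted, reference):
        if pred["word"] != ref["word"]:
            raise ValueError(f"word mismatch: {pred['word']!r} vs {ref['word']!r}")
        errors.append(abs(pred["start"] - ref["start"]))
        errors.append(abs(pred["end"] - ref["end"]))
    return 1000.0 * sum(errors) / len(errors)

predicted = [{"word": "hello", "start": 0.50, "end": 0.70}]
reference = [{"word": "hello", "start": 0.45, "end": 0.68}]
# Start is off by 50 ms and end by 20 ms, so the mean is about 35 ms
print(round(mean_boundary_error_ms(predicted, reference), 1))
```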
Can MFA handle non-English languages?
MFA has models for many languages (French, Spanish, Mandarin, etc.), but quality varies. English and other high-resource languages have 50–100 ms error; low-resource languages may have 200–500 ms error. Check the MFA documentation for language availability.
Do I need to run MFA locally or can I use a cloud service?
You can run MFA locally (open-source) or use cloud services. Google Cloud Speech-to-Text and Azure Speech Services include word-level timestamps in their APIs for the transcripts they produce. For privacy or cost, local MFA is ideal; for simplicity, cloud APIs are convenient.
What if my transcript has errors? Will alignment still work?
MFA assumes the transcript is correct. If the transcript contains errors or omissions, alignment may fail or produce incorrect boundaries. Transcribe first (using Whisper or manual transcription), then align.
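One way to catch transcript problems before aligning is to diff the transcript against an ASR pass over the same audio and inspect the disagreements. The sketch below uses Python's difflib for a word-level comparison; the function names and the normalization rule (lowercasing and stripping punctuation) are assumptions chosen for illustration.

```python
import difflib
import re

def normalize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())

def transcript_mismatches(reference_text, asr_text):
    """List word-level differences between a transcript and an ASR hypothesis.

    Returns (operation, reference_words, asr_words) tuples, where operation
    is "replace", "delete", or "insert" relative to the reference transcript.
    """
    ref, hyp = normalize(reference_text), normalize(asr_text)
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, ref[i1:i2], hyp[j1:j2]))
    return diffs

print(transcript_mismatches("Good morning everyone", "good morning to everyone"))
# [('insert', [], ['to'])]
```

An empty result suggests the transcript matches the audio closely; a long list points to passages worth reviewing by hand before running the aligner.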
How do I handle overlapping or simultaneous speech in alignment?
Standard alignment works on a single continuous utterance. For overlapping speech, segment audio by speaker first (using diarization), then align each speaker's segment independently. Merge the results afterward.
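When each speaker's segment is aligned separately, the resulting timestamps are relative to the start of that segment, so merging means adding the segment offset back and sorting by time. The following sketch assumes a hypothetical input structure in which each segment carries its "speaker", its "offset" in seconds within the full recording, and its aligned "words".

```python
def merge_segment_alignments(segment_alignments):
    """Merge per-segment word alignments into one global, time-sorted list.

    Each segment is a dict with "speaker", "offset" (segment start in the
    full recording, in seconds), and "words" (dicts with "word", "start",
    "end" relative to the segment start).
    """
    merged = []
    for seg in segment_alignments:
        for w in seg["words"]:
            merged.append({
                "word": w["word"],
                "start": w["start"] + seg["offset"],
                "end": w["end"] + seg["offset"],
                "speaker": seg["speaker"],
            })
    # Overlapping speech interleaves naturally once sorted by start time
    merged.sort(key=lambda w: (w["start"], w["end"]))
    return merged

segments = [
    {"speaker": "SPEAKER_00", "offset": 10.0, "words": [
        {"word": "hi", "start": 0.0, "end": 0.5},
        {"word": "there", "start": 0.5, "end": 1.0},
    ]},
    {"speaker": "SPEAKER_01", "offset": 10.25, "words": [
        {"word": "yes", "start": 0.0, "end": 0.5},
    ]},
]
for w in merge_segment_alignments(segments):
    print(w["start"], w["end"], w["speaker"], w["word"])
```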
Further Reading
- Montreal Forced Aligner Documentation — Installation and usage guide.
- Praat TextGrid Format — How to read alignment output files.
- WebVTT Specification — Standard for subtitle files.
- OpenAI Whisper API Documentation — Detailed info on word-level timestamps.