Meeting Transcription and Summarization
Converting a meeting recording into a searchable, speaker-labeled transcript with a one-page summary and action items is an end-to-end audio AI application. The pipeline combines ASR (speech-to-text with word-level timestamps), diarization (speaker identification), alignment (matching each timestamped word to a speaker turn), and LLM prompt engineering (transcript cleanup, summary, and action extraction). Production systems must handle Zoom/Teams recordings, multiple speakers, interruptions, and technical jargon while producing output a reader can skim in seconds. Building this pipeline from scratch involves orchestrating five separate stages across three models (Whisper, pyannote, and an LLM); using it as a reference workflow teaches the full stack of audio AI techniques covered in this series.
Reference: Complete Meeting-to-Summary Pipeline
Here's a working pipeline that transcribes a Zoom recording, diarizes speakers, aligns timestamps, refines text, and extracts action items:
import os
import json
import librosa
import soundfile as sf
from openai import OpenAI
from anthropic import Anthropic
from pyannote.audio import Pipeline as DiarizationPipeline
def meeting_to_summary(meeting_audio_path, output_dir="meeting_output"):
    """
    Complete meeting transcription + diarization + summarization pipeline.
    Output:
    - meeting_output/raw_transcript.json (words with timestamps and speakers)
    - meeting_output/refined_transcript.md (speaker-labeled cleaned text)
    - meeting_output/summary.md (1-page executive summary)
    - meeting_output/action_items.json (structured action items)
    """
    os.makedirs(output_dir, exist_ok=True)
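    # Step 1: Transcribe with Whisper + word-level timestamps
    # Note: the hosted transcription endpoint limits upload size (25 MB at the
    # time of writing), so a long WAV may need compressing or splitting first.
    print("Step 1: Transcribing audio...")
    openai_client = OpenAI()
    with open(meeting_audio_path, "rb") as f:
        whisper_response = openai_client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            # Word timestamps are only returned when requested explicitly
            timestamp_granularities=["word"]
        )
    words_with_timestamps = [{
        "word": w.word,
        "start": w.start,
        "end": w.end
    } for w in whisper_response.words]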
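    # Step 2: Diarize (identify speakers)
    print("Step 2: Identifying speakers...")
    # The pyannote model is gated on Hugging Face: accept its terms and pass an
    # access token. Reading it from an HF_TOKEN environment variable is an
    # assumption; the argument name below is the one used by pyannote.audio 3.1,
    # so check the version you have installed.
    diarization_pipeline = DiarizationPipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=os.environ.get("HF_TOKEN")
    )
    diarization = diarization_pipeline(meeting_audio_path)
    # Extract speaker segments
    speaker_segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        speaker_segments.append({
            "start": turn.start,
            "end": turn.end,
            "speaker": speaker
        })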
    # Step 3: Align speakers with words
    print("Step 3: Aligning speakers with words...")
    for word in words_with_timestamps:
        word["speaker"] = "Unknown"
        for segment in speaker_segments:
            if segment["start"] <= word["start"] < segment["end"]:
                word["speaker"] = segment["speaker"]
                break
    # Save raw transcript
    with open(f"{output_dir}/raw_transcript.json", "w") as f:
        json.dump(words_with_timestamps, f, indent=2)
    # Step 4: Refine transcript with LLM
    print("Step 4: Refining transcript...")
    anthropic_client = Anthropic()
    # Group words by speaker turn
    turns = []
    current_turn = {"speaker": words_with_timestamps[0]["speaker"], "words": []}
    for word in words_with_timestamps:
        if word["speaker"] != current_turn["speaker"]:
            turns.append(current_turn)
            current_turn = {"speaker": word["speaker"], "words": []}
        current_turn["words"].append(word["word"])
    turns.append(current_turn)
    # Format for refinement
    raw_text = ""
    for turn in turns:
        raw_text += f"{turn['speaker']}: {' '.join(turn['words'])}\n"
    # Refine using Claude
    refinement_prompt = f"""You are a professional meeting transcription editor. Refine this meeting transcript:
- Fix punctuation and capitalization
- Correct homophones and obvious ASR errors
- Preserve speaker names and labels
- Add paragraph breaks for readability
- Remove filler words (um, uh, like) unless significant
TRANSCRIPT:
{raw_text}
REFINED TRANSCRIPT:"""
    refinement_response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8000,
        messages=[{"role": "user", "content": refinement_prompt}]
    )
    # A long meeting can exceed max_tokens, in which case the output is cut off
    if refinement_response.stop_reason == "max_tokens":
        print("Warning: refined transcript was truncated at max_tokens")
    refined_transcript = refinement_response.content[0].text
    with open(f"{output_dir}/refined_transcript.md", "w") as f:
        f.write(refined_transcript)
    # Step 5: Generate summary and extract action items
    print("Step 5: Generating summary and action items...")
    summary_prompt = f"""Analyze this meeting transcript and generate:
1. A 1-page executive summary (key decisions and topics)
2. Structured action items (task, owner, deadline if mentioned)
3. Questions or blockers raised
Format your response as:
## EXECUTIVE SUMMARY
[2-3 paragraphs summarizing key decisions, topics, and outcomes]
## ACTION ITEMS
- Task: [task description]
  Owner: [person responsible]
  Deadline: [if mentioned, otherwise "TBD"]
- ...
## BLOCKERS & QUESTIONS
[Any open questions or blockers mentioned]
TRANSCRIPT:
{refined_transcript}"""
    summary_response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    summary_text = summary_response.content[0].text
    with open(f"{output_dir}/summary.md", "w") as f:
        f.write(summary_text)
    # Extract action items as JSON
    extraction_prompt = f"""From this meeting transcript, extract action items as JSON.
Respond ONLY with JSON in this format:
{{
  "action_items": [
    {{"task": "...", "owner": "...", "deadline": "..."}},
    ...
  ]
}}
TRANSCRIPT:
{refined_transcript}"""
    extraction_response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{"role": "user", "content": extraction_prompt}]
    )
    try:
        action_items_json = json.loads(extraction_response.content[0].text)
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        action_items_json = {"action_items": []}
    with open(f"{output_dir}/action_items.json", "w") as f:
        json.dump(action_items_json, f, indent=2)
    print(f"\nOutput saved to {output_dir}/")
    return {
        "transcript_path": f"{output_dir}/refined_transcript.md",
        "summary_path": f"{output_dir}/summary.md",
        "action_items_path": f"{output_dir}/action_items.json"
    }
# Usage
if __name__ == "__main__":
    results = meeting_to_summary("zoom_recording.wav")
    # Display results
    with open(results["summary_path"]) as f:
        print("SUMMARY:\n" + f.read())
    with open(results["action_items_path"]) as f:
        items = json.load(f)
    print("\nACTION ITEMS:")
    for item in items["action_items"]:
        print(f"  - {item['task']} (Owner: {item['owner']}, Due: {item['deadline']})")
This pipeline produces meeting minutes that are ready to send to attendees after a quick review of the action items.
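Step 3 of the pipeline assigns each word to the first segment that contains its start time, so a word that straddles a speaker change, or that falls in a short gap between turns, can be mislabeled or left as "Unknown". One possible refinement, sketched here, picks the segment with the largest time overlap and falls back to the nearest segment within a tolerance. The function name `assign_speaker_by_overlap` and the `max_gap` parameter are illustrative assumptions, not part of the pipeline above.

```python
def assign_speaker_by_overlap(words, segments, max_gap=0.5):
    """Label each word with the speaker whose segment overlaps it most.

    words:    list of {"word", "start", "end"} dicts (as produced in Step 1)
    segments: list of {"start", "end", "speaker"} dicts (as produced in Step 2)
    max_gap:  hypothetical tolerance in seconds for words that fall between segments
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        nearest_speaker, nearest_gap = None, float("inf")
        for seg in segments:
            # Length of the intersection of the word's span and the segment
            overlap = min(word["end"], seg["end"]) - max(word["start"], seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = seg["speaker"], overlap
            # Distance between the word and the segment (0 if they touch or overlap)
            gap = max(seg["start"] - word["end"], word["start"] - seg["end"], 0.0)
            if gap < nearest_gap:
                nearest_speaker, nearest_gap = seg["speaker"], gap
        # No positive overlap: fall back to the nearest segment if it is close enough
        if best_speaker is None and nearest_gap <= max_gap:
            best_speaker = nearest_speaker
        labeled.append({**word, "speaker": best_speaker or "Unknown"})
    return labeled


words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.9, "end": 1.3},   # straddles the speaker change at 1.0
    {"word": "thanks", "start": 2.2, "end": 2.5},  # starts 0.2 s after the last segment
    {"word": "bye", "start": 9.0, "end": 9.3},     # far from any segment
]
segments = [
    {"start": 0.0, "end": 1.0, "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 2.0, "speaker": "SPEAKER_01"},
]
labeled = assign_speaker_by_overlap(words, segments)
print([w["speaker"] for w in labeled])
```

In this toy input the word "there" starts in the first speaker's segment but lies mostly in the second, so the overlap rule and the start-time rule disagree. Which behavior is preferable depends on how tight the diarization boundaries are.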
Handling Meeting-Specific Challenges
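Real meetings have interruptions, overlapping speech, technical jargon, and participants who dial in from noisy locations. The two functions below target the audio-quality problems, uneven levels and background noise, before the recording reaches the transcription step: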
import numpy as np  # needed for the spectral processing below
def preprocess_meeting_audio(audio_path):
    """
    Preprocess Zoom/Teams recordings to improve transcription.
    """
    y, sr = librosa.load(audio_path, sr=16000)
    # Normalize peak volume (many Zoom recordings have variable levels)
    y = librosa.util.normalize(y)
    # Apply light denoising (spectral subtraction)
    D = librosa.stft(y)  # librosa defaults: n_fft=2048, hop_length=512
    magnitude = np.abs(D)
    phase = np.angle(D)
    # Estimate noise from first 1 second (usually silence/low speech)
    noise_frames = int(sr / 512)  # ~1 second of frames at hop_length=512
    noise_spectrum = np.mean(magnitude[:, :noise_frames], axis=1, keepdims=True)
    # Subtract noise lightly (factor 0.8 to preserve speech)
    cleaned_magnitude = magnitude - 0.8 * noise_spectrum
    # Floor at 10% of the original magnitude so no bin is zeroed out
    cleaned_magnitude = np.maximum(cleaned_magnitude, 0.1 * magnitude)
    D_cleaned = cleaned_magnitude * np.exp(1j * phase)
    # Pass the original length so the output is not shortened by framing
    y_cleaned = librosa.istft(D_cleaned, length=len(y))
    # Save preprocessed audio
    sf.write("preprocessed_meeting.wav", y_cleaned, sr)
    return "preprocessed_meeting.wav"
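# For especially noisy meetings, use deeper preprocessing
def denoise_with_demucs(audio_path):
    """Use Demucs (a source-separation model) to isolate the vocals stem on noisy calls."""
    import subprocess
    from pathlib import Path
    subprocess.run([
        "demucs",
        "-n", "mdx_extra",
        "-o", "denoised",
        audio_path
    ], check=True)  # raise if Demucs exits with an error
    # Demucs writes stems to <output dir>/<model>/<input file stem>/
    return f"denoised/mdx_extra/{Path(audio_path).stem}/vocals.wav"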
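The spectral subtraction in `preprocess_meeting_audio` can be written compactly. Here $Y(f,t)$ is the STFT of the normalized signal and $T_n$ is the number of leading frames treated as noise-only (about one second); these symbols are introduced for illustration, while the constants 0.8 and 0.1 come from the code:

```latex
\bar{N}(f) = \frac{1}{T_n} \sum_{t=1}^{T_n} \lvert Y(f,t) \rvert
\qquad
\lvert \hat{S}(f,t) \rvert = \max\bigl( \lvert Y(f,t) \rvert - 0.8\,\bar{N}(f),\; 0.1\,\lvert Y(f,t) \rvert \bigr)
\qquad
\hat{S}(f,t) = \lvert \hat{S}(f,t) \rvert \, e^{\,j \angle Y(f,t)}
```

The floor term keeps at least 10% of the original magnitude in every bin, which limits the "musical noise" artifacts that hard zeroing tends to produce. The phase of the noisy signal is reused unchanged. The approach assumes the first second of the recording contains little or no speech; if someone is already talking, the noise estimate will be too high.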
Generating Meeting Notes in Different Formats
Output meeting notes in multiple formats for different use cases:
def export_meeting_output(refined_transcript_path, action_items_path, output_format="markdown"):
    """
    Export meeting notes in different formats.
    Formats: 'markdown', 'html', 'json', 'outlook_calendar'
    """
    with open(refined_transcript_path) as f:
        transcript = f.read()
    with open(action_items_path) as f:
        action_items = json.load(f)
    if output_format == "markdown":
        # Already generated
        return transcript
    elif output_format == "html":
        # Convert to HTML for email
        import markdown
        html = markdown.markdown(transcript)
        return f"<html><body>{html}</body></html>"
    elif output_format == "json":
        # Structured format for APIs
        return json.dumps({
            "transcript": transcript,
            "action_items": action_items["action_items"]
        }, indent=2)
    elif output_format == "outlook_calendar":
        # Generate calendar entries for action items
        import datetime
        calendar_entries = []
        for item in action_items["action_items"]:
            # Parse deadline (stub: assumes "YYYY-MM-DD" or "TBD")
            if item["deadline"] != "TBD":
                due_date = datetime.datetime.strptime(item["deadline"], "%Y-%m-%d")
                calendar_entries.append({
                    "subject": f"{item['task']} ({item['owner']})",
                    "start": due_date.isoformat(),
                    "categories": ["Meeting Action Item"]
                })
        return json.dumps(calendar_entries, indent=2)
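The `outlook_calendar` branch assumes every deadline is either "TBD" or an ISO date, but an LLM may return free text such as "next Friday", which makes `strptime` raise. A small helper along these lines could let the branch skip what it cannot parse. The name `parse_deadline` and the list of accepted formats are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical set of formats to try, in order; extend as needed
DEADLINE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def parse_deadline(text):
    """Return a datetime for a recognizable deadline string, else None."""
    if not text or text.strip().upper() == "TBD":
        return None
    for fmt in DEADLINE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
    return None  # free text such as "next Friday" is left for manual handling


items = [
    {"task": "Send budget", "owner": "Ana", "deadline": "2025-03-14"},
    {"task": "Book venue", "owner": "Raj", "deadline": "next Friday"},
    {"task": "Draft spec", "owner": "Li", "deadline": "TBD"},
]
entries = []
for item in items:
    due = parse_deadline(item["deadline"])
    if due is not None:
        entries.append({
            "subject": f"{item['task']} ({item['owner']})",
            "start": due.isoformat(),
        })
print(entries)
```

Only the first item yields a calendar entry here; the other two are skipped rather than crashing the export. Asking the model for ISO dates in the extraction prompt would reduce how often the fallback is needed.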
Cost and Performance Metrics
A typical meeting pipeline costs the following per hour of audio:
| Step | Cost per hour of audio | Processing time |
|---|---|---|
| Whisper Transcription | $0.36 (60 min × $0.006/min) | 30–60 sec |
| Diarization (Pyannote) | Free (local) | 2–5 min |
| LLM Refinement (Claude) | $0.10–0.30 | 5–10 sec |
| Summary + Action Extraction | $0.15–0.30 | 10–15 sec |
| Total | $0.61–0.96 | 3–6 min |
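For 100 meetings/month (100 hours of audio), budget $60–100 for API costs plus 5–10 hours of server time (100 runs at 3–6 minutes each). Replacing the hosted Whisper call with an on-device ASR model (Wav2Vec2 or a local Whisper build; pyannote already runs locally) eliminates the transcription fee but adds infrastructure complexity.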
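The monthly budget follows directly from the table. The sketch below reproduces that arithmetic so the figures can be recomputed for a different volume; the rates are the table's figures, and the function name `estimate_monthly_cost` is an assumption.

```python
WHISPER_PER_MINUTE = 0.006              # USD per audio minute, from the table
LLM_REFINEMENT_PER_HOUR = (0.10, 0.30)  # USD range per hour of audio
LLM_SUMMARY_PER_HOUR = (0.15, 0.30)     # USD range per hour of audio
PROCESSING_MINUTES_PER_HOUR = (3, 6)    # wall-clock minutes per hour of audio


def estimate_monthly_cost(audio_hours):
    """Return ((low, high) API cost in USD, (low, high) server hours)."""
    whisper = audio_hours * 60 * WHISPER_PER_MINUTE
    low = whisper + audio_hours * (LLM_REFINEMENT_PER_HOUR[0] + LLM_SUMMARY_PER_HOUR[0])
    high = whisper + audio_hours * (LLM_REFINEMENT_PER_HOUR[1] + LLM_SUMMARY_PER_HOUR[1])
    server_hours = tuple(audio_hours * m / 60 for m in PROCESSING_MINUTES_PER_HOUR)
    return (round(low, 2), round(high, 2)), server_hours


cost, server_hours = estimate_monthly_cost(100)
print(f"API cost: ${cost[0]:.2f} to ${cost[1]:.2f}")
print(f"Server time: {server_hours[0]:.1f} to {server_hours[1]:.1f} hours")
```

For 100 hours of audio this gives roughly $61 to $96 and 5 to 10 server hours, consistent with the budget quoted above. API prices change, so the constants should be checked against current pricing before being relied on.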
Key Takeaways
- An end-to-end meeting pipeline combines Whisper, diarization, alignment, and LLM refinement to produce searchable, actionable meeting notes.
- Preprocessing (denoising, normalization) can improve accuracy on noisy Zoom/Teams recordings, by an estimated 10–15%.
- LLM post-processing extracts structured data (action items, decisions) from transcripts using prompt engineering.
- Production systems must handle multiple speakers, interruptions, and technical jargon gracefully.
- Total cost for a 1-hour meeting is roughly $0.60–1.00; total processing time is 3–6 minutes.
Frequently Asked Questions
How accurate is the end-to-end pipeline on real Zoom recordings?
Accuracy depends on audio quality. High-quality recordings (good mics, low noise) achieve 90–95% transcript accuracy and 85–95% diarization accuracy. Noisy recordings (echo, background noise) drop to 70–85% accuracy. Always review action items manually.
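Part of that review can be automated. The pipeline falls back to an empty list whenever `json.loads` fails, which also happens when the model wraps otherwise valid JSON in a Markdown code fence. The sketch below strips such a fence before parsing and flags items with a missing owner or deadline for manual review. The function names and the `needs_review` field are hypothetical additions, not part of the pipeline's output format.

```python
import json
import re


def parse_action_items(llm_text):
    """Parse the model's reply into a list of action-item dicts.

    Returns (items, error); error is None on success.
    """
    text = llm_text.strip()
    # Drop a leading or trailing Markdown code fence if the model added one
    text = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [], f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or not isinstance(data.get("action_items"), list):
        return [], "expected an object with an 'action_items' list"
    return data["action_items"], None


def flag_for_review(items):
    """Mark items whose owner or deadline is missing or 'TBD'."""
    flagged = []
    for item in items:
        owner = str(item.get("owner") or "").strip()
        deadline = str(item.get("deadline") or "").strip()
        needs_review = owner in ("", "TBD", "Unknown") or deadline in ("", "TBD")
        flagged.append({**item, "needs_review": needs_review})
    return flagged


fence = "`" * 3  # built programmatically so the sample reply contains a real fence
reply = (
    f"{fence}json\n"
    '{"action_items": [{"task": "Ship v2", "owner": "Dana", "deadline": "TBD"}]}\n'
    f"{fence}"
)
items, error = parse_action_items(reply)
print(error)
print(flag_for_review(items))
```

Returning the error alongside the items lets the caller log a parsing failure instead of silently writing an empty action-item file.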
Can I process 10+ speaker meetings?
Diarization accuracy degrades with more speakers (beyond 4–5). For 10+ speakers, consider speaker enrollment (known speakers) or manual labeling of difficult sections.
How long does a complete pipeline run?
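Whisper: 30–60 seconds per hour of audio (plus network latency). Diarization: 2–5 minutes. LLM refinement, summary, and action-item extraction: 15–25 seconds combined. Total: 3–6 minutes for a 1-hour meeting on CPU. A GPU helps only the steps that run locally, chiefly diarization; the LLM calls in this pipeline go to a hosted API, so the 2–3x LLM speedup applies only if you self-host the model.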
Can I run this pipeline locally (no cloud APIs)?
Whisper, Pyannote, and Wav2Vec2 all have local implementations. You'll need 16 GB RAM and a modern GPU (8 GB+) for real-time or near-real-time performance. LLM refinement requires either a local model (Llama 2, Mistral) or an API.
How do I handle participants joining/leaving during a call?
Diarization has no notion of joining or leaving a call; it only clusters voices. A participant who rejoins, especially on a different device or connection, may be labeled as a different "Speaker X", which can be confusing in the transcript. Manually merge speaker segments for participants who rejoin, or use speaker enrollment to identify returning speakers.
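Manual merging can be as simple as a label map applied to the raw transcript before turns are grouped. In this sketch the mapping from `SPEAKER_03` to `SPEAKER_01` is a made-up example that a reviewer would supply after listening, and `relabel_speakers` and `group_turns` are hypothetical helpers.

```python
def relabel_speakers(words, label_map):
    """Return a copy of the word list with speaker labels remapped.

    words:     list of dicts with a "speaker" key (as in raw_transcript.json)
    label_map: dict from a diarization label to the label it should become;
               labels missing from the map are kept as-is
    """
    return [{**w, "speaker": label_map.get(w["speaker"], w["speaker"])} for w in words]


def group_turns(words):
    """Collapse consecutive words from the same speaker into turns."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            turns[-1]["words"].append(w["word"])
        else:
            turns.append({"speaker": w["speaker"], "words": [w["word"]]})
    return turns


words = [
    {"word": "I'm", "speaker": "SPEAKER_01"},
    {"word": "back", "speaker": "SPEAKER_03"},  # same person after rejoining
    {"word": "Welcome", "speaker": "SPEAKER_00"},
]
merged = relabel_speakers(words, {"SPEAKER_03": "SPEAKER_01"})
for turn in group_turns(merged):
    print(f"{turn['speaker']}: {' '.join(turn['words'])}")
```

The same map can also carry display names (for example mapping `SPEAKER_00` to a participant's real name) so the refined transcript reads naturally.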
Further Reading
- Pyannote Speaker Diarization for Meetings — Best practices for multi-speaker scenarios.
- Meeting Summarization: Past, Present, and Future — Research on summarization from speech.
- OpenAI Whisper Architecture — Technical details of the ASR model.
- Prompt Engineering for Structured Data Extraction — LLM prompt patterns.