Meeting Transcription and Summarization

Converting a meeting recording into a searchable, speaker-labeled transcript with a one-page summary and action items is an end-to-end audio AI application. The pipeline combines ASR (speech-to-text), diarization (deciding who spoke when), alignment (attaching speakers to word-level timestamps), and LLM prompt engineering (summary and action-item extraction). Production systems must handle Zoom/Teams recordings, multiple speakers, interruptions, and technical jargon while producing output a reader can skim in seconds. Building this pipeline from scratch means orchestrating five separate steps across three models (ASR, diarization, and an LLM); as a reference workflow, it exercises the full stack of audio AI techniques covered in this series.

Reference: Complete Meeting-to-Summary Pipeline

Here's a working pipeline that transcribes a Zoom recording, diarizes speakers, aligns timestamps, refines text, and extracts action items:

import os
import json
import librosa  # used by the preprocessing helpers later in this article
import soundfile as sf  # used by the preprocessing helpers later in this article
from openai import OpenAI
from anthropic import Anthropic
from pyannote.audio import Pipeline as DiarizationPipeline

def meeting_to_summary(meeting_audio_path, output_dir="meeting_output"):
    """
    Complete meeting transcription + diarization + summarization pipeline.

    Output:
    - meeting_output/raw_transcript.json (words with timestamps and speakers)
    - meeting_output/refined_transcript.md (speaker-labeled cleaned text)
    - meeting_output/summary.md (1-page executive summary)
    - meeting_output/action_items.json (structured action items)
    """
    os.makedirs(output_dir, exist_ok=True)

    # Step 1: Transcribe with Whisper + timestamps
    # Note: the API rejects uploads above 25 MB, so compress or chunk long recordings first.
    print("Step 1: Transcribing audio...")
    openai_client = OpenAI()

    with open(meeting_audio_path, "rb") as f:
        whisper_response = openai_client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            # Word-level timestamps are only returned when requested explicitly
            timestamp_granularities=["word"],
        )

    words_with_timestamps = [{
        "word": w.word,
        "start": w.start,
        "end": w.end
    } for w in (whisper_response.words or [])]

    if not words_with_timestamps:
        raise ValueError("Whisper returned no words; check the audio file.")

    # Step 2: Diarize (identify speakers)
    # The model is gated on Hugging Face: accept its terms and authenticate
    # (for example with `huggingface-cli login`) before the first run.
    print("Step 2: Identifying speakers...")
    diarization_pipeline = DiarizationPipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1"
    )
    diarization = diarization_pipeline(meeting_audio_path)

    # Extract speaker segments
    speaker_segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        speaker_segments.append({
            "start": turn.start,
            "end": turn.end,
            "speaker": speaker
        })

    # Step 3: Align speakers with words
    # A word gets the speaker of the first segment that contains its start time.
    print("Step 3: Aligning speakers with words...")
    for word in words_with_timestamps:
        word["speaker"] = "Unknown"
        for segment in speaker_segments:
            if segment["start"] <= word["start"] < segment["end"]:
                word["speaker"] = segment["speaker"]
                break

    # Save raw transcript
    with open(f"{output_dir}/raw_transcript.json", "w") as f:
        json.dump(words_with_timestamps, f, indent=2)

    # Step 4: Refine transcript with LLM
    print("Step 4: Refining transcript...")
    anthropic_client = Anthropic()

    # Group consecutive words from the same speaker into turns
    turns = []
    current_turn = {"speaker": words_with_timestamps[0]["speaker"], "words": []}
    for word in words_with_timestamps:
        if word["speaker"] != current_turn["speaker"]:
            turns.append(current_turn)
            current_turn = {"speaker": word["speaker"], "words": []}
        current_turn["words"].append(word["word"])
    turns.append(current_turn)

    # Format for refinement: one "SPEAKER: text" line per turn
    raw_text = ""
    for turn in turns:
        raw_text += f"{turn['speaker']}: {' '.join(turn['words'])}\n"

    # Refine using Claude
    refinement_prompt = f"""You are a professional meeting transcription editor. Refine this meeting transcript:
- Fix punctuation and capitalization
- Correct homophones and obvious ASR errors
- Preserve speaker names and labels
- Add paragraph breaks for readability
- Remove filler words (um, uh, like) unless significant

TRANSCRIPT:
{raw_text}

REFINED TRANSCRIPT:"""

    # A long meeting may not fit in this output budget; refine such transcripts in chunks.
    refinement_response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8000,
        messages=[{"role": "user", "content": refinement_prompt}]
    )

    refined_transcript = refinement_response.content[0].text

    with open(f"{output_dir}/refined_transcript.md", "w") as f:
        f.write(refined_transcript)

    # Step 5: Generate summary and extract action items
    print("Step 5: Generating summary and action items...")

    summary_prompt = f"""Analyze this meeting transcript and generate:
1. A 1-page executive summary (key decisions and topics)
2. Structured action items (task, owner, deadline if mentioned)
3. Questions or blockers raised

Format your response as:

## EXECUTIVE SUMMARY
[2-3 paragraphs summarizing key decisions, topics, and outcomes]

## ACTION ITEMS
- Task: [task description]
  Owner: [person responsible]
  Deadline: [if mentioned, otherwise "TBD"]
- ...

## BLOCKERS & QUESTIONS
[Any open questions or blockers mentioned]

TRANSCRIPT:
{refined_transcript}"""

    summary_response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{"role": "user", "content": summary_prompt}]
    )

    summary_text = summary_response.content[0].text

    with open(f"{output_dir}/summary.md", "w") as f:
        f.write(summary_text)

    # Extract action items as JSON
    extraction_prompt = f"""From this meeting transcript, extract action items as JSON.
Respond ONLY with JSON in this format:
{{
  "action_items": [
    {{"task": "...", "owner": "...", "deadline": "..."}},
    ...
  ]
}}

TRANSCRIPT:
{refined_transcript}"""

    extraction_response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{"role": "user", "content": extraction_prompt}]
    )

    try:
        action_items_json = json.loads(extraction_response.content[0].text)
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        action_items_json = {"action_items": []}

    with open(f"{output_dir}/action_items.json", "w") as f:
        json.dump(action_items_json, f, indent=2)

    print(f"\nOutput saved to {output_dir}/")
    return {
        "transcript_path": f"{output_dir}/refined_transcript.md",
        "summary_path": f"{output_dir}/summary.md",
        "action_items_path": f"{output_dir}/action_items.json"
    }

# Usage
if __name__ == "__main__":
    results = meeting_to_summary("zoom_recording.wav")

    # Display results
    with open(results["summary_path"]) as f:
        print("SUMMARY:\n" + f.read())

    with open(results["action_items_path"]) as f:
        items = json.load(f)
    print("\nACTION ITEMS:")
    for item in items["action_items"]:
        print(f" - {item['task']} (Owner: {item['owner']}, Due: {item['deadline']})")

This pipeline produces meeting minutes that can be distributed to attendees once the action items have been checked.
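Step 3 of the pipeline assigns each word to the first segment that contains its start time, so a word that straddles a speaker change always goes to the earlier speaker. One possible alternative, sketched here as an illustration and not part of the pipeline above, is to pick the segment with the largest time overlap. The function name `assign_speakers_by_overlap` and the sample data are assumptions made for this example.

```python
def assign_speakers_by_overlap(words, segments, default="Unknown"):
    """Label each word with the speaker whose segment overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = default, 0.0
        for seg in segments:
            # Length of the intersection of the two time intervals (negative if disjoint)
            overlap = min(word["end"], seg["end"]) - max(word["start"], seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = seg["speaker"], overlap
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Hypothetical sample data in the same shape the pipeline uses
words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "thanks", "start": 0.9, "end": 1.5},    # straddles the speaker change at 1.0
    {"word": "everyone", "start": 5.0, "end": 5.5},  # outside every segment
]
segments = [
    {"start": 0.0, "end": 1.0, "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 3.0, "speaker": "SPEAKER_01"},
]

labeled = assign_speakers_by_overlap(words, segments)
print([w["speaker"] for w in labeled])
# ['SPEAKER_00', 'SPEAKER_01', 'Unknown']
```

The start-time rule would give "thanks" to SPEAKER_00 even though most of the word falls in SPEAKER_01's segment. Both approaches scan every segment for every word, which is fine for a single meeting but worth indexing for long recordings.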

Handling Meeting-Specific Challenges

Real meetings have interruptions, overlapping speech, technical jargon, and participants who dial in from noisy locations. The preprocessing functions below target the audio-quality problems (variable levels and background noise):

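import subprocess
from pathlib import Path

import librosa
import numpy as np
import soundfile as sf

def preprocess_meeting_audio(audio_path):
    """
    Preprocess Zoom/Teams recordings to improve transcription.
    """
    y, sr = librosa.load(audio_path, sr=16000)

    # Normalize peak amplitude (many Zoom recordings have variable levels)
    y = librosa.util.normalize(y)

    # Apply light denoising (spectral subtraction)
    hop_length = 512  # librosa's default hop for n_fft=2048
    D = librosa.stft(y, n_fft=2048, hop_length=hop_length)
    magnitude = np.abs(D)
    phase = np.angle(D)

    # Estimate noise from the first second (assumed to be silence or low speech)
    noise_frames = max(1, sr // hop_length)  # ~1 second of STFT frames
    noise_spectrum = np.mean(magnitude[:, :noise_frames], axis=1, keepdims=True)

    # Subtract noise lightly (factor 0.8 to preserve speech),
    # keeping at least 10% of the original magnitude in every bin
    cleaned_magnitude = magnitude - 0.8 * noise_spectrum
    cleaned_magnitude = np.maximum(cleaned_magnitude, 0.1 * magnitude)

    # Rebuild the waveform with the original phase and the original length
    D_cleaned = cleaned_magnitude * np.exp(1j * phase)
    y_cleaned = librosa.istft(D_cleaned, hop_length=hop_length, length=len(y))

    # Save preprocessed audio
    sf.write("preprocessed_meeting.wav", y_cleaned, sr)

    return "preprocessed_meeting.wav"

# For especially noisy meetings, use deeper preprocessing
def denoise_with_demucs(audio_path):
    """Use Demucs for stronger denoising on noisy calls.

    Demucs is a music source-separation model; its "vocals" stem is used
    here as the speech track.
    """
    subprocess.run([
        "demucs",
        "-n", "mdx_extra",
        "-o", "denoised",
        audio_path
    ], check=True)  # raise if Demucs fails instead of returning a missing path
    # Demucs writes to <out>/<model>/<input file name without extension>/
    track_name = Path(audio_path).stem
    return f"denoised/mdx_extra/{track_name}/vocals.wav"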

Generating Meeting Notes in Different Formats

Output meeting notes in multiple formats for different use cases:

def export_meeting_output(refined_transcript_path, action_items_path, output_format="markdown"):
    """
    Export meeting notes in different formats.
    Formats: 'markdown', 'html', 'json', 'outlook_calendar'
    """
    with open(refined_transcript_path) as f:
        transcript = f.read()

    with open(action_items_path) as f:
        action_items = json.load(f)

    if output_format == "markdown":
        # Already generated
        return transcript

    elif output_format == "html":
        # Convert to HTML for email
        import markdown
        html = markdown.markdown(transcript)
        return f"<html><body>{html}</body></html>"

    elif output_format == "json":
        # Structured format for APIs
        return json.dumps({
            "transcript": transcript,
            "action_items": action_items["action_items"]
        }, indent=2)

    elif output_format == "outlook_calendar":
        # Generate calendar entries for action items
        import datetime
        calendar_entries = []

        for item in action_items["action_items"]:
            # Deadlines come from the LLM, so skip anything that is not
            # "YYYY-MM-DD" (including "TBD" and free-text dates)
            try:
                due_date = datetime.datetime.strptime(item["deadline"], "%Y-%m-%d")
            except (KeyError, TypeError, ValueError):
                continue
            calendar_entries.append({
                "subject": f"{item['task']} ({item['owner']})",
                "start": due_date.isoformat(),
                "categories": ["Meeting Action Item"]
            })

        return json.dumps(calendar_entries, indent=2)

    else:
        raise ValueError(f"Unsupported output_format: {output_format!r}")

Cost and Performance Metrics

A typical meeting pipeline costs:

| Step | Cost per hour of audio | Time |
| --- | --- | --- |
| Whisper transcription | $0.36 (60 min × $0.006/min) | 30–60 sec |
| Diarization (Pyannote) | Free (local) | 2–5 min |
| LLM refinement (Claude) | $0.10–0.30 | 5–10 sec |
| Summary + action extraction | $0.15–0.30 | 10–15 sec |
| Total | $0.61–0.96 | 3–6 min |

For 100 meetings/month (100 hours of audio), budget $60–100 for API costs (100 × $0.61–0.96) plus 5–10 hours of server time (100 × 3–6 min). On-device models (Wav2Vec2, Pyannote) eliminate cloud costs but add infrastructure complexity.
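The budget figures can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the per-hour rates quoted in this section; the function name `monthly_budget` is hypothetical, and the rates are constants to update against current pricing.

```python
# Per-hour-of-audio figures quoted in this section (low, high)
WHISPER_PER_MIN = 0.006
LLM_REFINEMENT = (0.10, 0.30)
SUMMARY_EXTRACTION = (0.15, 0.30)
MINUTES_PER_MEETING_HOUR = (3, 6)  # processing time per hour of audio


def monthly_budget(audio_hours):
    """Return ((low, high) API cost in USD, (low, high) server hours)."""
    whisper = audio_hours * 60 * WHISPER_PER_MIN
    cost = tuple(
        round(whisper + audio_hours * (LLM_REFINEMENT[i] + SUMMARY_EXTRACTION[i]), 2)
        for i in range(2)
    )
    server_hours = tuple(audio_hours * m / 60 for m in MINUTES_PER_MEETING_HOUR)
    return cost, server_hours


cost, server_hours = monthly_budget(100)
print(cost)          # (61.0, 96.0)
print(server_hours)  # (5.0, 10.0)
```

Transcription is the fixed part of the bill ($36 for 100 hours at these rates); the spread comes entirely from the LLM steps, which scale with transcript length.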

Key Takeaways

  • An end-to-end meeting pipeline combines Whisper, diarization, alignment, and LLM refinement to produce searchable, actionable meeting notes.
  • Preprocessing (denoising, normalization) improves accuracy on noisy Zoom/Teams recordings by 10–15%.
  • LLM post-processing extracts structured data (action items, decisions) from transcripts using prompt engineering.
  • Production systems must handle multiple speakers, interruptions, and technical jargon gracefully.
  • Total cost per meeting is $0.60–1.00; total time is 3–6 minutes for a 1-hour meeting.
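Structured extraction only works if the model's reply can be parsed. Models sometimes wrap JSON in a Markdown code fence or add a sentence before it, and a plain `json.loads` call then fails. The following is a sketch of a more tolerant parser, not part of the pipeline above; `parse_action_items` and the default values "Unassigned" and "TBD" are assumptions for illustration.

```python
import json
import re


def parse_action_items(reply_text):
    """Parse an LLM reply into {"action_items": [...]}, tolerating fences and stray prose."""
    # Take the outermost {...} span so code fences or a leading sentence are ignored
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        return {"action_items": []}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"action_items": []}

    items = data.get("action_items") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return {"action_items": []}

    cleaned = []
    for item in items:
        # Keep only well-formed entries and fill in missing optional fields
        if isinstance(item, dict) and item.get("task"):
            cleaned.append({
                "task": item["task"],
                "owner": item.get("owner") or "Unassigned",
                "deadline": item.get("deadline") or "TBD",
            })
    return {"action_items": cleaned}


# Hypothetical reply that wraps the JSON in a Markdown fence
fence = "`" * 3
reply = (
    "Here are the items:\n"
    f"{fence}json\n"
    '{"action_items": [{"task": "Send budget", "owner": "Dana"}]}\n'
    f"{fence}"
)
parsed = parse_action_items(reply)
print(parsed)
```

Filling in missing `owner` and `deadline` keys also protects downstream code that indexes those fields directly.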

Frequently Asked Questions

How accurate is the end-to-end pipeline on real Zoom recordings?

Accuracy depends on audio quality. High-quality recordings (good mics, low noise) achieve 90–95% transcript accuracy and 85–95% diarization accuracy. Noisy recordings (echo, background noise) drop to 70–85% accuracy. Always review action items manually.
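To measure transcript accuracy on your own recordings, one common approach is word error rate (WER) against a hand-corrected reference. The sketch below assumes that "transcript accuracy" is read as 1 − WER, which is an interpretation and not something the figures above specify. It uses a plain word-level edit distance with no text normalization beyond lowercasing, and the sample sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the budget report by friday"
hypothesis = "please send a budget report friday"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}, accuracy: {1 - wer:.1%}")
# WER: 0.286, accuracy: 71.4%
```

Here one substitution ("the" → "a") and one deletion ("by") over seven reference words give a WER of 2/7.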

Can I process 10+ speaker meetings?

Diarization accuracy degrades with more speakers (beyond 4–5). For 10+ speakers, consider speaker enrollment (known speakers) or manual labeling of difficult sections.

How long does a complete pipeline run?

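Whisper: 30–60 seconds per hour of audio (plus network latency). Diarization: 2–5 minutes on CPU. LLM refinement and summarization: 15–25 seconds combined. Total: 3–6 minutes for a 1-hour meeting. A GPU helps only the stages that run locally: diarization in this pipeline, plus LLM inference if you self-host the model (2–3x speedup).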

Can I run this pipeline locally (no cloud APIs)?

Whisper, Pyannote, and Wav2Vec2 all have local implementations. You'll need 16 GB of RAM and a modern GPU (8 GB+ of VRAM) for real-time or near-real-time performance. LLM refinement requires either a local model (Llama 2, Mistral) or an API.

How do I handle participants joining/leaving during a call?

Diarization treats join/leave as speaker changes. The transcript may label a returning speaker as a different "Speaker X", which can be confusing. Manually merge the speaker labels of participants who rejoin, or use speaker enrollment to identify returning speakers.
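Manual merging can be as simple as applying a label map to the word list saved in raw_transcript.json. This is a minimal sketch: the helper name `merge_speaker_labels`, the participant names, and the mapping are hypothetical, and the map itself still has to come from someone listening to the recording.

```python
def merge_speaker_labels(words, label_map):
    """Return a copy of the word list with speaker labels replaced per label_map."""
    return [
        {**word, "speaker": label_map.get(word["speaker"], word["speaker"])}
        for word in words
    ]


# Hypothetical case: Dana dropped off and was relabeled SPEAKER_03 after rejoining
words = [
    {"word": "Hi", "speaker": "SPEAKER_00"},
    {"word": "okay", "speaker": "SPEAKER_01"},
    {"word": "back", "speaker": "SPEAKER_03"},
    {"word": "hmm", "speaker": "SPEAKER_02"},
]
label_map = {"SPEAKER_00": "Dana", "SPEAKER_03": "Dana", "SPEAKER_01": "Lee"}

merged = merge_speaker_labels(words, label_map)
print([w["speaker"] for w in merged])
# ['Dana', 'Lee', 'Dana', 'SPEAKER_02']
```

Labels missing from the map are left unchanged, so the merge can be applied incrementally as speakers are identified.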

Further Reading