Whisper API Tutorial: Transcribe Audio Files
OpenAI's Whisper is a speech recognition model trained on 680,000 hours of multilingual audio from the web. The Whisper API hides the acoustic and language modeling: you send audio and get back a transcript, which makes it a quick way to add transcription to an application. Whisper handles 99 languages, detects the language automatically, and copes with accents, technical jargon, and background noise without preprocessing. For most teams, the Whisper API is the right starting point: it achieves 4–8% word error rate (WER) on English, costs $0.006 per minute, and requires no training or tuning.
Setting Up the Whisper API
Create a free OpenAI account and generate an API key from the OpenAI dashboard. Install the official Python client:
pip install openai
Then authenticate and test your setup:
from openai import OpenAI

client = OpenAI(api_key="sk-your-key-here")

# Transcribe a local audio file
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en"
    )
print(transcript.text)
The API accepts MP3, MP4, MPEG, MPGA, M4A, WAV, and WEBM files up to 25 MB per request. If your file is larger, split it into chunks or compress it before uploading. The language parameter is optional; omit it to let Whisper auto-detect the language. Response time is typically 10–30 seconds per minute of audio.
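For uncompressed WAV input, one way to stay under the 25 MB limit is to split the file on frame boundaries with Python's standard wave module. The sketch below is only an illustration under stated assumptions: split_wav, its parameters, and the chunk naming scheme are hypothetical and not part of the OpenAI client, and it assumes PCM WAV input. Other formats would need an external tool.

```python
import os
import wave

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB per-request limit described above
WAV_HEADER_BYTES = 44                # size of the header the wave module writes


def split_wav(path, out_dir, max_bytes=MAX_UPLOAD_BYTES):
    """Split a PCM WAV file into chunks no larger than max_bytes each.

    Returns the list of chunk file paths in playback order.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        bytes_per_frame = params.nchannels * params.sampwidth
        # Whole frames that fit in one chunk after reserving room for the header.
        frames_per_chunk = (max_bytes - WAV_HEADER_BYTES) // bytes_per_frame
        if frames_per_chunk <= 0:
            raise ValueError("max_bytes is too small to hold a single frame")
        stem = os.path.splitext(os.path.basename(path))[0]
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = os.path.join(out_dir, f"{stem}_{len(chunk_paths):03d}.wav")
            with wave.open(chunk_path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width, and rate
                dst.writeframes(frames)  # the frame count in the header is fixed up on close
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each returned path can then be transcribed separately and the texts joined in order. Cutting at fixed frame counts can split a word across two chunks, so a production splitter would probably cut at silences instead.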
Advanced Parameters: Timestamps, Prompt Priming, and Temperature
Beyond the basic transcription, Whisper supports several tuning parameters that improve accuracy and control output format:
# Transcript with word-level timestamps
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]  # needed to get per-word start/end times
    )
print(transcript.text)
# Also includes: words with start/end times, language, duration

# Prompt priming: guide Whisper toward domain-specific vocabulary
prompt = "The meeting discusses OpenAI API pricing, latency budgets, and cost optimization."
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt=prompt,
        temperature=0.0  # Lower temperature = more deterministic output
    )
# Typical use case: technical terms, proper nouns, product names
# Whisper will be more likely to produce the terms mentioned in your prompt.
The response_format parameter controls output structure:
- "json" (default): Returns an object with a text field.
- "verbose_json": Includes text, language, duration, and segments; words (with timestamps) are included when timestamp_granularities contains "word".
- "srt" or "vtt": Subtitle formats with timing information.
The prompt parameter (up to 224 tokens) acts as a hint: if you list terms from your domain (e.g., "Kubernetes, containerization, orchestration"), Whisper becomes more likely to produce those words and spellings during decoding. The temperature parameter (0.0–1.0) controls output randomness: 0.0 gives the most deterministic results, while values near 1.0 produce more varied output. For transcription, use 0.0–0.3.
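A small helper can assemble such a vocabulary prompt from a list of terms. The sketch below is a minimal illustration: build_vocab_prompt and max_words are hypothetical names, and the word budget is an assumed, rough stand-in for the token limit, since an exact count would need the model's tokenizer.

```python
def build_vocab_prompt(terms, max_words=150):
    """Join domain terms into a comma-separated prompt, deduplicated and capped.

    max_words is a rough, assumed stand-in for the token limit; an exact
    count would need the model's tokenizer.
    """
    seen = set()
    kept = []
    words_used = 0
    for term in terms:
        cleaned = " ".join(term.split())  # collapse stray whitespace
        key = cleaned.lower()
        if not cleaned or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        n_words = len(cleaned.split())
        if words_used + n_words > max_words:
            break  # stop before exceeding the budget
        seen.add(key)
        kept.append(cleaned)
        words_used += n_words
    return ", ".join(kept)


print(build_vocab_prompt(["Kubernetes", "containerization", "kubernetes", " orchestration "]))
# Kubernetes, containerization, orchestration
```

The result can be passed as the prompt argument. Putting the most important terms first means they survive if the list is cut off at the budget.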
Batch Processing: Handling Large Audio Corpora
Processing hundreds of audio files efficiently requires batching and error handling:
import os
import json
from openai import OpenAI

client = OpenAI()

def transcribe_directory(dir_path, output_file):
    """Transcribe all .mp3 files in a directory, logging results."""
    results = {}
    audio_files = [f for f in os.listdir(dir_path) if f.endswith('.mp3')]
    for idx, filename in enumerate(audio_files):
        file_path = os.path.join(dir_path, filename)
        try:
            with open(file_path, "rb") as audio_file:
                print(f"Processing {idx+1}/{len(audio_files)}: {filename}...")
                transcript = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=audio_file,
                    response_format="verbose_json"
                )
            results[filename] = {
                "text": transcript.text,
                "language": transcript.language,
                "duration": transcript.duration
            }
        except Exception as e:
            results[filename] = {"error": str(e)}
    # Save results to JSON for later processing
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    return results

# Usage
transcribe_directory("./audio_files", "transcripts.json")
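This pattern saves results to JSON for downstream post-processing and records per-file errors so that failed files can be re-run. Because the file is written only after the loop finishes, write results incrementally if you need to resume after a crash. For production pipelines, add retries with exponential backoff (the OpenAI Python client already retries certain errors 2 times by default), monitor API quota, and log errors for manual review. Cost is $0.006 per minute of audio, so 1 hour of audio costs $0.36.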
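A resumable variant could checkpoint after every file and skip files that already succeeded. The following is a sketch under assumptions, not the article's original method: transcribe_directory_resumable is a hypothetical name, and transcribe_fn is a caller-supplied hook (for example, a thin wrapper around client.audio.transcriptions.create) so the control flow can be exercised without calling the API.

```python
import json
import os


def transcribe_directory_resumable(dir_path, output_file, transcribe_fn):
    """Transcribe .mp3 files, checkpointing after each file so reruns resume.

    transcribe_fn(path) -> str is supplied by the caller (hypothetical hook).
    """
    # Load earlier results, if any, so finished files are not sent again.
    results = {}
    if os.path.exists(output_file):
        with open(output_file) as f:
            results = json.load(f)

    for filename in sorted(os.listdir(dir_path)):
        if not filename.endswith(".mp3"):
            continue
        done = results.get(filename)
        if done is not None and "error" not in done:
            continue  # already transcribed in an earlier run
        try:
            text = transcribe_fn(os.path.join(dir_path, filename))
            results[filename] = {"text": text}
        except Exception as e:  # record the failure and keep going
            results[filename] = {"error": str(e)}
        # Write to a temp file, then rename, so a crash never leaves partial JSON.
        tmp_path = output_file + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f, indent=2)
        os.replace(tmp_path, output_file)
    return results
```

Rerunning the function with the same output file retries only the entries that are missing or that recorded an error.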
Handling Multiple Languages and Language Detection
Whisper automatically detects language if you omit the language parameter. For multilingual corpora, let Whisper detect:
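# Auto-detect language
with open("spanish_audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"  # the default "json" response carries only the text
        # No 'language' parameter = auto-detection
    )
print(f"Detected language: {transcript.language}")
# Output: the detected language for Spanish audio (the API may report a name such as 'spanish' rather than 'es')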
Specifying language improves accuracy by 2–5% if you know it in advance. Whisper handles code-switching (mixing multiple languages) reasonably well but may hallucinate text if audio is severely degraded. For truly multilingual applications, consider running separate models per language or using a dedicated language classifier first.
Processing Streaming Audio with the Whisper API
The Whisper API is designed for batch/file uploads, not live streaming. For real-time transcription, you must buffer audio into chunks and submit periodically:
import io
import time
from openai import OpenAI

client = OpenAI()

# Simulate streaming audio (e.g., from a microphone or WebSocket)
def stream_and_transcribe(audio_stream, chunk_duration=30):
    """Transcribe audio chunks as they arrive."""
    buffer = io.BytesIO()
    start_time = time.time()
    for chunk in audio_stream:
        buffer.write(chunk)
        elapsed = time.time() - start_time
        # Every chunk_duration seconds, transcribe the accumulated buffer
        if elapsed >= chunk_duration:
            # The upload needs a filename so the API can tell the audio format.
            # Assumes the buffered bytes form a valid audio file (WAV here).
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=("chunk.wav", buffer.getvalue()),
                language="en"
            )
            print(f"Chunk transcript: {transcript.text}")
            # Reset buffer for next chunk
            buffer = io.BytesIO()
            start_time = time.time()

# Note: This approach has latency (30-60 second delay).
# For true real-time streaming with <1 second latency, use a dedicated
# streaming ASR service (Google Cloud Speech-to-Text, Azure Speech Services).
This pattern buffers audio for 30 seconds before transcribing, so each transcript arrives 30–60 seconds after the words were spoken. If sub-second latency is required, use streaming-specific APIs. The Whisper API is best suited for recording-based workflows (phone calls, meetings, podcasts) where a delay of that size is acceptable.
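If the incoming stream is raw PCM, each buffered chunk needs a header before it is a valid audio file, and cutting by bytes of audio received is steadier than cutting by wall-clock time. The sketch below shows one way to do that with the standard library. pcm_to_wav_chunks is a hypothetical helper, and the sample rate, channel count, and sample width defaults are assumptions about the stream.

```python
import io
import wave


def pcm_to_wav_chunks(pcm_stream, sample_rate=16000, channels=1,
                      sample_width=2, chunk_seconds=30):
    """Yield complete WAV files (as bytes) of about chunk_seconds of audio each.

    pcm_stream is any iterable of raw PCM byte strings; the format arguments
    are assumed values describing that stream.
    """
    bytes_per_chunk = sample_rate * channels * sample_width * chunk_seconds
    pending = bytearray()

    def wrap(pcm_bytes):
        # Add a WAV header so the chunk is a valid standalone audio file.
        out = io.BytesIO()
        with wave.open(out, "wb") as w:
            w.setnchannels(channels)
            w.setsampwidth(sample_width)
            w.setframerate(sample_rate)
            w.writeframes(bytes(pcm_bytes))
        return out.getvalue()

    for data in pcm_stream:
        pending.extend(data)
        while len(pending) >= bytes_per_chunk:
            yield wrap(pending[:bytes_per_chunk])
            del pending[:bytes_per_chunk]
    if pending:
        yield wrap(pending)  # flush the final partial chunk
```

Each yielded value could then be uploaded as a named file, for example file=("chunk.wav", wav_bytes), in place of the raw buffer.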
Error Handling and Cost Optimization
Common issues and solutions:
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def transcribe_with_retry(file_path, max_retries=3):
    """Transcribe with automatic retry and error logging."""
    for attempt in range(max_retries):
        try:
            with open(file_path, "rb") as audio_file:
                return client.audio.transcriptions.create(
                    model="whisper-1",
                    file=audio_file
                )
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s
                print(f"Rate limit hit. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise  # out of retries: surface the rate-limit error
        except FileNotFoundError:
            print(f"File not found: {file_path}")
            return None
        except Exception as e:
            print(f"Unexpected error: {e}")
            return None
To optimize costs, cache transcripts for identical audio so the same file is never billed twice. Compressing audio and batching requests during off-peak hours do not change the per-minute price, but compression (MP3 reduces file size 10–20x) keeps files under the 25 MB limit and shortens uploads. For large-scale deployments, consider fine-tuning a smaller open-source model (Wav2Vec, Conformer) to run on-device.
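Caching by content hash is one simple way to recognize identical audio regardless of filename. The sketch below is a minimal illustration: file_sha256, cached_transcribe, and the JSON cache layout are hypothetical, and transcribe_fn stands in for whatever function actually calls the API.

```python
import hashlib
import json
import os


def file_sha256(path, block_size=1 << 20):
    """Hash a file's contents in blocks so large audio files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_transcribe(path, cache_path, transcribe_fn):
    """Return the cached transcript for identical audio, calling transcribe_fn on a miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    key = file_sha256(path)
    if key not in cache:
        cache[key] = transcribe_fn(path)  # the only place a billed call would happen
        with open(cache_path, "w") as f:
            json.dump(cache, f, indent=2)
    return cache[key]
```

A JSON file is enough for small corpora; a larger pipeline would more likely keep the same hash-to-transcript mapping in a database.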
Key Takeaways
- Whisper API transcribes 99 languages with 4–8% WER on English; costs $0.006/minute and requires no training.
- The language and prompt parameters guide transcription; use response_format="verbose_json" with timestamp_granularities=["word"] for word-level timestamps.
- Batch processing with error handling and retry logic scales to thousands of files; results should be saved to JSON for auditing.
- Whisper is ideal for recording-based workflows; for real-time streaming with <1 second latency, use dedicated streaming ASR APIs.
- Compress audio to stay under the upload limit, cache results to avoid paying twice, and monitor API quotas for reliability in production.
Frequently Asked Questions
What file formats does Whisper API support?
Whisper accepts MP3, MP4, MPEG, MPGA, M4A, WAV, and WEBM up to 25 MB each. If your file exceeds 25 MB, split it into smaller chunks or compress it (MP3 keeps uploads small; billing is per minute of audio, so compression does not change the cost).
How accurate is Whisper compared to human transcription?
Whisper achieves 4–8% WER on English broadcast news, comparable to professional transcriptionists. On noisy recordings or non-native speakers, WER rises to 15–25%. Accuracy depends heavily on audio quality, language, and domain; always test on your data.
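To test on your own data, WER can be computed as the word-level edit distance between a reference transcript and Whisper's output, divided by the number of reference words. The function below is a minimal sketch: word_error_rate is a hypothetical helper, and it only lowercases and splits on whitespace, whereas real evaluations usually also normalize punctuation and numbers.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```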
Can Whisper transcribe audio in real-time (sub-second latency)?
No. Whisper is a batch API with 10–30 second latency per audio chunk. For true real-time captions (100–500 ms latency), use Google Cloud Speech-to-Text or Azure Speech Services, which support streaming APIs.
How do I improve accuracy for domain-specific terminology?
Use the prompt parameter to list technical terms, product names, or proper nouns. Whisper becomes more likely to produce those terms and spellings during decoding. For critical applications, combine Whisper with LLM post-processing to fix technical errors automatically.
What is the cost of Whisper API at scale?
$0.006 per minute of audio. One hour of audio costs $0.36. For 10,000 hours/year, budget $3,600/year for Whisper alone. On-device models or fine-tuned open-source alternatives can reduce costs for high-volume deployments.
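The arithmetic behind these figures can be checked with a few lines. This is a sketch that assumes the $0.006 per minute rate quoted above; estimate_cost is a hypothetical helper, and Decimal is used to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.006")  # USD, the rate quoted in this article


def estimate_cost(hours_of_audio, price_per_minute=PRICE_PER_MINUTE):
    """Return the estimated USD cost for the given hours of audio."""
    minutes = Decimal(str(hours_of_audio)) * 60
    return minutes * price_per_minute


print(estimate_cost(1))       # 0.360
print(estimate_cost(10_000))  # 3600.000
```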