
Multi-Language Speech Recognition Pipeline

Multi-language ASR expands transcription beyond English to 50+ languages, enabling global applications. The challenges are identifying which language is being spoken, selecting the right acoustic and language models, and handling code-switching (mixing languages within a single utterance). Whisper natively supports 99 languages and detects the language from the audio automatically, making it the easiest starting point. For higher accuracy on specific languages or code-switched audio, models such as Wav2Vec2 or Conformer fine-tuned on in-language data can outperform general multilingual models. Production systems often combine language detection, language-specific ASR, and LLM post-processing to handle diverse global content.

Language Detection: Identifying Spoken Language

Before transcription, identify the language to select the right model. Language identification (LID) models listen to a short clip, typically 1–3 seconds of audio, and output a probability distribution over languages. Compact LID classifiers can be as small as 500 KB–5 MB; the wav2vec 2.0-based model used in the example below is much larger.

import librosa
from transformers import pipeline

def detect_language_from_audio(audio_path):
    """
    Detect the language of spoken audio using a pre-trained LID model.
    """
    # Initialize the language identification pipeline.
    # The model ID is an assumption: facebook/mms-lid-126 is a publicly available
    # wav2vec 2.0-based LID checkpoint covering 126 languages. Swap in the
    # checkpoint you actually deploy.
    lid_pipeline = pipeline(
        "audio-classification",
        model="facebook/mms-lid-126"
    )

    # Load audio (max 10 seconds), resampled to the 16 kHz the model expects
    y, sr = librosa.load(audio_path, sr=16000, duration=10)

    # Run LID; the array is already at the model's sampling rate
    results = lid_pipeline(y, top_k=5)

    # Results is a list of {"score": float, "label": "lang_code"}
    print("Language detection results:")
    for result in results:
        print(f"  {result['label']}: {result['score']:.2%}")

    # Return top language
    top_language = results[0]["label"]
    # Labels are like "eng" (English), "fra" (French), "deu" (German), "hin" (Hindi)

    return top_language

# Usage
lang = detect_language_from_audio("unknown_language_audio.wav")
print(f"Detected language: {lang}")

LID accuracy is typically 90–95% for single-language audio. Confidence drops for strongly accented speech and for short clips (< 1 second), so always allow a manual override in the UI.
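One way to act on low-confidence detections is to accept the top label only when its score clears a threshold, and otherwise defer to a manual override or to the transcription model's own auto-detection. The sketch below assumes results in the `{"label", "score"}` format shown above. The `pick_language` name and the 0.80 threshold are illustrative assumptions, not tuned values.

```python
def pick_language(results, min_score=0.80, override=None):
    """Return a language label, or None when detection is too uncertain.

    results: list of {"label": str, "score": float} from an LID model
    min_score: hypothetical confidence threshold; tune it on your own data
    override: language chosen manually in the UI, which always wins
    """
    if override is not None:
        return override
    if not results:
        return None
    # Do not rely on the list being pre-sorted; find the best score explicitly
    top = max(results, key=lambda r: r["score"])
    return top["label"] if top["score"] >= min_score else None
```

A return value of `None` signals that the caller should ask the user or fall back to auto-detection rather than commit to a guess.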

Whisper for Multi-Language Transcription

Whisper automatically detects language but allows explicit specification for control:

from openai import OpenAI

def transcribe_multilingual(audio_path, language_code=None):
    """
    Transcribe audio in any Whisper-supported language.

    language_code: ISO-639-1 code (e.g., 'es', 'fr', 'ja') or None for auto-detect
    """
    client = OpenAI()

    # Send `language` only when it is set; leaving it out lets Whisper auto-detect
    params = {"model": "whisper-1"}
    if language_code is not None:
        params["language"] = language_code

    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(file=audio_file, **params)

    return transcript.text

# Examples
print(transcribe_multilingual("spanish_conversation.wav", language_code="es"))
print(transcribe_multilingual("mandarin_lecture.wav", language_code="zh"))
print(transcribe_multilingual("hindi_interview.wav"))  # Auto-detect

# Whisper supports 99 languages with good coverage of major languages
# Quality is best for: English, Spanish, French, German, Chinese, Japanese, Hindi
# Less reliable for: Low-resource languages, heavy accents, technical jargon

Whisper's multilingual accuracy varies: 5–10% WER on major languages (English, Spanish, Mandarin), 15–25% WER on less common languages. For maximum accuracy, fine-tune on in-language data or use language-specific models.
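WER figures like these are easier to interpret, and to reproduce on your own test set, with the metric spelled out. WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function below is a minimal sketch that assumes whitespace-separated words and no text normalization; `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Whitespace splitting does not suit languages written without spaces between words, such as Chinese or Japanese; character error rate is the usual substitute there.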

Language-Specific ASR Models: Fine-Tuning for Accuracy

For critical applications or low-resource languages, fine-tune Wav2Vec2 or Conformer on in-language data:

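import os

from datasets import Audio, load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

def fine_tune_wav2vec2_for_language(language_code, training_data_path, output_dir):
    """
    Fine-tune Wav2Vec2 for a specific language using local training data.

    language_code: e.g., 'hi' for Hindi
    training_data_path: Directory of audio files and transcripts. This sketch
        assumes it holds train_metadata.json (rows with an "audio" file path and
        a "text" transcript) and a character-level vocab.json.
    """
    # Load pretrained Wav2Vec2 (multilingual base)
    model_id = "facebook/wav2vec2-xls-r-300m"

    # The pretrained checkpoint has no tokenizer, so build the processor from a
    # character vocabulary for the target language
    tokenizer = Wav2Vec2CTCTokenizer(
        os.path.join(training_data_path, "vocab.json"),
        unk_token="[UNK]",
        pad_token="[PAD]",
        word_delimiter_token="|",
    )
    feature_extractor = Wav2Vec2FeatureExtractor(
        feature_size=1,
        sampling_rate=16000,
        padding_value=0.0,
        do_normalize=True,
        return_attention_mask=True,
    )
    processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
    model = Wav2Vec2ForCTC.from_pretrained(
        model_id,
        ctc_loss_reduction="mean",
        pad_token_id=tokenizer.pad_token_id,
        vocab_size=len(tokenizer),
    )

    # Load your language-specific training data and decode audio at 16 kHz
    train_dataset = load_dataset(
        "json",
        data_files={"train": os.path.join(training_data_path, "train_metadata.json")},
        cache_dir="./cache",
    )["train"]
    train_dataset = train_dataset.cast_column("audio", Audio(sampling_rate=16000))

    def preprocess_function(example):
        audio = example["audio"]
        example["input_values"] = processor(
            audio["array"], sampling_rate=16000
        ).input_values[0]
        example["labels"] = processor(text=example["text"]).input_ids
        return example

    train_dataset = train_dataset.map(
        preprocess_function, remove_columns=train_dataset.column_names
    )

    def collate(features):
        # Pad audio and labels per batch; -100 masks label padding in the CTC loss
        batch = processor.feature_extractor.pad(
            [{"input_values": f["input_values"]} for f in features],
            padding=True,
            return_tensors="pt",
        )
        labels = processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features],
            padding=True,
            return_tensors="pt",
        )
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

    # Training (there is no eval split here, so evaluation options are left out)
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=16,
        gradient_accumulation_steps=2,
        learning_rate=1e-4,
        num_train_epochs=10,
        save_steps=100,
        logging_steps=25,
        save_strategy="steps",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=collate,
        processing_class=processor,  # needs a recent transformers release
    )

    trainer.train()

    # Save fine-tuned model
    model.save_pretrained(f"{output_dir}/{language_code}_final")
    processor.save_pretrained(f"{output_dir}/{language_code}_final")

# Fine-tuning requires:
# - 10–100 hours of audio in your target language
# - Parallel transcripts (one per audio file)
# - ~8 GB GPU memory for Wav2Vec2-large
# - Training time: 2–24 hours depending on data size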

Fine-tuning can reduce WER by 10–30% on in-language data, but it requires substantial labeled data and computational resources.
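Part of the data preparation is building an output vocabulary: a pretrained XLS-R checkpoint has no tokenizer of its own, so CTC fine-tuning typically starts from a character vocabulary derived from the target-language transcripts. The helper below is a minimal sketch of that step. The `build_ctc_vocab` name, the lowercasing, and the `[UNK]`/`[PAD]` token strings are assumptions chosen for illustration.

```python
import json

def build_ctc_vocab(transcripts, vocab_path=None):
    """Build a character-level vocabulary (token -> id) from transcripts."""
    # Collect every non-whitespace character that appears in the transcripts
    chars = sorted(
        {ch for text in transcripts for ch in text.lower() if not ch.isspace()}
    )
    # "|" stands in for the space between words, a common CTC convention
    tokens = ["|"] + [ch for ch in chars if ch != "|"] + ["[UNK]", "[PAD]"]
    vocab = {token: index for index, token in enumerate(tokens)}

    if vocab_path is not None:
        # ensure_ascii=False keeps non-Latin scripts readable in the file
        with open(vocab_path, "w", encoding="utf-8") as f:
            json.dump(vocab, f, ensure_ascii=False)
    return vocab
```

Characters that never appear in the training transcripts map to `[UNK]` at inference time, so the transcripts should cover the full script of the target language.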

Code-Switching: Multi-Language Within a Single Utterance

Code-switching (mixing languages, e.g., "Hello, como estás?") is common in multilingual communities. Standard models struggle because they expect monolingual input. Strategies:

  1. Use a code-switching-aware model: Some Whisper variants and Conformer models are trained on code-switched data.
  2. Segment and transcribe separately: Detect language boundaries, segment audio, and transcribe each segment with the appropriate model.
  3. Use an ensemble: Run multiple language-specific models and select the best output.

import os

import librosa
import soundfile as sf
from openai import OpenAI
from transformers import pipeline

def handle_code_switching(audio_path):
    """
    Handle code-switched audio by detecting language boundaries and transcribing per-segment.
    """
    client = OpenAI()
    y, sr = librosa.load(audio_path, sr=16000)

    # Load the LID model once, not once per segment.
    # The model ID is an assumption; use whichever LID checkpoint you deploy.
    lid_pipeline = pipeline("audio-classification", model="facebook/mms-lid-126")

    # Naive segmentation: use silence to detect boundaries (stub)
    # In practice, use a more sophisticated model to detect language switches
    segments = segment_by_silence(y, sr, min_duration=0.5)

    full_transcript = []

    for segment_audio, segment_label in segments:
        # Detect language for this segment (audio is already at 16 kHz)
        lang_result = lid_pipeline(segment_audio, top_k=1)
        language_code = convert_lid_to_iso639(lang_result[0]["label"])

        # The transcription API takes a file, so write the segment to a temporary WAV
        temp_file = f"temp_segment_{segment_label}.wav"
        sf.write(temp_file, segment_audio, sr)
        try:
            # Pass the language only when the LID label maps to an ISO 639-1 code
            params = {"model": "whisper-1"}
            if language_code is not None:
                params["language"] = language_code
            with open(temp_file, "rb") as f:
                transcript = client.audio.transcriptions.create(file=f, **params)
        finally:
            os.remove(temp_file)

        full_transcript.append(transcript.text)

    return " ".join(full_transcript)

def segment_by_silence(y, sr, min_duration=0.5):
    """Stub: Segment audio by silence."""
    return [(y, "full")]

def convert_lid_to_iso639(lid_code):
    """Map LID language code (e.g., 'eng') to ISO 639-1 (e.g., 'en')."""
    mapping = {
        "eng": "en", "spa": "es", "fra": "fr", "deu": "de",
        "hin": "hi", "zho": "zh", "jpn": "ja", "ara": "ar",
        # Some LID models emit more specific ISO 639-3 codes
        "cmn": "zh", "arb": "ar",
    }
    # Return None for unmapped codes so the caller can fall back to auto-detection
    # instead of silently assuming English
    return mapping.get(lid_code)

For complex code-switching, deploy a specialized model or add manual post-processing.
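For the segment-and-transcribe strategy, one simple alternative to silence-based segmentation is to run LID on fixed-length windows and merge neighbouring windows that share a label. The sketch below covers only the merging step and assumes the per-window labels have already been computed. The `merge_language_windows` name and the 2-second window length are illustrative assumptions.

```python
def merge_language_windows(window_labels, window_seconds=2.0):
    """Merge adjacent fixed-length windows that share a language label.

    window_labels: one LID label per consecutive window, in time order
    Returns a list of (start_seconds, end_seconds, label) segments.
    """
    segments = []
    for index, label in enumerate(window_labels):
        start = index * window_seconds
        end = start + window_seconds
        if segments and segments[-1][2] == label:
            # Same language as the previous window: extend that segment
            segments[-1] = (segments[-1][0], end, label)
        else:
            # Language changed (or first window): open a new segment
            segments.append((start, end, label))
    return segments
```

Each resulting segment can then be cut from the audio and transcribed with the matching language setting. Shorter windows locate switches more precisely, but LID confidence falls on very short clips, so the window length is a trade-off.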

Building a Production Multi-Language Pipeline

A robust pipeline combines language detection, model selection, and LLM refinement:

from openai import OpenAI
from anthropic import Anthropic

def multilingual_asr_pipeline(audio_path):
    """
    Complete multi-language ASR pipeline.
    """
    # Step 1: Detect language (the LID model returns ISO 639-3 labels such as "eng")
    lang = detect_language_from_audio(audio_path)
    print(f"Detected language: {lang}")

    # Step 2: Transcribe with Whisper, which expects ISO 639-1 codes such as "en"
    client = OpenAI()
    params = {"model": "whisper-1"}
    whisper_lang = convert_lid_to_iso639(lang)
    if whisper_lang is not None:
        params["language"] = whisper_lang
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(file=f, **params)

    raw_text = transcript.text
    print(f"Raw transcript: {raw_text}")

    # Step 3: Refine with LLM (if needed)
    # For non-English, LLM refinement is less reliable but can fix some errors
    anthropic_client = Anthropic()

    prompt = f"""You are a transcription editor. The following is a {lang} transcript.
Fix any obvious errors, improve punctuation, and correct homophones:

TRANSCRIPT:
{raw_text}

REFINED TRANSCRIPT:"""

    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}]
    )

    refined_text = response.content[0].text

    return {
        "language": lang,
        "raw_transcript": raw_text,
        "refined_transcript": refined_text
    }

# Usage
result = multilingual_asr_pipeline("mystery_language_audio.wav")
print(f"Language: {result['language']}")
print(f"Refined: {result['refined_transcript']}")
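The pipeline above sends every language to Whisper. Where fine-tuned models exist for some languages, the model-selection step can be sketched as a lookup with a general-purpose fallback. Everything named here is hypothetical: the `LANGUAGE_MODELS` registry, the checkpoint paths, and the `select_asr_backend` function are placeholders for whatever your deployment uses.

```python
# Hypothetical registry: ISO 639-1 code -> directory of a fine-tuned checkpoint
LANGUAGE_MODELS = {
    "hi": "models/hi_final",
    "sw": "models/sw_final",
}

def select_asr_backend(language_code, registry=LANGUAGE_MODELS):
    """Return (backend, model) for a detected language code.

    Languages with a fine-tuned checkpoint are routed to it; anything else,
    including an unknown language (None), falls back to Whisper.
    """
    if language_code in registry:
        return ("wav2vec2", registry[language_code])
    return ("whisper", "whisper-1")
```

Keeping the routing in a plain dictionary makes it easy to add a language once its fine-tuned model has been evaluated against the fallback.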

Key Takeaways

  • Language identification (LID) models automatically detect spoken language with 90–95% accuracy; use for automatic model selection.
  • Whisper supports 99 languages with auto-detection; English and major languages achieve 5–10% WER, less common languages 15–25%.
  • Fine-tuning Wav2Vec2 on language-specific data can reduce WER by 10–30%, but it requires 10–100 hours of labeled audio.
  • Code-switching (language mixing) is challenging; segment by language boundaries and transcribe per-segment, or use code-switching-aware models.
  • Production pipelines combine language detection, language-specific transcription, and LLM refinement for global content.

Frequently Asked Questions

How accurate is language detection on code-switched audio?

LID models assume monolingual segments. On strongly code-switched audio (50% language A, 50% language B), accuracy drops to 60–80%. For extreme code-switching, manually override language selection or use a code-switching-specific model.

Which languages does Whisper handle best?

Whisper is most accurate on English, Spanish, French, German, Italian, Portuguese, Chinese, and Japanese. For less common languages, expect noticeably higher error rates, roughly 15–25% WER. Check the Whisper documentation for coverage of your target languages.

Can I use Whisper's output directly for non-English text without refinement?

Whisper is good enough for most applications without refinement. LLM refinement helps with punctuation and domain-specific errors, but the gains are smaller for non-English (5–10% vs 10–15% for English).

How much training data is needed to fine-tune for a new language?

Start with 10 hours, which can achieve a 15–20% WER improvement. 100 hours can reduce WER by 30–40%. For production-grade systems, 1,000+ hours is ideal. Audio quality and transcript accuracy matter more than quantity.

Does Whisper handle accents within a language (e.g., British vs American English)?

Whisper generalizes well across accents because it was trained on diverse global data. Accent-specific fine-tuning (10–20 hours of accented data) can improve WER by 2–5% if accent is the primary issue.

Further Reading