
Noise Reduction for Audio Clarity

Noise reduction (audio denoising) preprocesses raw audio to remove background noise, which on noisy input can improve downstream ASR accuracy by 10–30%. Noisy recordings (Zoom calls with fan hum, meetings with traffic, or mobile audio with wind) can push ASR word error rate (WER) from 5% to 20% or worse. Denoising methods range from simple spectral subtraction (subtracting an estimated noise spectrum from the noisy spectrum) to modern deep learning models trained to reconstruct clean speech from noisy input. For most applications, spectral subtraction or pre-trained models (like Meta's Demucs or the speech enhancement models in NVIDIA's NeMo toolkit) are sufficient; custom fine-tuning is rarely necessary.

Spectral Subtraction: Fast and Simple Noise Removal

Spectral subtraction assumes that noise is additive and roughly stationary, so that the magnitude spectra approximately add: |noisy_speech| ≈ |speech| + |noise|. If you can estimate the noise spectrum (for example, from a speech-free region at the start of the recording), you can subtract it from the noisy spectrum and recover approximate clean speech. The result has some artifacts (musical noise), but speech intelligibility improves.

import librosa
import numpy as np
import soundfile as sf

def spectral_subtraction(audio_path, noise_duration=1.0, subtraction_factor=1.0, output_path=None):
    """
    Reduce noise using spectral subtraction.

    audio_path: Path to noisy audio
    noise_duration: Duration (seconds) of speech-free audio at the start, used to estimate noise
    subtraction_factor: How aggressively to subtract (1.0 = exact, >1.0 = over-subtract)
    output_path: Optional path for writing the cleaned audio
    """
    hop_length = 512  # librosa.stft default for n_fft=2048

    # Load audio as 16 kHz mono
    y, sr = librosa.load(audio_path, sr=16000)

    # Extract STFT (Short-Time Fourier Transform)
    D = librosa.stft(y, hop_length=hop_length)
    magnitude = np.abs(D)
    phase = np.angle(D)

    # Estimate noise spectrum from the first noise_duration seconds (at least one frame)
    noise_sample_frames = max(1, int(noise_duration * sr / hop_length))
    noise_spectrum = np.mean(magnitude[:, :noise_sample_frames], axis=1, keepdims=True)

    # Spectral subtraction
    cleaned_magnitude = magnitude - subtraction_factor * noise_spectrum
    # Spectral floor: keep at least 10% of the original magnitude (also prevents negative values)
    cleaned_magnitude = np.maximum(cleaned_magnitude, 0.1 * magnitude)

    # Reconstruct audio from cleaned magnitude and original phase
    D_cleaned = cleaned_magnitude * np.exp(1j * phase)
    y_cleaned = librosa.istft(D_cleaned, hop_length=hop_length, length=len(y))

    # Save or return
    if output_path:
        sf.write(output_path, y_cleaned, sr)

    return y_cleaned, sr

# Usage
y_clean, sr = spectral_subtraction("noisy_call.wav", noise_duration=1.0, subtraction_factor=1.2)

# Transcribe the cleaned audio with Whisper
from openai import OpenAI

# Save temporarily (sf is the soundfile import from above)
sf.write("temp_clean.wav", y_clean, sr)

client = OpenAI()
with open("temp_clean.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )

print(f"Cleaned transcript: {transcript.text}")

Spectral subtraction is fast (runs in milliseconds) and requires no training. Downsides: it introduces musical noise artifacts (tones that weren't in the original) and doesn't work on highly non-stationary noise (e.g., speech-like noise). Adjust subtraction_factor to balance noise removal vs artifacts: 1.0 is conservative, 1.5–2.0 is aggressive.
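To make the trade-off concrete, here is a minimal NumPy sketch on a toy magnitude spectrum. The helper name subtract_with_floor and its floor parameter are illustrative only (the function above hard-codes the floor at 0.1), and the numbers are made up.

```python
import numpy as np

def subtract_with_floor(magnitude, noise_spectrum, subtraction_factor=1.0, floor=0.1):
    """Subtract a scaled noise estimate, keeping at least floor * magnitude per bin."""
    cleaned = magnitude - subtraction_factor * noise_spectrum
    return np.maximum(cleaned, floor * magnitude)

# Toy spectrum: four frequency bins from a single frame (illustrative values)
magnitude = np.array([1.0, 0.5, 0.2, 0.1])
noise_spectrum = np.array([0.1, 0.1, 0.1, 0.1])

conservative = subtract_with_floor(magnitude, noise_spectrum, subtraction_factor=1.0)
aggressive = subtract_with_floor(magnitude, noise_spectrum, subtraction_factor=2.0)

# Count how many bins ended up clamped to the spectral floor
floor_values = 0.1 * magnitude
for name, cleaned in (("factor 1.0", conservative), ("factor 2.0", aggressive)):
    clamped = int(np.sum(np.isclose(cleaned, floor_values)))
    print(name, cleaned, "bins at floor:", clamped)
```

With the higher factor, more low-energy bins are clamped to the floor. Those isolated, clamped bins are where musical noise tends to come from, so raise the factor gradually and listen to the result.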

Wiener Filter: Statistically Optimal Denoising

Wiener filtering minimizes the mean-squared error between the denoised signal and the original clean signal. It is optimal in that sense only when the signal and noise power spectra are known; in practice both have to be estimated from the noisy recording. For ASR preprocessing, a simplified version works well:

def wiener_filter(audio_path, noise_duration=1.0, output_path=None):
    """
    Apply a simplified Wiener filter for noise reduction.
    """
    hop_length = 512  # librosa.stft default for n_fft=2048
    y, sr = librosa.load(audio_path, sr=16000)

    # Extract spectral features
    D = librosa.stft(y, hop_length=hop_length)
    magnitude = np.abs(D)
    phase = np.angle(D)

    # Estimate noise power per frequency bin from the first noise_duration seconds
    noise_frames = max(1, int(noise_duration * sr / hop_length))
    noise_power = np.mean(magnitude[:, :noise_frames] ** 2, axis=1, keepdims=True)

    # Estimate clean-speech power per time-frequency bin.
    # The observed power contains speech plus noise, so subtract the noise
    # estimate and clip at zero.
    noisy_power = magnitude ** 2
    signal_power = np.maximum(noisy_power - noise_power, 0.0)

    # Wiener gain: G = signal_power / (signal_power + noise_power)
    # This approaches 1 where speech dominates and 0 where noise dominates
    wiener_gain = signal_power / (signal_power + noise_power + 1e-6)

    # Apply gain
    cleaned_magnitude = magnitude * wiener_gain

    # Reconstruct
    D_cleaned = cleaned_magnitude * np.exp(1j * phase)
    y_cleaned = librosa.istft(D_cleaned, hop_length=hop_length, length=len(y))

    if output_path:
        sf.write(output_path, y_cleaned, sr)

    return y_cleaned, sr

# Usage
y_clean, sr = wiener_filter("noisy_call.wav", noise_duration=1.0)

Wiener filtering is more sophisticated than spectral subtraction and typically produces fewer artifacts. However, it assumes stationary noise (constant over time), which may not hold in real-world recordings.
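Dividing the gain formula G = signal_power / (signal_power + noise_power) by noise_power gives G = SNR / (1 + SNR), where SNR is the per-bin signal-to-noise power ratio. The short sketch below tabulates that relationship; the function name wiener_gain_from_snr_db is a hypothetical helper, not part of any library.

```python
def wiener_gain_from_snr_db(snr_db):
    """Ideal Wiener gain G = SNR / (1 + SNR) for a per-bin SNR given in dB."""
    snr_linear = 10 ** (snr_db / 10)  # convert dB to a power ratio
    return snr_linear / (1 + snr_linear)

for snr_db in (-10, 0, 10, 20):
    print(f"{snr_db:>4} dB -> gain {wiener_gain_from_snr_db(snr_db):.3f}")
```

At 0 dB (speech and noise equally strong) the gain is exactly 0.5. Bins well above 10 dB pass almost unchanged, and bins well below 0 dB are strongly attenuated. This soft transition is one reason the Wiener gain tends to sound smoother than hard subtraction.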

Deep Learning Denoising: Demucs and Pre-trained Models

For state-of-the-art results, use pre-trained deep learning models trained to separate speech from noise. Meta's Demucs (originally a music source separation model) works surprisingly well on speech. NVIDIA's NeMo toolkit also includes denoising models.

Install Demucs:

pip install demucs

Then use:

import subprocess
import os

def demucs_denoise(audio_path, output_dir="denoised"):
    """Denoise audio using Demucs."""
    os.makedirs(output_dir, exist_ok=True)

    # Run demucs on the audio
    # Demucs outputs multiple "stems" (vocals, drums, bass, other)
    # We use the "vocals" stem as the denoised speech
    subprocess.run([
        "demucs",
        "-n", "mdx_extra",  # Pre-trained model; the output folder below is named after it
        "-o", output_dir,
        audio_path
    ], check=True)  # Raise if demucs fails instead of returning a path that does not exist

    # Extract the vocals stem (most relevant for speech)
    basename = os.path.splitext(os.path.basename(audio_path))[0]
    vocals_path = os.path.join(output_dir, "mdx_extra", basename, "vocals.wav")

    return vocals_path

# Usage
clean_audio_path = demucs_denoise("noisy_meeting.wav")

# Transcribe
from openai import OpenAI
client = OpenAI()

with open(clean_audio_path, "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )

print(transcript.text)

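Demucs produces high-quality denoising with minimal artifacts, but it is far slower than spectral subtraction. On CPU, expect processing time on the order of the audio's own duration (the Demucs README cites roughly 1.5× the track length); a GPU is considerably faster. For batch processing, that's acceptable; for real-time streaming, it's prohibitive.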
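Since only the vocals stem is used for speech, Demucs can be asked to write just two files (vocals and everything else) through its --two-stems option. The sketch below only builds the command and the expected output path, without running anything. The helper name build_demucs_command is hypothetical, and the output layout is assumed to match the function above (output_dir/model/track/vocals.wav).

```python
from pathlib import Path

def build_demucs_command(audio_path, output_dir="denoised", model="mdx_extra"):
    """Return (command, expected_vocals_path) for a two-stem Demucs run."""
    command = [
        "demucs",
        "--two-stems=vocals",  # write vocals.wav and no_vocals.wav only
        "-n", model,
        "-o", str(output_dir),
        str(audio_path),
    ]
    # Assumed layout: <output_dir>/<model>/<track name without extension>/vocals.wav
    vocals_path = Path(output_dir) / model / Path(audio_path).stem / "vocals.wav"
    return command, vocals_path

command, vocals_path = build_demucs_command("recordings/noisy_meeting.wav")
print(" ".join(command))
print(vocals_path)
```

The returned list can be passed to subprocess.run(command, check=True). Using Path.stem rather than stripping ".wav" keeps the path correct for other extensions such as .mp3 or .flac.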

Combining Multiple Denoising Techniques

For maximum effectiveness, combine techniques: spectral subtraction for fast, lightweight preprocessing, then feed to Whisper (which has some built-in noise robustness). For critical applications, run Demucs on high-value files.

import librosa
import soundfile as sf
from openai import OpenAI

def denoise_and_transcribe(audio_path, method="spectral"):
    """
    Denoise then transcribe.
    method: 'spectral', 'wiener', or 'demucs'
    """
    if method == "spectral":
        y_clean, sr = spectral_subtraction(audio_path, noise_duration=1.0)
    elif method == "wiener":
        y_clean, sr = wiener_filter(audio_path, noise_duration=1.0)
    elif method == "demucs":
        clean_path = demucs_denoise(audio_path)
        y_clean, sr = librosa.load(clean_path, sr=16000)
    else:
        # Unknown method: fall back to the unprocessed audio
        y_clean, sr = librosa.load(audio_path, sr=16000)

    # Save cleaned audio temporarily
    temp_path = f"temp_{method}_clean.wav"
    sf.write(temp_path, y_clean, sr)

    # Transcribe
    client = OpenAI()
    with open(temp_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )

    return transcript.text

# Compare methods
for method in ["spectral", "wiener", "demucs"]:
    text = denoise_and_transcribe("noisy_zoom_call.wav", method=method)
    print(f"{method}: {text[:100]}...")

In practice, spectral subtraction or Wiener filtering for lightweight preprocessing + Whisper is sufficient for most use cases. Use Demucs only if Whisper's output is still error-ridden.

Evaluating Denoising Quality: Metrics and Listening Tests

Denoising quality is subjective (different people prefer different amounts of artifacts). For ASR, the metric that matters is transcription accuracy after denoising. Conduct A/B tests:

def compare_asr_quality(noisy_audio_path):
    """Compare ASR accuracy before and after denoising."""
    client = OpenAI()

    # Baseline: transcribe noisy audio directly
    with open(noisy_audio_path, "rb") as f:
        noisy_transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        ).text

    # Denoised: spectral subtraction then transcribe
    y_clean, sr = spectral_subtraction(noisy_audio_path, noise_duration=1.0)
    sf.write("temp_denoised.wav", y_clean, sr)

    with open("temp_denoised.wav", "rb") as f:
        clean_transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        ).text

    print("NOISY:")
    print(noisy_transcript)
    print("\nDENOISED (spectral):")
    print(clean_transcript)

    # If you have a reference transcript, compute word error rate.
    # wer() is not defined in this article; one option is the jiwer package:
    # from jiwer import wer
    # reference = "the meeting is at 2pm tomorrow"
    # noisy_wer = wer(reference, noisy_transcript)
    # clean_wer = wer(reference, clean_transcript)
    # print(f"\nWER - noisy: {noisy_wer:.1%}, denoised: {clean_wer:.1%}")

Run this comparison on a few samples from your use case. If denoising improves WER by >5%, it's worth adding to your pipeline.
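If you would rather not add a dependency for the WER calculation, it can be sketched in plain Python as a word-level edit distance divided by the number of reference words. The function name word_error_rate and the example transcripts are hypothetical. The normalization is also deliberately simple (lowercasing and whitespace splitting only), so punctuation and number formatting will count as errors.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Made-up transcripts, for illustration only
reference = "the meeting is at 2pm tomorrow"
noisy_hypothesis = "the meeting at 2pm today"        # one deletion, one substitution
denoised_hypothesis = "the meeting is at 2pm today"  # one substitution

noisy_wer = word_error_rate(reference, noisy_hypothesis)
clean_wer = word_error_rate(reference, denoised_hypothesis)
print(f"WER - noisy: {noisy_wer:.1%}, denoised: {clean_wer:.1%}")
```

Average the WER over several recordings before drawing conclusions, because a single short utterance can swing the number a lot.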

Key Takeaways

  • Noise reduction preprocesses audio to improve ASR accuracy by 10–30%, especially on noisy real-world recordings (calls, meetings, mobile audio).
  • Spectral subtraction is fast and requires no training; Wiener filtering is more sophisticated; deep learning (Demucs) is highest quality but slowest.
  • Estimate the noise spectrum from speech-free regions whenever the recording has them; adjust subtraction factor to balance noise removal vs artifacts.
  • For ASR, combine lightweight denoising (spectral subtraction) with Whisper, which has built-in noise robustness.
  • Evaluate denoising by measuring improvement in downstream ASR accuracy (WER), not by perceptual quality alone.

Frequently Asked Questions

How much does denoising actually improve ASR accuracy?

On noisy audio (SNR 10–20 dB), denoising typically improves WER by 10–30%. On clean audio, denoising may hurt slightly (introduces artifacts). Always test on your data.
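To check roughly where a recording falls relative to that SNR range, you can compare the power of a speech-free segment with the power of the rest. The sketch below assumes, like the denoising functions earlier, that the first noise_duration seconds contain only noise. The helper name estimate_snr_db is hypothetical, and the test signal is synthetic.

```python
import numpy as np

def estimate_snr_db(y, sr, noise_duration=1.0):
    """Rough SNR estimate, assuming the first noise_duration seconds contain only noise."""
    noise_samples = int(noise_duration * sr)
    if not 0 < noise_samples < len(y):
        raise ValueError("noise_duration must cover part of the recording, not all of it")
    noise_power = np.mean(y[:noise_samples] ** 2)
    # The remainder contains speech plus noise, so subtract the noise estimate
    total_power = np.mean(y[noise_samples:] ** 2)
    speech_power = max(total_power - noise_power, 1e-12)
    return 10 * np.log10(speech_power / (noise_power + 1e-12))

# Synthetic check: 1 s of noise only, then 2 s of a tone plus noise
rng = np.random.default_rng(0)
sr = 16000
noise = 0.05 * rng.standard_normal(3 * sr)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(3 * sr) / sr)
tone[:sr] = 0.0  # keep the first second speech-free
snr = estimate_snr_db(tone + noise, sr)
print(f"Estimated SNR: {snr:.1f} dB")
```

The synthetic signal has a true SNR of about 17 dB (tone power 0.125 against noise power 0.0025), so the estimate should land close to that. On real recordings, treat the number as a rough guide, since pauses inside the speech portion pull the estimate down.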

What if the audio has no silent region to estimate noise?

If the entire recording is speech with no silence, estimate noise from the quietest frequency band or the lowest-energy time windows. Alternatively, use supervised denoising (Demucs) which doesn't require noise estimation.
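One way to implement the lowest-energy-windows idea is to rank STFT frames by energy and average the quietest ones. This is a minimal sketch, not a tuned estimator. The helper name estimate_noise_from_quiet_frames and the quiet_fraction default are assumptions, and the input is expected to be a magnitude spectrogram shaped like np.abs(librosa.stft(y)).

```python
import numpy as np

def estimate_noise_from_quiet_frames(magnitude, quiet_fraction=0.1):
    """Average the lowest-energy STFT frames as a noise spectrum estimate.

    magnitude: array of shape (n_freq_bins, n_frames)
    quiet_fraction: share of frames, ranked by energy, treated as speech-free (assumed default)
    """
    frame_energy = np.sum(magnitude ** 2, axis=0)
    n_quiet = max(1, int(quiet_fraction * magnitude.shape[1]))
    quiet_idx = np.argsort(frame_energy)[:n_quiet]
    # keepdims gives shape (n_freq_bins, 1), which broadcasts against the spectrogram
    return np.mean(magnitude[:, quiet_idx], axis=1, keepdims=True)

# Toy spectrogram: 3 frequency bins, 10 frames; frames 4 and 7 are quiet
magnitude = np.full((3, 10), 1.0)
magnitude[:, 4] = 0.1
magnitude[:, 7] = 0.2
noise_spectrum = estimate_noise_from_quiet_frames(magnitude, quiet_fraction=0.2)
print(noise_spectrum.ravel())
```

The result has the same shape as noise_spectrum in spectral_subtraction, so it can replace the start-of-file estimate there. Because it picks minima, it tends to underestimate the noise slightly, which a subtraction_factor a little above 1.0 can compensate for.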

Can I denoise in real-time (streaming)?

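Spectral subtraction can run in real-time on 10 ms chunks, provided the noise estimate comes from earlier audio rather than from the whole file. Wiener filtering has moderate latency, set mainly by the analysis window length. Demucs requires buffering (runs on multi-second chunks) and is not suitable for sub-100 ms latency streaming.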
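A streaming variant of spectral subtraction has to process one frame at a time and learn the noise spectrum from audio it has already seen. The sketch below covers only the per-frame magnitude step (the STFT front end and resynthesis are omitted). It assumes the first warmup_frames frames are speech-free, and the class name StreamingSpectralSubtractor is hypothetical.

```python
import numpy as np

class StreamingSpectralSubtractor:
    """Per-frame spectral subtraction; noise is learned from the first warmup_frames frames."""

    def __init__(self, warmup_frames=10, subtraction_factor=1.0, floor=0.1):
        if warmup_frames < 1:
            raise ValueError("warmup_frames must be at least 1")
        self.warmup_frames = warmup_frames
        self.subtraction_factor = subtraction_factor
        self.floor = floor
        self.frames_seen = 0
        self.noise_spectrum = None

    def process(self, frame_magnitude):
        """Take one magnitude frame (1-D array) and return the cleaned magnitude frame."""
        frame_magnitude = np.asarray(frame_magnitude, dtype=float)
        if self.frames_seen < self.warmup_frames:
            # Running mean over the warm-up frames, assumed to be speech-free
            if self.noise_spectrum is None:
                self.noise_spectrum = frame_magnitude.copy()
            else:
                n = self.frames_seen
                self.noise_spectrum = (self.noise_spectrum * n + frame_magnitude) / (n + 1)
        self.frames_seen += 1
        cleaned = frame_magnitude - self.subtraction_factor * self.noise_spectrum
        return np.maximum(cleaned, self.floor * frame_magnitude)

# Two noise-only warm-up frames, then a frame containing speech energy
subtractor = StreamingSpectralSubtractor(warmup_frames=2)
subtractor.process([0.1, 0.1])
subtractor.process([0.1, 0.1])
speech_frame = subtractor.process([1.0, 0.5])
print(speech_frame)
```

Each call touches only one frame, so the cost per frame is a handful of vector operations. A production version would also keep updating the noise estimate during pauses so it can follow slowly changing noise.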

Does denoising help with accents or non-English speech?

Denoising improves intelligibility for all languages. However, if the accent is the primary issue (not noise), denoising won't help much. Pair denoising with language-specific ASR models.

What is the computational cost of denoising?

Spectral subtraction: <1 ms per second of audio (negligible). Wiener filter: 1–5 ms per second. Demucs: far heavier; on CPU, processing time is on the order of the audio's own duration, and a GPU is considerably faster. Measure on your own hardware. For batch processing, Demucs is fine; for real-time, use spectral or Wiener.

Further Reading