Understanding Speech Recognition in Noisy Environments
Audio is one of the most natural interfaces for robotics. We speak to each other, so why not speak to machines? The problem is that the real world is noisy. Engines hum, crowds chatter, wind blows. If your robot can't understand you over background noise, that natural interface breaks down.
I wanted to understand how modern speech recognition handles this challenge. After exploring the Open ASR Leaderboard, I selected Parakeet, a model that strikes a nice balance between accuracy (low word error rate) and speed (real-time factor). I also dug into DeepFilterNet, a denoising model that cleans up audio before transcription.
This post covers what I learned: how to capture and corrupt audio, how these models work, and how to deploy them for real use.
Model Selection: The Open ASR Leaderboard
When choosing an ASR model, two metrics matter most: Word Error Rate (WER) and Real-Time Factor (RTFx). WER measures the fraction of words the model gets wrong (substitutions, insertions, and deletions); lower is better. RTFx measures speed as a multiple of real time: an RTFx of 100 means one minute of audio is transcribed in 0.6 seconds.
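To make WER concrete, here's a quick check with the jiwer library (the sentences are made up):

```python
import jiwer  # pip install jiwer

reference = "turn left at the next intersection"
hypothesis = "turn left at the text intersection"

# WER = (substitutions + insertions + deletions) / words in reference
print(jiwer.wer(reference, hypothesis))  # 1 substitution / 6 words ≈ 0.167
```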
The Open ASR Leaderboard makes it easy to compare these trade-offs across 60+ open-source and proprietary systems.
The accuracy vs speed trade-off
| Model | WER (%) | RTFx | Notes |
|---|---|---|---|
| Canary Qwen 2.5B | 5.63 | 418 | Most accurate, slower |
| Whisper Large v3 | 7.44 | 145 | Multilingual, widely used |
| Parakeet TDT 0.6B v3 | 6.32 | 3332 | Fast, accurate, multilingual |
Why Parakeet v3
Parakeet TDT 0.6B v3 hits a sweet spot: near-top accuracy, blazing-fast inference, and multilingual support. For robotics applications requiring real-time transcription in varied environments, this combination is ideal.
Parakeet Architecture Deep Dive
Parakeet is a 600M parameter ASR model from NVIDIA. It takes in audio and outputs text with punctuation, capitalization, and word-level timestamps. Let's break down how it works.
The Big Picture
Audio → Mel Spectrogram → FastConformer Encoder → TDT Decoder → Text + Timestamps
Step 1: Audio to Mel Spectrogram
Raw audio is just a 1D array of air pressure measurements sampled at 16kHz. To make this useful for a neural network, we convert it to a mel spectrogram:
- Slice the audio into overlapping 25ms windows
- Apply an FFT to each window to get frequency content
- Map frequencies onto the mel scale (which matches human hearing)
- Take the log of the energies
The result is a 2D "image" with time on one axis and frequency on the other.
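With librosa, this whole front-end is a few lines. The window, hop, and mel-band values below are typical ASR choices, not necessarily Parakeet's exact config:

```python
import librosa
import numpy as np

# Load audio as a 1D float array at 16kHz
y, sr = librosa.load("speech.wav", sr=16000)

# 25ms windows (400 samples) with a 10ms hop (160 samples)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log of the energies
print(log_mel.shape)  # (n_mels, n_frames): frequency x time
```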
Step 2: FastConformer Encoder
The encoder processes this spectrogram through a stack of Conformer blocks. Each block combines:
- Convolutional layers for local patterns (individual sounds)
- Self-attention for global context (how sounds relate across time)
Before the Conformer blocks, convolutional layers downsample the time dimension by 8x (FastConformer's key change from the original Conformer's 4x), reducing computation while preserving the information that matters for speech.
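Here's a schematic PyTorch sketch of a single Conformer block, following the Macaron-style layout from the Conformer paper. It's illustrative, not NVIDIA's implementation (which adds relative positional encoding, dropout, and other details):

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Schematic Conformer block: half-step FFN, self-attention,
    depthwise convolution, half-step FFN, all with residuals."""
    def __init__(self, d=512, heads=8, kernel=31):
        super().__init__()
        self.ffn1 = self._ffn(d)
        self.norm_attn = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d)
        self.conv = nn.Sequential(                  # operates on (B, d, T)
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),  # pointwise + gate
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, 1),                     # pointwise
        )
        self.ffn2 = self._ffn(d)
        self.norm_out = nn.LayerNorm(d)

    @staticmethod
    def _ffn(d):
        return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                             nn.SiLU(), nn.Linear(4 * d, d))

    def forward(self, x):               # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)      # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]   # global context via attention
        c = self.norm_conv(x).transpose(1, 2)
        x = x + self.conv(c).transpose(1, 2)  # local patterns via conv
        x = x + 0.5 * self.ffn2(x)
        return self.norm_out(x)
```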
Step 3: TDT Decoder
This is what makes Parakeet special. To understand why, we need to talk about a tricky problem in ASR: alignment.
The encoder outputs roughly 125 vectors for 10 seconds of audio (one per 80ms after downsampling), but the transcript might only have 50 characters. How do you know which encoder frames correspond to which characters?
CTC (Connectionist Temporal Classification) is the traditional solution. It adds a "blank" token and allows repeated outputs, then collapses them. For example, h h h _ _ i i i collapses to hi. CTC sums over all possible alignments during training, so you don't need frame-level labels. It's simple and fast, but it only predicts tokens, not timing.
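The collapse rule is simple enough to show in a few lines (a toy illustration, with `_` as the blank):

```python
def ctc_collapse(alignment, blank="_"):
    """Merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)

print(ctc_collapse("hhh__iii"))   # -> "hi"
print(ctc_collapse("hh_h_iii"))   # -> "hhi" (a blank separates true repeats)
```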
TDT (Token-and-Duration Transducer) goes further. At each step it predicts two things:
- Token: what character or subword comes next
- Duration: how many encoder frames to skip before the next prediction
By predicting durations explicitly, TDT learns when each token occurs. This gives you accurate word-level timestamps for free, without any post-processing alignment step.
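A greedy TDT decode loop looks roughly like this. `step_fn` is a stand-in for the real prediction/joint network, and I've simplified so the pointer always advances by at least one frame:

```python
def tdt_greedy_decode(frames, step_fn, blank=0, frame_sec=0.08):
    """Sketch of greedy TDT decoding. step_fn(frame, state) returns
    (token, duration, state); duration is how many encoder frames to
    skip. Timestamps fall out of the frame pointer for free."""
    t, state, tokens, times = 0, None, [], []
    while t < len(frames):
        token, duration, state = step_fn(frames[t], state)
        if token != blank:
            tokens.append(token)
            times.append(t * frame_sec)     # start time in seconds
        t += max(duration, 1)  # simplification: always advance >= 1 frame
    return tokens, times
```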
Training: Two Stages
Parakeet v3 uses a two-stage training process:
CTC Pretraining: Train the FastConformer encoder with a simple CTC head on ~660k hours of multilingual audio. CTC is faster to train, and this stage teaches the encoder to extract useful acoustic features across 25 languages.
TDT Fine-tuning: Replace the CTC head with the TDT decoder and fine-tune end-to-end for 150k steps on 128 A100 GPUs. This adds the timestamp prediction capability while building on the encoder's multilingual foundation.
Denoising with DeepFilterNet
Real-world audio is messy. Engines, wind, crowds. If we can clean up the audio before passing it to Parakeet, we should get better transcriptions. DeepFilterNet is a lightweight denoising model designed exactly for this.
Why DeepFilterNet?
DeepFilterNet runs in real-time on embedded devices (even a Raspberry Pi 4) with only ~2.3M parameters. For robotics applications where compute is limited, this matters.
The Two-Stage Architecture
DeepFilterNet exploits a key insight: speech has two components that need different treatment.
Stage 1: ERB Envelope Enhancement
This stage does "coarse" denoising. It works in the ERB (Equivalent Rectangular Bandwidth) domain, a perceptually-motivated frequency scale similar to mel spectrograms.
The model predicts a gain between 0 and 1 for each ERB band at each time frame:
- 0.0 = "this is all noise, suppress it"
- 1.0 = "this is speech, keep it"
These gains are applied to the overall spectral envelope, shaping which frequency bands to keep or suppress.
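As a sketch, applying stage-1 gains could look like this. The helper and `band_edges` (mapping each ERB band to a range of STFT bins) are hypothetical:

```python
import numpy as np

def apply_erb_gains(mag, gains, band_edges):
    """Scale each ERB band of a magnitude spectrogram by its predicted
    gain. mag: (freq_bins, frames); gains: (bands, frames), values in
    [0, 1]; band_edges: list of (lo, hi) bin ranges per band."""
    out = mag.copy()
    for b, (lo, hi) in enumerate(band_edges):
        out[lo:hi, :] *= gains[b, :]   # 0 = suppress, 1 = keep
    return out
```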
Stage 2: Deep Filtering
Stage 1 only touches magnitudes. But for voiced speech (vowels and other periodic sounds), the fine harmonic structure needs cleaning too, including the phase information.
Stage 2 predicts complex filter coefficients for each frequency bin (up to ~5kHz). These coefficients are convolved across time with the noisy spectrogram, reconstructing the periodic structure of speech.
Why predict a filter instead of the clean signal directly? The filter can combine information across neighboring time frames, tracking the pitch and harmonics as they evolve.
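Conceptually, deep filtering is a short complex-valued convolution over time at each frequency bin. Here's an illustrative numpy version, not the actual DeepFilterNet code:

```python
import numpy as np

def deep_filter(noisy_stft, coefs):
    """Combine each bin's current and past frames with predicted
    complex coefficients. noisy_stft: (freq, time) complex;
    coefs: (freq, time, order) complex, one filter per bin per frame."""
    out = np.zeros_like(noisy_stft)
    order = coefs.shape[-1]
    for n in range(order):
        shifted = np.roll(noisy_stft, n, axis=1)  # frame t - n
        shifted[:, :n] = 0                        # zero the wrap-around
        out += coefs[..., n] * shifted
    return out
```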
Shared Encoder
Both stages share a single GRU-based encoder. This is efficient (fewer parameters) and ensures both decoders work from the same understanding of the input. The recurrent architecture processes frame-by-frame, enabling real-time streaming.
Training
Training requires pairs of clean and noisy audio. You take clean speech, mix in random noise, and train the network to recover the original. The loss function compares spectrograms at multiple resolutions, forcing the model to get both fine temporal details and overall spectral structure correct.
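A minimal version of that mixing step, assuming both signals are float arrays at the same sample rate:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)       # loop/trim to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12       # avoid divide-by-zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```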
Putting It Together
With both models understood, let's see how they work in practice.
The Pipeline
Audio Capture → Add Noise → DeepFilterNet → Parakeet → Transcription
For testing, we captured audio from YouTube using yt-dlp, resampled it to 16kHz mono (Parakeet's native input format), and synthetically added engine noise. This lets us control the noise level and measure how well the pipeline handles it.
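Wired together, the core of the pipeline is only a few calls. The snippet below follows the DeepFilterNet and NeMo APIs as documented at the time of writing; check the current releases before relying on it:

```python
import nemo.collections.asr as nemo_asr
from df.enhance import enhance, init_df, load_audio, save_audio

# Load both models once at startup
df_model, df_state, _ = init_df()
asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# Denoise: DeepFilterNet works at its own sample rate (48kHz)
noisy, _ = load_audio("noisy.wav", sr=df_state.sr())
clean = enhance(df_model, df_state, noisy)
save_audio("clean.wav", clean, df_state.sr())

# Transcribe the cleaned audio with word-level timestamps
result = asr.transcribe(["clean.wav"], timestamps=True)
print(result[0].text)
```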
Generating Realistic Noise
White noise is easy to generate, but it doesn't sound like the real world. To simulate engine noise, we:
- Captured a sample of actual engine audio
- Computed its FFT to extract the frequency profile
- Generated white noise and shaped it to match that profile
The result is synthetic noise with the same frequency characteristics as a real engine: heavy in the low frequencies, tapering off at higher frequencies.
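Here's the gist in numpy. The padding and normalization choices are my own; the approach is as described above:

```python
import numpy as np

def shape_noise_like(engine_sample, n_samples, seed=0):
    """Generate white noise, then impose the engine sample's
    magnitude spectrum on it in the frequency domain."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    # Frequency profile of the reference (zero-padded/truncated to fit)
    profile = np.abs(np.fft.rfft(engine_sample, n=n_samples))
    shaped = np.fft.irfft(np.fft.rfft(white) * profile, n=n_samples)
    return shaped / np.max(np.abs(shaped))  # normalize to [-1, 1]
```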
Dockerized Deployment
For practical use, we packaged both models in Docker containers. This makes deployment reproducible and keeps dependencies isolated. The containers expose simple APIs: send audio in, get cleaned audio or transcriptions out.
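A client call might look like the following. The endpoint, port, and response shape are hypothetical; they depend on how you define the container's API:

```python
import requests

# Hypothetical endpoint exposed by the transcription container
with open("clean.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/transcribe",
                         files={"audio": f})
print(resp.json())  # e.g. {"text": "...", "timestamps": [...]}
```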
Results
On clean audio, Parakeet transcribes accurately with proper punctuation and capitalization. On noisy audio without denoising, word error rate increases noticeably. With DeepFilterNet in front, we recover most of that lost accuracy with minimal latency.
Conclusion
Building robust speech recognition for real-world environments requires more than just a good ASR model. Noise is inevitable, and handling it gracefully makes the difference between a demo and a deployable system.
In this post, we explored:
- Parakeet v3: A fast, accurate, multilingual ASR model that provides word-level timestamps through its TDT decoder
- DeepFilterNet: A lightweight denoiser that cleans audio in real-time using a two-stage approach for envelope and harmonic enhancement
- The pipeline: How to chain these models together for practical use
For robotics and voice interface applications, this combination offers a compelling trade-off: state-of-the-art accuracy, real-time performance, and robustness to noisy environments.
References
Models
- Parakeet TDT 0.6B v3 - NVIDIA's multilingual ASR model
- DeepFilterNet - Real-time speech enhancement
Datasets & Benchmarks
- Open ASR Leaderboard - Hugging Face ASR model comparison
Tools
- yt-dlp - Audio/video downloading
- librosa - Audio analysis
- NVIDIA NeMo - Toolkit for Parakeet inference