Understanding Speech Recognition in Noisy Environments
Audio is one of the most natural interfaces for robotics. We speak to each other, so why not speak to machines? The problem is that the real world is noisy. Engines hum, crowds chatter, wind blows. If your robot can't understand you over background noise, that natural interface breaks down.
I wanted to understand how modern speech recognition handles this challenge. After exploring the Open ASR Leaderboard, I selected Parakeet, a model that strikes a nice balance between accuracy (low word error rate) and speed (real-time factor). I also dug into DeepFilterNet, a denoising model that cleans up audio before transcription.
This post covers what I learned: how to capture and corrupt audio, how these models work, and how to deploy them for real use.
Model Selection: The Open ASR Leaderboard
When choosing an ASR model, two metrics matter most: Word Error Rate (WER) and Real-Time Factor (RTFx). WER measures the fraction of words the model gets wrong (substitutions, insertions, and deletions); lower is better. RTFx measures speed as a multiple of real time: an RTFx of 100 means one minute of audio is transcribed in 0.6 seconds.
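To make WER concrete, here's a quick check with the jiwer library (the sentences are made up):

```python
import jiwer  # pip install jiwer

reference = "turn left at the next intersection"
hypothesis = "turn left at the text intersection"

# WER = (substitutions + insertions + deletions) / words in reference
print(jiwer.wer(reference, hypothesis))  # 1 substitution / 6 words ≈ 0.167
```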
The Open ASR Leaderboard makes it easy to compare these trade-offs across 60+ open-source and proprietary systems.
The accuracy vs speed trade-off
| Model | WER (%) | RTFx | Notes |
|---|---|---|---|
| Canary Qwen 2.5B | 5.63 | 418 | Most accurate, slower |
| Whisper Large v3 | 7.44 | 145 | Multilingual, widely used |
| Parakeet TDT 0.6B v3 | 6.32 | 3332 | Fast, accurate, multilingual |
Why Parakeet v3
Parakeet TDT 0.6B v3 hits a sweet spot: near-top accuracy, blazing-fast inference, and multilingual support. For robotics applications requiring real-time transcription in varied environments, this combination is ideal.
Parakeet Architecture Deep Dive
Parakeet is a 600M parameter ASR model from NVIDIA. It takes in audio and outputs text with punctuation, capitalization, and word-level timestamps. Let's break down how it works.
The Big Picture
Audio → Mel Spectrogram → FastConformer Encoder → TDT Decoder → Text + Timestamps
Step 1: Audio to Mel Spectrogram
Raw audio is just a 1D array of air pressure measurements sampled at 16kHz. To make this useful for a neural network, we convert it to a mel spectrogram:
- Slice the audio into overlapping 25ms windows
- Apply an FFT to each window to get frequency content
- Map frequencies onto the mel scale (which matches human hearing)
- Take the log of the energies
The result is a 2D "image" with time on one axis and frequency on the other.
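With librosa, this whole front-end is a few lines. The window, hop, and mel-band values below are typical ASR choices, not necessarily Parakeet's exact config:

```python
import librosa
import numpy as np

# Load audio as a 1D float array at 16kHz
y, sr = librosa.load("speech.wav", sr=16000)

# 25ms windows (400 samples) with a 10ms hop (160 samples)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log of the energies
print(log_mel.shape)  # (n_mels, n_frames): frequency x time
```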
Step 2: FastConformer Encoder
The encoder processes this spectrogram through a stack of Conformer blocks. Each block combines:
- Convolutional layers for local patterns (individual sounds)
- Self-attention for global context (how sounds relate across time)
Before the Conformer blocks, convolutional layers downsample the time dimension by 8x (FastConformer's key change from the original Conformer's 4x), reducing computation while preserving the information that matters for speech.
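Here's a schematic PyTorch sketch of a single Conformer block, following the Macaron-style layout from the Conformer paper. It's illustrative, not NVIDIA's implementation (which adds relative positional encoding, dropout, and other details):

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Schematic Conformer block: half-step FFN, self-attention,
    depthwise convolution, half-step FFN, all with residuals."""
    def __init__(self, d=512, heads=8, kernel=31):
        super().__init__()
        self.ffn1 = self._ffn(d)
        self.norm_attn = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d)
        self.conv = nn.Sequential(                  # operates on (B, d, T)
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),  # pointwise + gate
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, 1),                     # pointwise
        )
        self.ffn2 = self._ffn(d)
        self.norm_out = nn.LayerNorm(d)

    @staticmethod
    def _ffn(d):
        return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                             nn.SiLU(), nn.Linear(4 * d, d))

    def forward(self, x):               # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)      # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]   # global context via attention
        c = self.norm_conv(x).transpose(1, 2)
        x = x + self.conv(c).transpose(1, 2)  # local patterns via conv
        x = x + 0.5 * self.ffn2(x)
        return self.norm_out(x)
```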
Step 3: TDT Decoder
This is what makes Parakeet special. To understand why, we need to talk about a tricky problem in ASR: alignment.
The encoder outputs roughly 125 vectors for 10 seconds of audio (one per 80ms after downsampling), but the transcript might only have 50 characters. How do you know which encoder frames correspond to which characters?
CTC (Connectionist Temporal Classification) is the traditional solution. It adds a "blank" token and allows repeated outputs, then collapses them. For example, h h h _ _ i i i collapses to hi. CTC sums over all possible alignments during training, so you don't need frame-level labels. It's simple and fast, but it only predicts tokens, not timing.
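The collapse rule is simple enough to show in a few lines (a toy illustration, with `_` as the blank):

```python
def ctc_collapse(alignment, blank="_"):
    """Merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)

print(ctc_collapse("hhh__iii"))   # -> "hi"
print(ctc_collapse("hh_h_iii"))   # -> "hhi" (a blank separates true repeats)
```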
TDT (Token-and-Duration Transducer) goes further. At each step it predicts two things:
- Token: what character or subword comes next
- Duration: how many encoder frames to skip before the next prediction
By predicting durations explicitly, TDT learns when each token occurs. This gives you accurate word-level timestamps for free, without any post-processing alignment step.
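A greedy TDT decode loop looks roughly like this. `step_fn` is a stand-in for the real prediction/joint network, and I've simplified so the pointer always advances by at least one frame:

```python
def tdt_greedy_decode(frames, step_fn, blank=0, frame_sec=0.08):
    """Sketch of greedy TDT decoding. step_fn(frame, state) returns
    (token, duration, state); duration is how many encoder frames to
    skip. Timestamps fall out of the frame pointer for free."""
    t, state, tokens, times = 0, None, [], []
    while t < len(frames):
        token, duration, state = step_fn(frames[t], state)
        if token != blank:
            tokens.append(token)
            times.append(t * frame_sec)     # start time in seconds
        t += max(duration, 1)  # simplification: always advance >= 1 frame
    return tokens, times
```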
Training: Two Stages
Parakeet v3 uses a two-stage training process:
CTC Pretraining: Train the FastConformer encoder with a simple CTC head on ~660k hours of multilingual audio. CTC is faster to train, and this stage teaches the encoder to extract useful acoustic features across 25 languages.
TDT Fine-tuning: Replace the CTC head with the TDT decoder and fine-tune end-to-end for 150k steps on 128 A100 GPUs. This adds the timestamp prediction capability while building on the encoder's multilingual foundation.
Denoising with DeepFilterNet
Real-world audio is messy. Engines, wind, crowds. If we can clean up the audio before passing it to Parakeet, we should get better transcriptions. DeepFilterNet is a lightweight denoising model designed exactly for this.
Why DeepFilterNet?
DeepFilterNet runs in real-time on embedded devices (even a Raspberry Pi 4) with only ~2.3M parameters. For robotics applications where compute is limited, this matters.
The Two-Stage Architecture
DeepFilterNet exploits a key insight: speech has two components that need different treatment.
Stage 1: ERB Envelope Enhancement
This stage does "coarse" denoising. It works in the ERB (Equivalent Rectangular Bandwidth) domain, a perceptually-motivated frequency scale similar to mel spectrograms.
The model predicts a gain between 0 and 1 for each ERB band at each time frame:
- 0.0 = "this is all noise, suppress it"
- 1.0 = "this is speech, keep it"
These gains are applied to the overall spectral envelope, shaping which frequency bands to keep or suppress.
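As a sketch, applying stage-1 gains could look like this. The helper and `band_edges` (mapping each ERB band to a range of STFT bins) are hypothetical:

```python
import numpy as np

def apply_erb_gains(mag, gains, band_edges):
    """Scale each ERB band of a magnitude spectrogram by its predicted
    gain. mag: (freq_bins, frames); gains: (bands, frames), values in
    [0, 1]; band_edges: list of (lo, hi) bin ranges per band."""
    out = mag.copy()
    for b, (lo, hi) in enumerate(band_edges):
        out[lo:hi, :] *= gains[b, :]   # 0 = suppress, 1 = keep
    return out
```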
Stage 2: Deep Filtering
Stage 1 only touches magnitudes. But for voiced speech (vowels and other periodic sounds), the fine harmonic structure needs cleaning too, including the phase information.
Stage 2 predicts complex filter coefficients for each frequency bin (up to ~5kHz). These coefficients are convolved across time with the noisy spectrogram, reconstructing the periodic structure of speech.
Why predict a filter instead of the clean signal directly? The filter can combine information across neighboring time frames, tracking the pitch and harmonics as they evolve.
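Conceptually, deep filtering is a short complex-valued convolution over time at each frequency bin. Here's an illustrative numpy version, not the actual DeepFilterNet code:

```python
import numpy as np

def deep_filter(noisy_stft, coefs):
    """Combine each bin's current and past frames with predicted
    complex coefficients. noisy_stft: (freq, time) complex;
    coefs: (freq, time, order) complex, one filter per bin per frame."""
    out = np.zeros_like(noisy_stft)
    order = coefs.shape[-1]
    for n in range(order):
        shifted = np.roll(noisy_stft, n, axis=1)  # frame t - n
        shifted[:, :n] = 0                        # zero the wrap-around
        out += coefs[..., n] * shifted
    return out
```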
Shared Encoder
Both stages share a single GRU-based encoder. This is efficient (fewer parameters) and ensures both decoders work from the same understanding of the input. The recurrent architecture processes frame-by-frame, enabling real-time streaming.
Training
Training requires pairs of clean and noisy audio. You take clean speech, mix in random noise, and train the network to recover the original. The loss function compares spectrograms at multiple resolutions, forcing the model to get both fine temporal details and overall spectral structure correct.
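A minimal version of that mixing step, assuming both signals are float arrays at the same sample rate:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)       # loop/trim to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12       # avoid divide-by-zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```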
Putting It Together
With both models understood, let's see how they work in practice.
The Pipeline
Audio Capture → Add Noise → DeepFilterNet → Parakeet → Transcription
For testing, we captured audio from YouTube using yt-dlp, resampled it to 16kHz mono (Parakeet's native input format), and synthetically added engine noise. This lets us control the noise level and measure how well the pipeline handles it.
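Wired together, the core of the pipeline is only a few calls. The snippet below follows the DeepFilterNet and NeMo APIs as documented at the time of writing; check the current releases before relying on it:

```python
import nemo.collections.asr as nemo_asr
from df.enhance import enhance, init_df, load_audio, save_audio

# Load both models once at startup
df_model, df_state, _ = init_df()
asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# Denoise: DeepFilterNet works at its own sample rate (48kHz)
noisy, _ = load_audio("noisy.wav", sr=df_state.sr())
clean = enhance(df_model, df_state, noisy)
save_audio("clean.wav", clean, df_state.sr())

# Transcribe the cleaned audio with word-level timestamps
result = asr.transcribe(["clean.wav"], timestamps=True)
print(result[0].text)
```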
Generating Realistic Noise
White noise is easy to generate, but it doesn't sound like the real world. To simulate engine noise, we:
- Captured a sample of actual engine audio
- Computed its FFT to extract the frequency profile
- Generated white noise and shaped it to match that profile
The result is synthetic noise with the same frequency characteristics as a real engine: heavy in the low frequencies, tapering off at higher frequencies.
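Here's the gist in numpy. The padding and normalization choices are my own; the approach is as described above:

```python
import numpy as np

def shape_noise_like(engine_sample, n_samples, seed=0):
    """Generate white noise, then impose the engine sample's
    magnitude spectrum on it in the frequency domain."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    # Frequency profile of the reference (zero-padded/truncated to fit)
    profile = np.abs(np.fft.rfft(engine_sample, n=n_samples))
    shaped = np.fft.irfft(np.fft.rfft(white) * profile, n=n_samples)
    return shaped / np.max(np.abs(shaped))  # normalize to [-1, 1]
```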
Dockerized Deployment
For practical use, we packaged both models in Docker containers. This makes deployment reproducible and keeps dependencies isolated. The containers expose simple APIs: send audio in, get cleaned audio or transcriptions out.
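A client call might look like the following. The endpoint, port, and response shape are hypothetical; they depend on how you define the container's API:

```python
import requests

# Hypothetical endpoint exposed by the transcription container
with open("clean.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/transcribe",
                         files={"audio": f})
print(resp.json())  # e.g. {"text": "...", "timestamps": [...]}
```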
Results
On clean audio, Parakeet transcribes accurately with proper punctuation and capitalization. On noisy audio without denoising, word error rate increases noticeably. With DeepFilterNet in front, we recover most of that lost accuracy with minimal latency.
Conclusion
Building robust speech recognition for real-world environments requires more than just a good ASR model. Noise is inevitable, and handling it gracefully makes the difference between a demo and a deployable system.
In this post, we explored:
- Parakeet v3: A fast, accurate, multilingual ASR model that provides word-level timestamps through its TDT decoder
- DeepFilterNet: A lightweight denoiser that cleans audio in real-time using a two-stage approach for envelope and harmonic enhancement
- The pipeline: How to chain these models together for practical use
For robotics and voice interface applications, this combination offers a compelling trade-off: state-of-the-art accuracy, real-time performance, and robustness to noisy environments.
References
Models
- Parakeet TDT 0.6B v3 - NVIDIA's multilingual ASR model
- DeepFilterNet - Real-time speech enhancement
Datasets & Benchmarks
- Open ASR Leaderboard - Hugging Face ASR model comparison
Tools
- yt-dlp - Audio/video downloading
- librosa - Audio analysis
- NVIDIA NeMo - Toolkit for Parakeet inference