Research, June 2026

How we achieved 3.2% WER on Hindi ASR

A deep dive into our training methodology: 50,000 hours of native Indian audio, data augmentation for dialects, and CTC-based fine-tuning.

Verbalyze Research Team, 8 min read

Overview

Achieving 3.2% Word Error Rate (WER) on Hindi Automatic Speech Recognition (ASR) is the result of three years of dedicated model training, dataset curation, and architectural iteration.

Dataset Construction

Our training corpus spans 50,000 hours of native Indian audio, recorded across 14 Hindi dialect zones including Braj Bhasha, Avadhi, Bhojpuri-accented Hindi, Haryanvi-accented Hindi, and standard Khari Boli. Every audio file was transcribed by native speakers using a strict double-blind review protocol.

We deliberately excluded translated corpora and synthetic TTS-generated data from training. Generic LLMs trained on translated data routinely fail on natural Indian speech cadences, whereas our models learn only from real recordings.

Architecture

We use a hybrid CTC + attention decoder built on a 300M-parameter Conformer encoder. The Conformer architecture is particularly effective for Indian languages because:

▸ Local self-attention captures short-range phoneme patterns

▸ Depthwise convolution captures sub-word morphology

▸ Global self-attention captures long-range prosodic patterns
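
For readers less familiar with CTC, the sketch below shows how a CTC output branch is typically read out with greedy decoding: take the best label per encoder frame, merge consecutive repeats, then drop the blank symbol. It is an illustration only. The blank index, the three-label toy vocabulary, and the choice of greedy decoding are assumptions made for this example, and a hybrid system would normally combine CTC scores with the attention decoder rather than decode from CTC alone.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank_id=BLANK_ID):
    """Collapse per-frame scores into a label sequence using the CTC rule."""
    # 1. Pick the highest-scoring label for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2. Merge runs of the same label into a single label.
    merged = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks, which only separate genuine repeats.
    return [label for label in merged if label != blank_id]


def one_hot(index, size=3):
    """Toy helper: a frame whose score mass sits on one label."""
    return [1.0 if k == index else 0.0 for k in range(size)]


# Frame-level path 1 1 _ 1 2 2 _ decodes to the labels 1 1 2.
frames = [one_hot(i) for i in (1, 1, 0, 1, 2, 2, 0)]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

The blank between the two runs of label 1 is what lets the decoder keep a genuinely repeated label instead of merging it away.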

Data Augmentation

To handle dialectal variation, we applied:

▸ Speed perturbation at 0.9×, 1.0×, and 1.1× rates

▸ Pitch shifting ±2 semitones

▸ Room impulse response (RIR) convolution from 50 recorded environments

▸ Additive noise from crowdsourced Indian urban and rural background recordings
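
Three of the augmentations listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the production pipeline: the function names are hypothetical, the 10 dB signal-to-noise ratio is an assumed value that the post does not specify, and a real system would use an optimized audio library. Pitch shifting is left out because it needs a time-stretching step that does not fit a short example.

```python
import math

SPEED_RATES = (0.9, 1.0, 1.1)  # the three rates listed above


def speed_perturb(samples, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) audio."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    out = []
    for n in range(max(1, int(len(samples) / rate))):
        pos = n * rate                # fractional position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def apply_rir(samples, rir):
    """Simulate a room by convolving the signal with its impulse response."""
    if not samples or not rir:
        return []
    out = [0.0] * (len(samples) + len(rir) - 1)
    for i, s in enumerate(samples):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def mix_noise(speech, noise, snr_db):
    """Add noise, looped to the speech length, at the requested SNR in dB."""
    if not speech or not noise:
        return list(speech)
    looped = [noise[n % len(noise)] for n in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(x * x for x in looped) / len(looped)
    if speech_power == 0 or noise_power == 0:
        return list(speech)
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * x for s, x in zip(speech, looped)]


# Toy input: a 220 Hz tone, 0.1 s at 16 kHz.
tone = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(1600)]
variants = [speed_perturb(tone, rate) for rate in SPEED_RATES]
noisy = mix_noise(tone, [0.3, -0.3], snr_db=10.0)  # assumed SNR
```

Speed perturbation by resampling changes both tempo and pitch, which is one reason it is often paired with a separate pitch-shifting step.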

Results

Final benchmark result on the IndicSUPERB Hindi test set: 3.2% WER, outperforming the next-best public model by 2.1 percentage points.
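
For reference, WER is the word-level edit distance between the reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below computes it for a single utterance. The function name and the romanized example sentence are illustrative, and it assumes simple whitespace tokenization with no text normalization, which may differ from the scoring setup used for the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("main ghar ja raha hoon", "main ghar ja rahi hoon"))  # 0.2
```

Benchmark-level WER is normally computed by summing errors and reference words over the whole test set rather than averaging per-utterance scores.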
