Research · March 2026

Code-switching in Indian languages: the unsolved NLP problem

Hindi-English, Tanglish, Hinglish — why generic models fail and how we trained on 10,000 hours of natural code-switched audio.

Verbalyze Research Team · 10 min read

What is Code-Switching?

Code-switching is the practice of alternating between two or more languages within a single conversation — or even a single sentence. In India, it is not a linguistic edge case. It is the default mode of communication for hundreds of millions of urban speakers.

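Hinglish: "Aaj meeting ke baad mujhe report submit karni hai by EOD." (English: "Today, after the meeting, I have to submit the report by EOD.")

Tanglish: "Naan office-ku poi project submit panni varen." (English: "I'll go to the office, submit the project, and come back.")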
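
To make the idea of intra-sentence switching concrete, here is a minimal sketch that counts switch points in the Hinglish example. The per-token language tags are hand-assigned for illustration (an assumption, not the output of any tagger), and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hand-assigned language tags for the Hinglish example sentence.
# "hi" = Hindi, "en" = English. These labels are illustrative assumptions.
HINGLISH: List[Tuple[str, str]] = [
    ("Aaj", "hi"), ("meeting", "en"), ("ke", "hi"), ("baad", "hi"),
    ("mujhe", "hi"), ("report", "en"), ("submit", "en"), ("karni", "hi"),
    ("hai", "hi"), ("by", "en"), ("EOD", "en"),
]


def count_switch_points(tagged: List[Tuple[str, str]]) -> int:
    """Count positions where a token's language differs from the previous token's."""
    tags = [lang for _, lang in tagged]
    return sum(1 for prev, cur in zip(tags, tags[1:]) if prev != cur)


def language_share(tagged: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of tokens belonging to each language."""
    counts: Dict[str, int] = {}
    for _, lang in tagged:
        counts[lang] = counts.get(lang, 0) + 1
    return {lang: n / len(tagged) for lang, n in counts.items()}


if __name__ == "__main__":
    print("switch points:", count_switch_points(HINGLISH))
    print("language share:", language_share(HINGLISH))
```

With these tags, the eleven-word sentence changes language five times, which is why treating it as "Hindi with a few loanwords" or "English with a few loanwords" both miss the structure.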

Why Generic Models Fail

Models trained on monolingual corpora — even very large ones — catastrophically fail on code-switched input because:

  • Subword tokenizers break code-switched words into long runs of short, rarely seen fragments
  • Language model priors strongly penalize mixing
  • Acoustic models have no exposure to the prosodic patterns of code-switched speech
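
The tokenizer point in the list above can be illustrated with a toy example. The sketch below is a simplified greedy longest-match segmenter over a tiny, made-up English-only vocabulary. It is not the tokenizer of any particular model, and the vocabulary and function names are hypothetical. It only shows how romanized Hindi words end up as many more pieces per word than English words.

```python
from typing import List, Set

# Toy subword vocabulary, as if learned from English-only text (hypothetical).
# Single letters are included as a fallback, as real subword vocabularies usually do.
ENGLISH_VOCAB: Set[str] = {
    "meeting", "report", "submit", "after", "today", "the", "by",
    "re", "port", "sub", "mit", "ing",
} | set("abcdefghijklmnopqrstuvwxyz")


def greedy_subword_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Segment a word by repeatedly taking the longest prefix found in vocab."""
    word = word.lower()
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No entry covers this character at all: emit an unknown marker.
            pieces.append("<unk>")
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces


def fertility(words: List[str], vocab: Set[str]) -> float:
    """Average number of subword pieces per word."""
    return sum(len(greedy_subword_tokenize(w, vocab)) for w in words) / len(words)


if __name__ == "__main__":
    english_words = ["meeting", "report", "submit", "by"]
    hindi_words = ["aaj", "ke", "baad", "mujhe", "karni", "hai"]
    for w in english_words + hindi_words:
        print(w, "->", greedy_subword_tokenize(w, ENGLISH_VOCAB))
    print("English fertility:", fertility(english_words, ENGLISH_VOCAB))
    print("Hindi fertility:", fertility(hindi_words, ENGLISH_VOCAB))
```

In this toy setup the English words each stay whole, while the Hindi words fall back to single characters. A downstream model then has to recover word meaning from sequences it has rarely or never seen.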
Our Approach

We curated 10,000 hours of naturally occurring code-switched audio from BPO call recordings, radio broadcasts, and social media video. All data was collected with explicit consent and anonymized.

We trained a single multilingual encoder jointly across 30 Indian languages, so the model shares representations across language boundaries rather than treating each language in isolation.

Benchmark Results

On our internal Hinglish test set (1,200 utterances, naturalistic BPO calls):

  • Generic Whisper Large v3: 28.4% WER
  • Google Cloud Speech (Indic): 21.7% WER
  • Verbalyze Hinglish ASR: 8.9% WER
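
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal implementation that assumes plain whitespace tokenization. The post does not say how transcripts were normalized before scoring (for example, how romanized and native-script spellings were reconciled), so treat this as an illustration of the metric rather than the exact evaluation setup. The helper `relative_wer_reduction` is a hypothetical name used to put the figures above in relative terms.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur[j] = min(
                prev[j] + 1,         # deletion (reference word missing from output)
                cur[j - 1] + 1,      # insertion (extra word in output)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = cur
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline's errors removed by the system."""
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    ref = "aaj meeting ke baad report submit karni hai"
    hyp = "aaj meeting ke bad report submit karni"
    print("WER:", word_error_rate(ref, hyp))
    # Figures reported in the list above, in percent.
    print("vs Whisper Large v3:", relative_wer_reduction(28.4, 8.9))
    print("vs Google Cloud Speech:", relative_wer_reduction(21.7, 8.9))
```

Applied to the reported numbers, 8.9% WER corresponds to roughly a 69% relative reduction against the 28.4% baseline and roughly 59% against the 21.7% baseline.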