Code-switching in Indian languages: the unsolved NLP problem
Hinglish (Hindi-English) and Tanglish (Tamil-English): why generic models fail and how we trained on 10,000 hours of natural code-switched audio.
What is Code-Switching?
Code-switching is the practice of alternating between two or more languages within a single conversation — or even a single sentence. In India, it is not a linguistic edge case. It is the default mode of communication for hundreds of millions of urban speakers.
Hinglish: "Aaj meeting ke baad mujhe report submit karni hai by EOD."
Tanglish: "Naan office-ku poi project submit panni varen."
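To make the switching concrete, here is a minimal Python sketch that tags each word of the Hinglish example with a language and counts how often the language changes. The lexicon `HINDI_WORDS` and the helper names are hypothetical and exist only for this illustration; a production system would use a trained token-level language identifier, not a word list.

```python
import re

# Hypothetical toy lexicon of romanized Hindi words (assumption for this example only).
HINDI_WORDS = {"aaj", "ke", "baad", "mujhe", "karni", "hai"}

def tag_tokens(sentence):
    """Label each alphabetic token 'hi' if it is in the toy lexicon, else 'en'."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [(tok, "hi" if tok.lower() in HINDI_WORDS else "en") for tok in tokens]

def count_switches(tagged):
    """Count positions where the language label differs from the previous token's."""
    langs = [lang for _, lang in tagged]
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)

sentence = "Aaj meeting ke baad mujhe report submit karni hai by EOD."
tagged = tag_tokens(sentence)
print(tagged)
print("switch points:", count_switches(tagged))
```

Under this toy lexicon, the 11-word sentence changes language five times, which is the kind of density a monolingual model never sees in training.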
Why Generic Models Fail
Models trained on monolingual corpora, even very large ones, fail catastrophically on code-switched input because:
Our Approach
We curated 10,000 hours of naturally occurring code-switched audio from BPO call recordings, radio broadcasts, and social media video. All data was collected with explicit consent and anonymized.
We trained a single multilingual encoder jointly across 30 Indian languages, so the model shares representations across language boundaries instead of treating each language in isolation.
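The intuition behind a shared representation can be sketched with a toy vocabulary experiment in Python. This is not our encoder or training code: the tiny corpora, the word-level vocabularies, and the function names are all assumptions made for illustration. It only shows that a mixed sentence maps cleanly into one shared token space, while per-language vocabularies each lose part of it.

```python
def build_vocab(sentences):
    """Map each distinct lowercase word to an integer ID; ID 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to IDs, using 0 for words missing from the vocabulary."""
    return [vocab.get(word, 0) for word in sentence.lower().split()]

def unk_rate(sentence, vocab):
    """Fraction of words in the sentence that the vocabulary cannot represent."""
    ids = encode(sentence, vocab)
    return ids.count(0) / len(ids)

# Hypothetical toy corpora (assumptions for this example only).
hindi_corpus = ["aaj mujhe padhai karni hai", "khaane ke baad"]
english_corpus = ["submit the report after the meeting"]

hindi_vocab = build_vocab(hindi_corpus)
english_vocab = build_vocab(english_corpus)
shared_vocab = build_vocab(hindi_corpus + english_corpus)

mixed = "aaj meeting ke baad mujhe report submit karni hai"
for name, vocab in [("hindi", hindi_vocab), ("english", english_vocab), ("shared", shared_vocab)]:
    print(f"{name:8s} unknown-word rate: {unk_rate(mixed, vocab):.2f}")
```

In this toy setup, the Hindi-only vocabulary cannot represent the English words, the English-only vocabulary cannot represent the Hindi words, and only the shared vocabulary covers the whole sentence.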
Benchmark Results
On our internal Hinglish test set (1,200 utterances, naturalistic BPO calls):