Speech-to-Text

Indian ASR that actually understands Bharat

Real-time transcription for 30+ Indian languages. Under 90ms latency. Domain-adapted for banking, healthcare, and retail — right out of the box. Trained on 50,000+ hours of real Indian audio, not translated corpora.

<90ms Latency | 3.2% Hindi WER | 30+ Languages | Code-switching | Domain Adaptation | Speaker Diarization | PII Redaction
  • <90ms Streaming Latency
  • 3.2% Hindi WER
  • 30+ Indian Languages
  • 50K+ Training Hours
  • 99.9% Uptime SLA

Capabilities

Enterprise-grade ASR features

Every feature is production-tested across millions of minutes of real Indian enterprise audio.

Sub-90ms Streaming ASR

WebSocket and gRPC streaming endpoints deliver first-word hypothesis in under 90ms. Character-level incremental output lets you start processing before the speaker finishes. Designed for real-time call centre and voice agent pipelines where latency is revenue.

30+ Indian Languages & Dialects

Hindi, Tamil, Telugu, Kannada, Malayalam, Marathi, Bengali, Gujarati, Punjabi, Odia, Urdu, Assamese, Maithili, Bhojpuri, Rajasthani, Haryanvi, and 14 more. Each language model is trained on native speaker data — not translated from English corpora.

Code-Switching Intelligence

Hinglish, Tanglish, Manglish — India's natural multilingual speech is handled natively. No post-processing hacks. Our models are trained on real code-switched call recordings from BPO, banking, and retail environments, covering 200M+ code-switching instances.

Domain Vocabulary Adaptation

Pre-trained domain models for BFSI (account numbers, IFSC codes, EMI vocabulary), healthcare (ICD codes, drug names, clinical terms), retail (SKUs, order IDs, courier jargon), and legal. Reduces domain WER by up to 35% vs generic models.

Automatic PII Redaction

Real-time redaction of Aadhaar numbers, PAN cards, bank account numbers, IFSC codes, credit card numbers, phone numbers, and UPI IDs before transcript storage. Configurable redaction policies. Full DPDP Act 2023 compliance with audit trails.
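
As a rough illustration of what pattern-based redaction looks like, the sketch below masks a few of the identifier types listed above with regular expressions. It is not the product's redaction engine: the `redact` function, the placeholder labels, and the patterns themselves are assumptions, based only on the commonly documented formats of PAN (five letters, four digits, one letter), IFSC (four letters, a zero, six alphanumerics), and Aadhaar (12 digits, often written in groups of four). The UPI pattern is deliberately loose and would also catch parts of email addresses.

```python
import re

# Illustrative patterns only (assumed, not taken from the product).
# Order matters: patterns earlier in the list are applied first.
PII_PATTERNS = [
    ("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    ("IFSC", re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")),
    ("AADHAAR", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("UPI", re.compile(r"\b[\w.\-]{2,}@[A-Za-z]{2,}\b")),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "PAN ABCDE1234F, IFSC HDFC0001234, Aadhaar 1234 5678 9012, UPI ravi.k@okaxis"
print(redact(sample))
# PAN [PAN], IFSC [IFSC], Aadhaar [AADHAAR], UPI [UPI]
```

A real-time system would additionally need to handle digits spoken as words and to validate candidates before masking; this sketch only shows the masking step on already-normalized text.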

Speaker Diarization

Automatically separate and label agent and customer speech in dual-channel call recordings. Outputs speaker-timestamped JSON. Critical for call QA, compliance, and average handle time (AHT) calculation. Supports multi-speaker meeting transcription for up to 8 participants.
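
To give a sense of how speaker-timestamped output can feed call analytics, here is a minimal sketch that totals talk time per speaker. The segment shape (`speaker`, `start`, `end`) follows the `result.speakers` example in the Integration section; the `talk_time` helper is hypothetical and not part of the SDK.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum speaking time in seconds for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    # Round to hide floating-point noise in the per-segment differences.
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


segments = [
    {"speaker": "agent", "start": 0.0, "end": 2.1},
    {"speaker": "customer", "start": 2.3, "end": 5.6},
    {"speaker": "agent", "start": 5.8, "end": 7.0},
]
print(talk_time(segments))
# {'agent': 3.3, 'customer': 3.3}
```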

Architecture

How our ASR pipeline works

01
Audio In

Send raw PCM audio over WebSocket or upload a file via REST. Supports 8kHz telephony and 16kHz wideband audio.

02
VAD & Segmentation

Voice Activity Detection (VAD) filters out silence, segments speech, and handles overlapping speech in real time.
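
For readers new to VAD, the sketch below shows the simplest form of the idea: mark a frame as speech when its RMS energy crosses a threshold, then merge consecutive speech frames into spans. This is a textbook energy-threshold detector, not the model used in this pipeline, and it does not handle overlapping speech. The function names, the 20 ms frame size, and the threshold value are all assumptions.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of integer PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(samples, sample_rate=8000, frame_ms=20, threshold=500.0):
    """Return (start_sec, end_sec) spans whose frame RMS meets the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len  # a trailing partial frame is ignored
    spans, start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        t = i * frame_ms / 1000
        if frame_rms(frame) >= threshold:
            if start is None:
                start = t  # speech begins at this frame
        elif start is not None:
            spans.append((start, t))  # speech ended at this frame boundary
            start = None
    if start is not None:
        spans.append((start, n_frames * frame_ms / 1000))
    return spans


# 100 ms of silence, 200 ms of a 440 Hz tone, 100 ms of silence at 8 kHz.
silence = [0] * 800
tone = [int(8000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(1600)]
print(detect_speech(silence + tone + silence))
# [(0.1, 0.3)]
```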

03
Language ID

Automatic language identification per utterance. No need to pre-declare language for multilingual calls.

04
ASR Inference

CTC-based acoustic model + language model beam search. Domain vocabulary injected at decode time.
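
The core CTC decoding rule can be shown in a few lines. The sketch below uses greedy decoding (best symbol per frame, merge repeats, drop blanks) rather than the language-model beam search described above, because it is the shortest way to show how frame-level outputs collapse into text. The vocabulary, scores, and function name are made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this toy vocabulary


def ctc_greedy_decode(frame_scores, vocab):
    """Pick the best symbol per frame, merge adjacent repeats, drop blanks."""
    best = [
        vocab[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    collapsed = [symbol for symbol, _ in groupby(best)]
    return "".join(s for s in collapsed if s != BLANK)


vocab = [BLANK, "a", "b"]
frame_scores = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.70, 0.10],  # a (a new "a", because a blank separates it)
    [0.10, 0.20, 0.70],  # b
]
print(ctc_greedy_decode(frame_scores, vocab))
# aab
```

Beam search keeps several candidate prefixes per frame instead of one, which is what allows a language model or an injected domain vocabulary to influence the result.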

05
Post-Processing

Punctuation, number normalization, entity formatting, and PII redaction are applied in a streaming post-processor.

06
Transcript Out

JSON response with text, confidence scores, speaker labels, timestamps, and redacted fields.

Language Coverage

Native models for every Indian language

Each language model is independently trained on native speaker audio — not derived from cross-lingual transfer or English model fine-tuning. This means our models capture real phonetic patterns, regional accents, and dialectal variation that generic multilingual models miss.

  • Separate acoustic + language models per language for maximum accuracy
  • Dialect variant support: Bhojpuri-Hindi, Haryanvi-Hindi, Tulu-Kannada
  • Script-aware: Devanagari, Tamil, Telugu, Kannada, Bengali scripts
  • Automatic Language Identification per utterance — no need to declare language
  • Continuous language model updates based on new enterprise feedback data
Request Language Trial
Language Accuracy Benchmarks
Language    BCP-47   WER    Key Domains
Hindi       hi-IN    3.2%   BFSI, BPO
Tamil       ta-IN    4.1%   Healthcare, Retail
Telugu      te-IN    4.5%   Agri, Govt
Kannada     kn-IN    5.0%   EdTech, IT
Malayalam   ml-IN    4.8%   Healthcare
Marathi     mr-IN    4.3%   BFSI, BPO
Bengali     bn-IN    4.7%   EdTech, Govt
Gujarati    gu-IN    4.4%   BFSI, Trade
+ 22 more languages available
Integration

Integrate in minutes, not weeks

Official SDKs for Python and Node.js. REST API for any language. WebSocket for real-time streaming. All APIs documented with runnable examples.

  • Python SDK (pip install verbalyze)
  • Node.js SDK (npm install @verbalyze/sdk)
  • REST API — works with any language
  • WebSocket streaming endpoint
  • Postman collection available
transcribe.py
import verbalyze as vb

client = vb.Client(api_key="vb_sk_...")

# Batch transcription
result = client.transcribe(
    audio="call_recording.wav",
    language="hi-IN",
    domain="banking",
    diarize=True,      # separate speakers
    pii_redact=True,   # auto-redact PII
)

print(result.text)
# → "नमस्ते, मेरा [ACCOUNT] बंद हो गया है"
print(result.speakers)
# → [{"speaker": "agent", "start": 0.0, "end": 2.1},
#    {"speaker": "customer", "start": 2.3, "end": 5.6}]
print(f"Latency: {result.latency_ms}ms | Confidence: {result.confidence}")
Use Cases

Built for India's most demanding voice use cases

📞 BPO / Contact Centres

Live Call Transcription

Real-time agent+customer transcript for live call monitoring, QA scoring, and supervisor alerts. Reduces manual QA effort by 80%.

🏦 BFSI

EMI & Loan Collections

Transcribe collection calls, extract promise-to-pay commitments, and auto-populate CRM with disposition outcomes. Reduces AHT by 47%.

🏥 Healthcare

Doctor Dictation

Clinical vocabulary ASR for doctor notes, OPD prescriptions, and discharge summaries. ICD-10 code recognition built-in.

🛒 Retail / E-Commerce

Customer Support Automation

Transcribe and classify inbound support calls to auto-route to the right resolution flow. Handles 10,000+ calls/day.

🎓 EdTech

Voice-Based Assessments

Evaluate spoken answers in Hindi and regional languages for language learning, pronunciation scoring, and oral exams.

🏛️ Government / NGO

Government Field Surveys

Field data collection in regional languages. Voice forms for census, agriculture, and healthcare surveys in rural India.

Common questions

What audio formats do you support?

WAV, MP3, FLAC, OGG, M4A, WebM, and raw PCM. Streaming accepts 16-bit PCM at 8kHz (telephony) or 16kHz (wideband). Automatic format detection for batch uploads.
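
If you are preparing audio for the streaming endpoint, a quick local check can catch format mismatches before you connect. The sketch below uses Python's standard `wave` module to confirm that a WAV payload is 16-bit and sampled at 8kHz or 16kHz, as stated above. The `check_streamable` and `make_wav` helpers are hypothetical, and the check is only a client-side convenience, not a description of server-side validation.

```python
import io
import wave

SUPPORTED_RATES = {8000, 16000}  # streaming sample rates stated in the FAQ


def check_streamable(wav_bytes):
    """Return (ok, reason) for a WAV payload against the streaming format."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            return False, f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit"
        if wav.getframerate() not in SUPPORTED_RATES:
            return False, f"unsupported sample rate {wav.getframerate()} Hz"
    return True, "ok"


def make_wav(rate, width=2, seconds=0.1):
    """Build a short silent mono WAV in memory, used here for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * int(rate * seconds) * width)
    return buf.getvalue()


print(check_streamable(make_wav(8000)))
print(check_streamable(make_wav(44100)))
```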

How does streaming work?

Connect to wss://api.verbalyze.in/v2/stt/stream via WebSocket. Send audio chunks and receive incremental transcription tokens in real time. Supports backpressure and reconnection.
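
Before sending, raw PCM has to be cut into chunks. The sketch below shows one way to do that, assuming 100 ms chunks of 16-bit mono audio; the chunk duration is our assumption, not a documented requirement, and `pcm_chunks` is a hypothetical helper. Opening the socket and the message framing are left out because they are not specified here.

```python
def pcm_chunks(pcm, sample_rate=16000, chunk_ms=100, bytes_per_sample=2):
    """Yield consecutive chunks of raw PCM; the final chunk may be shorter."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


audio = bytes(16000 * 2)  # one second of 16 kHz, 16-bit silence
chunks = list(pcm_chunks(audio))
print(len(chunks), len(chunks[0]))
# 10 3200
```

Each yielded chunk would then be sent as one binary message over the WebSocket connection, with the client pausing between sends if it is reading from a file rather than a live microphone.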

Can I get word-level timestamps?

Yes. Set word_timestamps=true in your request to receive start and end times for each word in the transcript. Useful for subtitle generation and call analytics.
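
As an example of the subtitle use case, the sketch below turns word-level timestamps into SubRip (SRT) cues. The per-word field names (`word`, `start`, `end`) are assumed for illustration and may differ from the actual response; the grouping rule of four words per cue is also arbitrary.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=4):
    """Group words into cues of up to max_words and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        start, end = srt_time(group[0]["start"]), srt_time(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


words = [
    {"word": "mera", "start": 0.0, "end": 0.3},
    {"word": "account", "start": 0.35, "end": 0.8},
    {"word": "band", "start": 0.85, "end": 1.1},
    {"word": "ho", "start": 1.15, "end": 1.25},
    {"word": "gaya", "start": 1.3, "end": 1.6},
]
print(words_to_srt(words))
```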

How is accuracy measured?

We report Word Error Rate (WER) on a held-out benchmark dataset of native Indian audio. Our Hindi model achieves 3.2% WER. Domain-adapted models reduce WER on domain vocabulary by 15–35% relative to generic models.
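
For reference, WER is the word-level edit distance between the reference transcript and the ASR output (substitutions + deletions + insertions), divided by the number of reference words. The sketch below is a plain implementation of that standard definition for checking results on your own audio. It does no text normalization, which a real evaluation would need, and it is not our benchmarking harness.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "mera account band ho gaya hai"
hypothesis = "mera account bandh ho gaya"
print(f"{wer(reference, hypothesis):.3f}")
# 0.333
```

In the example, one substitution and one deletion against six reference words give 2/6. By the same definition, a WER of 3.2% corresponds to roughly 32 word errors per 1,000 reference words.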

Is on-premise deployment available?

Yes. Our ASR models are available as Docker containers for on-premise or private cloud deployment. Contact us for GPU requirements and deployment support.

Ready to transcribe India's voice?

Get 10,000 free API minutes. No credit card required.