Deploying Indic LLMs on-premise: an infra guide
GPU selection, INT4 quantization, ONNX Runtime optimization, and serving Llama 3 Indic at 200 ms inference latency on a 2×H100 setup.
Why Self-Host?
For enterprises in BFSI, healthcare, and government, the default cloud API model is often not viable: every inference call sends sensitive data to an external provider. The DPDP Act 2023, RBI guidelines, and IRDAI circulars all constrain how such data may be handled and, in several cases, where it may be stored and processed.
Self-hosting solves this: all inference stays inside your network perimeter.
Hardware Selection
For production at scale, we recommend NVIDIA H100 80GB or H200 141GB GPUs.
| Model | GPU | VRAM Required | Throughput |
|---|---|---|---|
| Llama 3 Indic 8B (INT4) | 1× H100 | ~6 GB | 185 tok/s |
| Llama 3 Indic 70B (INT4) | 1× H200 | ~42 GB | 85 tok/s |
| Qwen 2.5 Indic 14B (INT4) | 1× H100 | ~9 GB | 145 tok/s |
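The VRAM column can be sanity-checked with some quick arithmetic. The sketch below estimates only the size of the quantized weights, assuming round parameter counts (8B, 14B, 70B), a group size of 128, and one FP16 scale plus one packed zero point per group. The function name `int4_weight_gb` is illustrative and not part of any library.

```python
def int4_weight_gb(num_params: float, bits: int = 4, group_size: int = 128) -> float:
    """Approximate size of group-quantized weights in decimal gigabytes."""
    # Assumption: each group of `group_size` weights stores one FP16 scale
    # (16 bits) and one packed zero point (`bits` bits) as metadata.
    overhead_bits_per_weight = (16 + bits) / group_size
    total_bits = num_params * (bits + overhead_bits_per_weight)
    return total_bits / 8 / 1e9


# Round parameter counts are an assumption; real checkpoints differ slightly.
for name, params in [("8B", 8e9), ("14B", 14e9), ("70B", 70e9)]:
    print(f"{name}: ~{int4_weight_gb(params):.1f} GB of weights")
```

Under these assumptions the weights alone come to roughly 4.2, 7.3, and 36.4 GB. The figures in the table are higher, which is consistent with leaving headroom for the KV cache, activations, and runtime overhead, none of which this estimate models.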
INT4 Quantization
We use 4-bit GPTQ quantization via AutoGPTQ. The quantized weight files are signed and distributed as encrypted archives.
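The `quantize.py` script invoked in this section is not shown in the post. As a rough illustration only, a script accepting the same flags might look like the sketch below. It assumes AutoGPTQ's `AutoGPTQForCausalLM` and `BaseQuantizeConfig` classes, and the `calibration_texts` argument is a hypothetical stand-in for whatever calibration data the real script uses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring the flags used in this guide (hypothetical script layout)."""
    parser = argparse.ArgumentParser(description="GPTQ-quantize a causal LM (sketch)")
    parser.add_argument("--model-path", required=True)
    parser.add_argument("--output-path", required=True)
    parser.add_argument("--bits", type=int, default=4, choices=[2, 3, 4, 8])
    parser.add_argument("--group-size", type=int, default=128)
    return parser


def quantize(args: argparse.Namespace, calibration_texts: list) -> None:
    """Quantize the FP16 checkpoint and save it; needs a GPU and auto-gptq installed."""
    # Imported lazily so that argument parsing works without the GPU stack.
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    config = BaseQuantizeConfig(bits=args.bits, group_size=args.group_size)
    model = AutoGPTQForCausalLM.from_pretrained(args.model_path, config)

    # GPTQ calibrates on sample inputs; in-domain Indic text is the natural choice.
    examples = [tokenizer(text) for text in calibration_texts]
    model.quantize(examples)
    model.save_quantized(args.output_path)
    tokenizer.save_pretrained(args.output_path)


# Parse the same flags as the command in this section (no quantization is run here).
args = build_parser().parse_args([
    "--model-path", "./llama3-indic-8b-fp16",
    "--output-path", "./llama3-indic-8b-int4",
    "--bits", "4",
    "--group-size", "128",
])
print(args.bits, args.group_size)
```

Calling `quantize(args, calibration_texts)` would then perform the actual quantization. The signing and encryption of the resulting archives are outside the scope of this sketch.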
python quantize.py \
  --model-path ./llama3-indic-8b-fp16 \
  --output-path ./llama3-indic-8b-int4 \
  --bits 4 \
  --group-size 128

ONNX Runtime Export
Post-quantization, we export to ONNX for vendor-agnostic serving:
from optimum.exporters.onnx import main_export

# Export the quantized checkpoint to ONNX; the task is inferred from the model by default.
main_export(model_name_or_path="./llama3-indic-8b-int4", output="./onnx_model")

Kubernetes Deployment
A Helm chart is provided with HPA configured on GPU utilization metrics:
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: gpu_utilization_avg
        target:
          type: AverageValue
          averageValue: 75
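To get a feel for how these values translate into replica counts, the sketch below models the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired count is the current count scaled by the ratio of observed to target metric, rounded up, with no change inside a default 10% tolerance band. It is a simplified model that ignores stabilization windows, pod readiness, and missing metrics, and it assumes `gpu_utilization_avg` reports a per-replica average in percent. The function name `desired_replicas` is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_avg: float,
    target_avg: float = 75.0,
    min_replicas: int = 1,
    max_replicas: int = 8,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: ceil(current * observed / target), clamped to bounds."""
    ratio = current_avg / target_avg
    # Within the tolerance band the autoscaler leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 90))   # scale up: 2 replicas at 90% average utilization
print(desired_replicas(4, 78))   # inside the tolerance band: unchanged
print(desired_replicas(8, 150))  # capped at maxReplicas
```

With the chart's values, two replicas averaging 90% utilization scale to three, while four replicas at 78% stay at four because the ratio falls within the tolerance band.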