Engineering · February 2026

Deploying Indic LLMs on-premise: an infra guide

GPU selection, INT4 quantization, ONNX Runtime optimization, and serving Llama 3 Indic at 200 ms inference latency on a 2×H100 setup.

Verbalyze Engineering · 12 min read

Why Self-Host?

For enterprises in BFSI (banking, financial services, and insurance), healthcare, and government, the default cloud API model is not viable: every inference call sends sensitive data to an external provider. The DPDP Act 2023, RBI guidelines, and IRDAI circulars all impose strict data residency requirements.

Self-hosting solves this: all inference stays inside your network perimeter.

Hardware Selection

For production at scale, we recommend NVIDIA H100 80GB or H200 141GB GPUs.

Model	GPU	VRAM Required	Throughput
Llama 3 Indic 8B (INT4)	1× H100	~6 GB	185 tok/s
Llama 3 Indic 70B (INT4)	1× H200	~42 GB	85 tok/s
Qwen 2.5 Indic 14B (INT4)	1× H100	~9 GB	145 tok/s
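
The VRAM column can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory only, assuming a 4-bit layout with group size 128 that stores one FP16 scale and one 4-bit zero point per group (a common GPTQ storage layout, assumed here rather than taken from the source) and nominal parameter counts of 8B, 14B, and 70B. The helper name `int4_weight_gib` is hypothetical. KV cache, activations, unquantized layers, and runtime overhead are not modelled, which is one reason the table's figures are higher than these estimates.

```python
def int4_weight_gib(n_params: float, group_size: int = 128,
                    scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Estimate weight memory in GiB for group-wise 4-bit quantization."""
    # Each weight costs 4 bits, plus its share of the per-group scale and zero point.
    bits_per_weight = 4 + (scale_bits + zero_bits) / group_size
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal parameter counts (assumed; real checkpoints differ slightly).
for name, params in [("8B", 8e9), ("14B", 14e9), ("70B", 70e9)]:
    print(f"{name}: ~{int4_weight_gib(params):.1f} GiB of weights")
```

Under these assumptions the per-group metadata adds roughly 4% on top of the raw 4 bits per weight, so most of the gap to the table's numbers would come from the serving runtime rather than from the weights themselves.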

INT4 Quantization

We apply 4-bit GPTQ quantization via AutoGPTQ, with a group size of 128. The quantized weight files are signed and distributed as encrypted archives.

python quantize.py \
  --model-path ./llama3-indic-8b-fp16 \
  --output-path ./llama3-indic-8b-int4 \
  --bits 4 \
  --group-size 128
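
To illustrate what `--bits 4` and `--group-size 128` control, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It uses plain round-to-nearest with one scale and one offset per group; GPTQ additionally adjusts the not-yet-quantized weights to compensate for rounding error, which is not shown here. All function names are hypothetical and are not part of AutoGPTQ or of `quantize.py`.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (4-bit codes, scale, minimum weight)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = 2**bits - 1                      # 15 for 4-bit, so codes span 0..15
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0   # fall back to 1.0 for constant groups
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def quantize_row(row: List[float], bits: int = 4, group_size: int = 128) -> List[Group]:
    """Split a weight row into groups, each with its own scale and offset."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


# Toy weight row: 256 values, so two groups of 128.
row = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max reconstruction error {max_err:.5f}")
```

A smaller group size gives each scale fewer weights to cover, which tends to reduce rounding error at the cost of storing more per-group metadata.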

ONNX Runtime Export

Post-quantization, we export to ONNX for vendor-agnostic serving:

from optimum.exporters.onnx import main_export

# Export the quantized checkpoint to an ONNX model directory.
main_export(model_name_or_path="./llama3-indic-8b-int4", output="./onnx_model")

Kubernetes Deployment

The provided Helm chart configures a HorizontalPodAutoscaler (HPA) that scales on a GPU utilization metric:

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: gpu_utilization_avg
        target:
          type: AverageValue
          averageValue: 75
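
To see how these values translate into replica counts, the sketch below applies the scaling rule documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current value / target value), clamped to the configured bounds. It assumes `gpu_utilization_avg` is a percentage already averaged per pod and uses the default 10% tolerance; the real controller also applies stabilization windows and pod-readiness handling, which are omitted. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float,
                     target_avg: float = 75.0, min_replicas: int = 1,
                     max_replicas: int = 8, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for an average-value target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go outside minReplicas..maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical utilization readings for illustration.
for replicas, util in [(2, 95.0), (4, 78.0), (4, 30.0), (6, 150.0)]:
    print(f"{replicas} replicas at {util:.0f}% -> {desired_replicas(replicas, util)}")
```

With a target of 75, sustained utilization has to leave the 67.5 to 82.5 range before the replica count changes under this rule.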
