Engineering | February 2026

Deploying Indic LLMs on-premise: an infra guide

GPU selection, INT4 quantization, ONNX runtime optimization, and serving Llama 3 Indic at 200ms inference on a 2×H100 setup.

Verbalyze Engineering | 12 min read

Why Self-Host?

For enterprises in BFSI, healthcare, and government, the default approach of calling a cloud-hosted API is not viable: every inference call sends sensitive data to an external provider. The DPDP Act 2023, RBI guidelines, and IRDAI circulars all impose strict data residency requirements.

Self-hosting solves this: all inference stays inside your network perimeter.

Hardware Selection

For production at scale, we recommend NVIDIA H100 80GB or H200 141GB GPUs.

Model                        GPU        VRAM Required   Throughput
Llama 3 Indic 8B (INT4)      1× H100    ~6 GB           185 tok/s
Llama 3 Indic 70B (INT4)     1× H200    ~42 GB          85 tok/s
Qwen 2.5 Indic 14B (INT4)    1× H100    ~9 GB           145 tok/s
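
As a rough sanity check on the VRAM column, the weight footprint of a 4-bit model can be estimated from its parameter count. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the nominal parameter counts implied by the model names (8B, 70B, 14B), that every parameter is quantized, and that each group of 128 weights stores one 16-bit scale and one 4-bit zero point. The function name int4_weight_gb is hypothetical. Whatever remains of the listed VRAM figure is headroom for the KV cache, activations, and runtime buffers.

```python
def int4_weight_gb(params_billion, bits=4, group_size=128,
                   scale_bits=16, zero_bits=4):
    """Approximate size of group-quantized weights in GB (10^9 bytes).

    Assumption: all parameters are quantized; real checkpoints often keep
    embeddings or the output head in higher precision, which adds to this.
    """
    params = params_billion * 1e9
    weight_bytes = params * bits / 8
    # Each group carries one scale and one zero point as metadata.
    groups = params / group_size
    meta_bytes = groups * (scale_bits + zero_bits) / 8
    return (weight_bytes + meta_bytes) / 1e9


# (model name, nominal parameters in billions, VRAM figure from the table in GB)
rows = [
    ("Llama 3 Indic 8B", 8, 6),
    ("Llama 3 Indic 70B", 70, 42),
    ("Qwen 2.5 Indic 14B", 14, 9),
]

for name, size_b, listed_gb in rows:
    weights_gb = int4_weight_gb(size_b)
    headroom_gb = listed_gb - weights_gb
    print(f"{name}: ~{weights_gb:.1f} GB weights, "
          f"~{headroom_gb:.1f} GB left for KV cache and runtime")
```

Under these assumptions each weight costs about 4.16 bits, so the weights alone come in below the table's figures for all three models.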

INT4 Quantization

We use 4-bit GPTQ quantization via AutoGPTQ, with weights quantized in groups of 128. The quantized weight files are signed and distributed as encrypted archives.

python quantize.py \
  --model-path ./llama3-indic-8b-fp16 \
  --output-path ./llama3-indic-8b-int4 \
  --bits 4 \
  --group-size 128
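
To make the --bits and --group-size flags concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It uses plain round-to-nearest, which is a simplification: GPTQ itself additionally adjusts the remaining weights to compensate for the error each rounding step introduces. The function names are illustrative and are not part of quantize.py or AutoGPTQ.

```python
import math


def quantize_group(weights, bits=4):
    """Quantize one group of floats to unsigned codes with a shared scale and zero point."""
    qmax = (1 << bits) - 1               # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0      # fall back to 1.0 for a constant group
    zero = round(-lo / scale)            # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize(weights, bits=4, group_size=128):
    """Split a flat weight list into groups and quantize each one independently."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


# Toy "weight row": 256 values, so a group size of 128 yields two groups.
row = [0.1 * math.sin(i) for i in range(256)]
groups = quantize(row, bits=4, group_size=128)

restored = []
for codes, scale, zero in groups:
    restored.extend(dequantize_group(codes, scale, zero))

max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_err:.5f}")
```

A smaller group size gives each scale fewer weights to cover, which lowers the reconstruction error at the cost of more per-group metadata.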

ONNX Runtime Export

Post-quantization, we export to ONNX for vendor-agnostic serving:

from optimum.exporters.onnx import main_export

# Export the quantized checkpoint to an ONNX graph written to ./onnx_model
main_export(model_name_or_path="./llama3-indic-8b-int4", output="./onnx_model")

Kubernetes Deployment

A Helm chart is provided with a Horizontal Pod Autoscaler (HPA) configured on GPU utilization metrics:

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: gpu_utilization_avg
        target:
          type: AverageValue
          averageValue: 75
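
To see how these values translate into replica counts, the sketch below applies the scaling rule documented for the Kubernetes HPA: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within a tolerance (0.1 by default). It is a simplified model, not the controller itself. It assumes gpu_utilization_avg reaches the HPA as a per-replica average percentage, which depends on how the metrics adapter is configured, and it ignores stabilization windows and pod readiness. The function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(current_replicas, avg_gpu_util, target=75.0,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """Approximate the HPA decision for a per-replica average metric.

    Assumption: avg_gpu_util is the mean GPU utilization (percent) across
    the current replicas, compared against the averageValue target of 75.
    """
    ratio = avg_gpu_util / target
    # Within tolerance of the target: the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas / maxReplicas bounds from the chart values.
    return max(min_replicas, min(max_replicas, desired))


for replicas, util in [(2, 90), (4, 30), (8, 100), (3, 78)]:
    print(f"{replicas} replicas at {util}% avg GPU -> "
          f"{desired_replicas(replicas, util)} replicas")
```

With a target of 75, two replicas averaging 90% utilization scale up, while eight replicas at 100% stay capped at maxReplicas.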
