Quantize 70B models for consumer GPUs

gptq · skill · setup · 29,423
Orchestra-Research/AI-Research-SKILLs
What it does

Post-training 4-bit quantization for LLMs with minimal accuracy loss

Best for

Deploying 70B+ models on A100/H100 GPUs when 4× compression and <2% accuracy loss are acceptable
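
The 4× figure can be sanity-checked with a little arithmetic. The sketch below estimates weight storage only (no activations or KV cache). It assumes each group stores one 16-bit scale and one zero point of `bits` bits; that layout and the helper names are illustrative assumptions, not taken from any particular library.

```python
def effective_bits_per_weight(bits: int, groupsize: int, scale_bits: int = 16) -> float:
    """Average storage per weight, including per-group metadata.

    Assumption: every group of `groupsize` weights carries one scale
    (`scale_bits` bits) and one zero point (`bits` bits).
    A groupsize <= 0 is treated as "no grouping", so overhead is ignored.
    """
    if groupsize <= 0:
        return float(bits)
    return bits + (scale_bits + bits) / groupsize


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 70e9  # a 70B-parameter model
fp16_gb = weight_gb(n_params, 16)
int4_gb = weight_gb(n_params, effective_bits_per_weight(bits=4, groupsize=128))
ratio = fp16_gb / int4_gb

print(f"fp16: {fp16_gb:.1f} GB, 4-bit (groupsize 128): {int4_gb:.1f} GB, ratio {ratio:.2f}x")
```

Under these assumptions a 70B model drops from 140 GB in fp16 to roughly 36 GB, a little under the nominal 4× because of the per-group scales and zero points. Smaller group sizes raise that overhead.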

Inputs
  • · base model
  • · calibration dataset (50-100 examples)
  • · quantization config (groupsize, desc_act, bits)
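
As a rough illustration of the quantization config input, the three settings can be grouped into one validated object. `QuantConfigSketch` is a hypothetical name, and the allowed bit widths and the `-1` "no grouping" convention are assumptions; check them against the tool actually in use.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class QuantConfigSketch:
    """Hypothetical container for the three settings named above."""
    bits: int = 4           # target weight precision
    groupsize: int = 128    # weights sharing one scale/zero point; -1 = no grouping (assumed convention)
    desc_act: bool = False  # process columns in descending activation order

    def __post_init__(self):
        # Assumed set of supported bit widths; adjust to the real tool.
        if self.bits not in (2, 3, 4, 8):
            raise ValueError(f"unsupported bits: {self.bits}")
        if self.groupsize != -1 and self.groupsize <= 0:
            raise ValueError(f"groupsize must be -1 or positive, got {self.groupsize}")


config = QuantConfigSketch(bits=4, groupsize=128, desc_act=True)
print(asdict(config))
```

Validating at construction time surfaces a bad setting before the calibration pass starts rather than partway through quantization.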
Outputs
  • · quantized model
  • · saved .safetensors or .pt
Requires
  • · gptq-for-llama
  • · transformers
  • · torch
  • · datasets
Preconditions

NVIDIA GPU with 24GB+ VRAM; calibration data available; base model loaded in memory

Failure modes

Calibration on a mismatched dataset domain causes accuracy drift; out-of-memory errors if groupsize is too small; activation quantization can break attention

Trust signals
  • · GPTQ paper (Frantar et al.) published at ICLR 2023
  • · Scales to 70B and 405B models
  • · Works with grouped quantization
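
To make "grouped quantization" concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization. It is not GPTQ itself: GPTQ also uses second-order information from calibration data to compensate for rounding error, which this sketch omits. All function names are hypothetical.

```python
def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # avoid division by zero
    # Clamp to guard against floating-point overshoot at the range edges.
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, groupsize=4, bits=4):
    """Split a weight row into groups, each with its own scale and offset."""
    return [
        quantize_group(row[start:start + groupsize], bits)
        for start in range(0, len(row), groupsize)
    ]


def dequantize_row(groups):
    out = []
    for codes, scale, lo in groups:
        out.extend(dequantize_group(codes, scale, lo))
    return out


# First four weights are small, last four span a wide range.
row = [0.10, -0.20, 0.05, 0.30, 2.0, -1.5, 0.7, 0.0]
groups = quantize_row(row, groupsize=4, bits=4)
recon = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, recon))
print("scales per group:", [g[1] for g in groups])
print("max reconstruction error:", max_err)
```

Because each group has its own scale, the narrow-range group is quantized with a much finer step than the wide-range group. That is the motivation for grouping, at the cost of storing extra scales and offsets.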