Quantize 70B models for consumer GPUs
gptq · skill · setup · L2 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Post-training 4-bit quantization for LLMs with minimal accuracy loss
Best for
Deploying 70B+ models on A100/H100 GPUs when 4× compression is needed and an accuracy loss under 2% is acceptable
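To make the 4× figure concrete, here is a rough back-of-the-envelope sketch of weight memory before and after quantization. The function names are hypothetical, and the per-group overhead (one 16-bit scale and one 4-bit zero-point per group) is an assumption about the storage layout, not something this card specifies.

```python
def fp16_weight_gib(n_params: float) -> float:
    """Weight memory in GiB at 16 bits (2 bytes) per parameter."""
    return n_params * 2 / 2**30


def quantized_weight_gib(
    n_params: float,
    bits: int = 4,
    groupsize: int = 128,
    scale_bits: int = 16,  # assumption: one fp16 scale per group
    zero_bits: int = 4,    # assumption: one 4-bit zero-point per group
) -> float:
    """Approximate weight memory in GiB for group-wise quantized weights."""
    # Every group of `groupsize` weights shares one scale and one zero-point,
    # so the metadata cost is amortized over the group.
    bits_per_weight = bits + (scale_bits + zero_bits) / groupsize
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n = 70e9  # 70B parameters
    fp16 = fp16_weight_gib(n)
    q4 = quantized_weight_gib(n)
    print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB, ratio: {fp16 / q4:.2f}x")
```

Under these assumptions a 70B model drops from roughly 130.4 GiB to about 33.9 GiB of weights, a ratio near 3.85×. Activations and the KV cache are not counted, and a smaller groupsize increases the metadata overhead.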
Inputs
- · base model
- · calibration dataset (50-100 examples)
- · quantization config (groupsize, desc_act, bits)
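The three config fields above could be collected as in the sketch below. `QuantConfig` is a hypothetical container written for illustration, not the API of any of the required libraries. The defaults, the accepted bit widths, and the use of `-1` for "no grouping" are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantConfig:
    """Hypothetical holder for the quantization settings listed above."""

    bits: int = 4           # weight precision after quantization
    groupsize: int = 128    # weights sharing one scale/zero-point; -1 = no grouping (assumed)
    desc_act: bool = False  # quantize columns in order of decreasing activation size

    def __post_init__(self) -> None:
        # Assumed set of supported bit widths; adjust to what your tooling accepts.
        if self.bits not in (2, 3, 4, 8):
            raise ValueError(f"unsupported bit width: {self.bits}")
        if self.groupsize != -1 and self.groupsize <= 0:
            raise ValueError(f"groupsize must be positive or -1, got {self.groupsize}")


if __name__ == "__main__":
    print(QuantConfig())
    print(QuantConfig(groupsize=64, desc_act=True))
```

Validating the settings up front is a design choice in this sketch: a bad value fails before the calibration pass starts.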
Outputs
- · quantized model
- · saved .safetensors or .pt
Requires
- · gptq-for-llama
- · transformers
- · torch
- · datasets
Preconditions
NVIDIA GPU with 24GB+ VRAM; calibration data available; base model loaded in memory
Failure modes
Calibrating on a dataset from the wrong domain causes accuracy drift; too small a groupsize can cause out-of-memory errors; activation quantization can break attention
Trust signals
- · GPTQ paper (Frantar et al.) published at ICLR 2023
- · Scales to 70B and 405B models
- · Works with grouped quantization
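As an illustration of what grouped quantization means, the sketch below splits a row of weights into groups and gives each group its own scale and zero-point. It uses plain round-to-nearest and omits the error-compensating weight updates that GPTQ itself performs, so treat it as a simplified stand-in. All function names are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, minimum value)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = 2**bits - 1  # 15 for 4-bit codes in the range 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def quantize_row(row: List[float], groupsize: int = 128, bits: int = 4) -> List[Group]:
    """Quantize a row in consecutive groups, each with its own scale and minimum."""
    return [
        quantize_group(row[start:start + groupsize], bits)
        for start in range(0, len(row), groupsize)
    ]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Map the integer codes back to approximate floating-point weights."""
    out: List[float] = []
    for codes, scale, lo in groups:
        out.extend(lo + c * scale for c in codes)
    return out


if __name__ == "__main__":
    row = [0.0, 0.1, 0.2, 0.3, 0.0, 0.1, 0.2, 30.0]  # one outlier at the end
    for gs in (8, 4):
        rec = dequantize_row(quantize_row(row, groupsize=gs))
        worst = max(abs(a - b) for a, b in zip(row, rec))
        print(f"groupsize={gs}: max abs error = {worst:.3f}")
```

In this example the outlier stretches the scale of whichever group contains it. With smaller groups, the weights in the other group keep a fine-grained scale, which is the usual motivation for grouping.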