Compress large language models to 4-bit

awq-quantization skill setup L29,423
Orchestra-Research/AI-Research-SKILLs
What it does

Applies Activation-aware Weight Quantization (AWQ) to compress LLM weights to INT4.
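
To make the idea concrete, here is a minimal pure-Python sketch of the two ingredients usually associated with AWQ: per-channel scaling driven by activation magnitudes, followed by group-wise 4-bit quantization. It is an illustration under stated assumptions, not the awq library's implementation: the function names are hypothetical, the exponent `alpha` is fixed here (AWQ searches for it on calibration data), and plain min/max asymmetric rounding stands in for the library's kernels.

```python
def awq_scale_channels(weight_row, act_magnitudes, alpha=0.5):
    """Scale each input channel by s_j = act_j ** alpha (hypothetical helper).

    Channels that see large activations get larger weights before rounding,
    so they lose less relative precision. At inference the matching
    activation is divided by s_j, which leaves w_j * x_j unchanged.
    """
    scales = [max(a, 1e-8) ** alpha for a in act_magnitudes]
    scaled = [w * s for w, s in zip(weight_row, scales)]
    return scaled, scales


def quantize_group(weights, n_bits=4):
    """Asymmetric min/max quantization of one weight group to n_bits codes."""
    q_max = 2 ** n_bits - 1                     # 15 for INT4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0      # fall back to 1.0 for constant groups
    zero = round(-w_min / scale)                # integer zero point
    codes = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]


if __name__ == "__main__":
    row = [-0.5, 0.0, 0.25, 1.0]
    acts = [4.0, 1.0, 1.0, 0.25]                # assumed mean |activation| per channel
    scaled, s = awq_scale_channels(row, acts)
    codes, scale, zero = quantize_group(scaled)
    # Undo the channel scaling to compare against the original weights.
    restored = [w / sj for w, sj in zip(dequantize_group(codes, scale, zero), s)]
    print(codes, restored)
```

Real implementations quantize in groups (commonly 128 weights) across full weight matrices and store the scale and zero point per group; the four-element row above is only for readability.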

Best for

Compressing LLMs for edge deployment while maintaining near-FP16 accuracy.

Inputs
  • · full-precision LLM checkpoint
  • · calibration data
Outputs
  • · INT4 quantized model
  • · reduced size (roughly 4x versus FP16, up to 8x versus FP32)
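
The size reduction follows from simple bit arithmetic, sketched below. The defaults are assumptions rather than values stated in this card: a group size of 128, one 16-bit scale and one 4-bit zero point stored per group. Both function names are hypothetical.

```python
def effective_bits_per_weight(w_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits stored per weight, including per-group scale and zero-point overhead."""
    return w_bits + (scale_bits + zero_bits) / group_size


def compression_ratio(source_bits, **kwargs):
    """Ratio of the original weight size to the quantized weight size."""
    return source_bits / effective_bits_per_weight(**kwargs)


if __name__ == "__main__":
    print(f"bits per weight: {effective_bits_per_weight():.3f}")
    print(f"vs FP16: {compression_ratio(16):.2f}x")
    print(f"vs FP32: {compression_ratio(32):.2f}x")
```

Under these assumptions the ratio lands slightly below 4x against an FP16 checkpoint and slightly below 8x against FP32, because the per-group metadata adds a fraction of a bit per weight. Layers left unquantized (for example embeddings) would lower the whole-model ratio further.
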
Requires
  • · awq library
  • · torch
Preconditions

Model weights are accessible and a calibration dataset is available (128-256 examples).
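
A small helper can check the calibration precondition before a quantization run starts. The sketch below is hypothetical: the function name, the `min_chars` length filter, and the fixed seed are illustrative choices, and only the 128-256 sample range comes from this card.

```python
import random


def pick_calibration_samples(texts, n_samples=128, min_chars=200, seed=0):
    """Return n_samples texts that are long enough to be useful for calibration.

    Raises ValueError when the corpus cannot supply enough samples, so the
    problem surfaces before any quantization work begins.
    """
    eligible = [t for t in texts if len(t.strip()) >= min_chars]
    if len(eligible) < n_samples:
        raise ValueError(
            f"need {n_samples} calibration samples, found {len(eligible)} eligible"
        )
    rng = random.Random(seed)      # fixed seed keeps the selection reproducible
    return rng.sample(eligible, n_samples)


if __name__ == "__main__":
    corpus = ["lorem ipsum " * 30] * 300 + ["too short"] * 50
    samples = pick_calibration_samples(corpus, n_samples=128)
    print(len(samples))
```

Failing early on a thin corpus is cheaper than discovering an accuracy drop after the model has been quantized.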

Failure modes
  • · accuracy drop if calibration insufficient
  • · inference speed not improved on unsupported hardware
  • · VRAM still high if batch-quantized
Trust signals
  • · 4x-8x compression ratios cited
  • · calibration data requirements specified
  • · accuracy loss benchmarked