Compress large language models to 4-bit
awq-quantization · skill · setup · L2 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Apply Activation-aware Weight Quantization (AWQ) to compress LLM weights to INT4
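As a rough illustration of the idea, the sketch below scales each weight by a power of its input channel's activation magnitude before 4-bit rounding, then undoes the scaling, and searches the exponent on a small grid. It is a toy in pure Python, not the library's implementation: the function names, the sample numbers, the single-group layout, and the use of per-channel activation magnitudes as the error weighting are all assumptions made for illustration.

```python
def quantize_group(weights, n_bits=4):
    """Round-to-nearest quantization of one group; returns dequantized floats."""
    qmax = 2 ** n_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 when all weights are equal
    out = []
    for w in weights:
        q = max(0, min(qmax, round((w - lo) / scale)))  # integer code in [0, qmax]
        out.append(q * scale + lo)
    return out


def awq_quantize_row(row, act_scale, alpha, n_bits=4):
    """Scale weights by act_scale**alpha, quantize, then divide the scale back out."""
    s = [max(a, 1e-8) ** alpha for a in act_scale]
    scaled = [w * si for w, si in zip(row, s)]
    deq = quantize_group(scaled, n_bits)
    return [d / si for d, si in zip(deq, s)]


def output_error(row, approx, act_scale):
    """Squared weight error, weighted by activation magnitude (a proxy for output error)."""
    return sum((a * (w - q)) ** 2 for w, q, a in zip(row, approx, act_scale))


def search_alpha(row, act_scale, grid=20, n_bits=4):
    """Grid-search alpha in [0, 1]; alpha = 0 is plain round-to-nearest."""
    best = None
    for k in range(grid + 1):
        alpha = k / grid
        approx = awq_quantize_row(row, act_scale, alpha, n_bits)
        err = output_error(row, approx, act_scale)
        if best is None or err < best[1]:
            best = (alpha, err, approx)
    return best


# Hypothetical data: one output row, with input channel 2 seeing large activations.
row = [0.9, -0.4, 0.05, 0.3, -0.7, 0.02, 0.6, -0.1]
act_scale = [0.1, 0.2, 8.0, 0.1, 0.3, 0.2, 0.1, 0.2]

rtn_err = output_error(row, quantize_group(row), act_scale)
alpha, awq_err, _ = search_alpha(row, act_scale)
print(f"round-to-nearest error: {rtn_err:.6f}")
print(f"activation-aware error: {awq_err:.6f} (alpha={alpha})")
```

Because `alpha = 0` is part of the grid, the searched result can never be worse than plain rounding under this error measure.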
Best for
Compressing large LLMs for edge deployment while maintaining near-FP16 accuracy.
Inputs
- full-precision LLM checkpoint
- calibration data
Outputs
- INT4 quantized model
- reduced model size (4x-8x: roughly 4x smaller than FP16, 8x smaller than FP32)
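The size reduction can be estimated with simple arithmetic. The sketch below assumes 4-bit weights stored in groups of 128, each group carrying a 16-bit scale and a 4-bit zero point; these storage parameters and the 7B parameter count are illustrative assumptions, not figures from this card.

```python
def full_precision_size_gb(n_params, bits=16):
    """Size in GB of weights stored at the given bit width."""
    return n_params * bits / 8 / 1e9


def quantized_size_gb(n_params, w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Approximate size in GB of w_bit weights plus per-group scale and zero point."""
    bits_per_weight = w_bit + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # hypothetical 7B-parameter model
fp16 = full_precision_size_gb(n_params, 16)
fp32 = full_precision_size_gb(n_params, 32)
int4 = quantized_size_gb(n_params)

print(f"FP16: {fp16:.2f} GB, FP32: {fp32:.2f} GB, INT4: {int4:.2f} GB")
print(f"ratio vs FP16: {fp16 / int4:.2f}x, vs FP32: {fp32 / int4:.2f}x")
```

Under these assumptions the ratio sits a little below 4x against FP16 and a little below 8x against FP32, because of the per-group metadata. Real checkpoints may compress less if some layers are left unquantized.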
Requires
- awq library
- torch
Preconditions
Model weights accessible, calibration dataset available (128-256 examples)
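A minimal sketch of the end-to-end flow, assuming the "awq library" is the AutoAWQ package (imported as `awq`) used together with Hugging Face `transformers`; the card does not name the exact package, so treat that as an assumption. The wrapper name `quantize_with_autoawq`, the paths, and the `quant_config` values are illustrative. The 128-256 check mirrors the precondition stated above.

```python
def quantize_with_autoawq(model_path, quant_path, calib_texts):
    """Quantize a causal LM checkpoint to 4-bit AWQ and save it to quant_path.

    calib_texts is a list of raw text strings used for calibration.
    """
    # Enforce the card's stated precondition before loading anything heavy.
    if not 128 <= len(calib_texts) <= 256:
        raise ValueError("expected 128-256 calibration examples")

    # Imported lazily so this function can be defined without the dependencies installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    # Illustrative settings: 4-bit weights, groups of 128, zero-point quantization.
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Run calibration and quantize the weights in place.
    model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)

    # Write the quantized weights and the tokenizer side by side.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

A call would look like `quantize_with_autoawq("path/to/fp16-model", "path/to/awq-model", texts)`, where both paths are placeholders. Argument names and defaults can differ between releases, so check them against the installed version.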
Failure modes
- accuracy drops if calibration data is insufficient
- inference speed does not improve on unsupported hardware
- VRAM use stays high if batch-quantized
Trust signals
- 4x-8x compression ratios cited
- calibration data requirements specified
- accuracy loss benchmarked