Reduce model memory by 50-75 percent
quantizing-models-bitsandbytes skill setup L2 ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Quantize LLMs to 8-bit or 4-bit formats to reduce memory use
Best for
Fitting 7B+ models on consumer GPUs (8-16 GB VRAM) when an accuracy degradation of under 1% is acceptable
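As a rough illustration of why quantization brings 7B-class models within reach of these cards, weight memory scales linearly with bits per parameter. The sketch below is back-of-the-envelope arithmetic only. The helper names `weight_gb` and `reduction_vs_fp16` are hypothetical, a 16-bit baseline is assumed, and activations, the KV cache, and the small overhead of quantization constants are ignored.

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Approximate weight memory in GB (10^9 bytes) at a given bit width."""
    return n_params * bits / 8 / 1e9


def reduction_vs_fp16(bits: int) -> float:
    """Fractional memory saving relative to 16-bit weights (assumed baseline)."""
    return 1 - bits / 16


if __name__ == "__main__":
    # Weights only: real usage is higher once activations and cache are counted.
    for bits in (16, 8, 4):
        gb = weight_gb(7e9, bits)
        saving = reduction_vs_fp16(bits)
        print(f"7B at {bits}-bit: {gb:.1f} GB (saving {saving:.0%})")
```

For a 7B-parameter model this works out to roughly 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which is where the 50-75% figure comes from under these assumptions.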
Inputs
- · HuggingFace model path
- · target quantization level (8-bit or 4-bit)
- · BitsAndBytesConfig
Outputs
- · quantized model in memory
- · device-mapped tensors
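The inputs and outputs above map onto a short loading routine. The following is a minimal sketch, not the skill's own code: `quant_kwargs` and `load_quantized` are hypothetical helper names, a causal language model is assumed, NF4 is picked as the 4-bit format purely for illustration, and the heavy imports are deferred so the config helper can be inspected on a machine without a GPU.

```python
def quant_kwargs(bits: int) -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig at 8 or 4 bits."""
    if bits == 8:
        # 6.0 is the library's default outlier threshold for 8-bit loading.
        return {"load_in_8bit": True, "llm_int8_threshold": 6.0}
    if bits == 4:
        # Assumption for this sketch: NF4 rather than FP4.
        return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
    raise ValueError("bits must be 8 or 4")


def load_quantized(model_path: str, bits: int = 4):
    """Load a causal LM with quantized weights, placed on devices by accelerate."""
    # Imported lazily: requires transformers, accelerate, and bitsandbytes.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(**quant_kwargs(bits))
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=config,
        device_map="auto",  # let accelerate decide where each layer goes
    )
```

Calling `load_quantized` with a HuggingFace model path returns the quantized model in memory; `model.get_memory_footprint()` can then be used to compare the 8-bit and 4-bit results.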
Requires
- · bitsandbytes
- · transformers
- · accelerate
- · torch
Preconditions
NVIDIA GPU with CUDA; transformers and bitsandbytes installed; sufficient VRAM to load the quantized model
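These preconditions can be checked before attempting a load. The sketch below is one possible approach, with assumptions stated plainly: `has_headroom` and `check_gpu` are hypothetical helpers, the 20% safety margin is an arbitrary illustrative value, and `needed_bytes` is whatever estimate the caller supplies for the quantized model.

```python
def has_headroom(free_bytes: int, needed_bytes: int, margin: float = 0.2) -> bool:
    """True if free memory covers the estimate plus a safety margin."""
    return free_bytes >= needed_bytes * (1 + margin)


def check_gpu(needed_bytes: int, device: int = 0) -> bool:
    """Check that CUDA is available and the device has enough free VRAM."""
    import torch  # deferred so has_headroom can be used without torch installed

    if not torch.cuda.is_available():
        return False
    # mem_get_info returns (free, total) in bytes for the given device.
    free_bytes, _total = torch.cuda.mem_get_info(device)
    return has_headroom(free_bytes, needed_bytes)
```

A check like this only guards against the out-of-memory case at load time; memory used later during generation is not accounted for.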
Failure modes
Accuracy degradation at 4-bit; a miscalibrated llm_int8_threshold causes outlier errors; out-of-memory if VRAM is insufficient
Trust signals
- · Supports both INT8 and NF4/FP4 formats, per the LLM.int8() and QLoRA papers
- · QLoRA enables fine-tuning quantized models
- · 50-75% memory reduction empirically verified