Reduce model memory by 50-75 percent

quantizing-models-bitsandbytesskillsetup L29,423
Orchestra-Research/AI-Research-SKILLs
What it does

Quantize LLMs to 8/4-bit formats for memory reduction

Best for

Fitting 7B+ models on consumer GPUs (8-16GB VRAM) when accuracy tolerance permits <1% degradation

Inputs
  • · HuggingFace model path
  • · target quantization level (8-bit or 4-bit)
  • · BitsAndBytesConfig
Outputs
  • · quantized model in memory
  • · device-mapped tensor
Requires
  • · bitsandbytes
  • · transformers
  • · accelerate
  • · torch
Preconditions

NVIDIA GPU with CUDA; transformers and bitsandbytes installed; sufficient vRAM for loading quantized model

Failure modes

Accuracy degradation at 4-bit; int8_threshold miscalibration causes outlier errors; out-of-memory if vRAM insufficient

Trust signals
  • · Supports both INT8 and NF4/FP4 formats per paper
  • · QLoRA enables fine-tuning quantized models
  • · 50-75% reduction empirically verified