Quantize models for CPU and Apple Silicon

gguf-quantization · skill · setup · L29,423
Orchestra-Research/AI-Research-SKILLs
What it does

Convert and quantize LLMs to GGUF format for CPU/Apple Silicon inference

Best for

Deploying LLMs on consumer hardware (MacBook M1 or later) or on servers without an NVIDIA GPU, where broad hardware support is required

Inputs
  • HuggingFace model path
  • quantization type (Q2_K to Q8_0)
  • optional calibration text for the importance matrix (imatrix)
Outputs
  • model-QUANT.gguf file
  • executable binaries for llama.cpp
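The path from these inputs to the model-QUANT.gguf output can be sketched as a small command builder. This is a minimal sketch, not the skill's own implementation: it assumes a recent llama.cpp checkout in which the converter is named convert_hf_to_gguf.py and the tools are named llama-imatrix and llama-quantize (older releases used different names). The function name build_commands, the f16 intermediate file, and the .imatrix file name are hypothetical choices made for illustration.

```python
from pathlib import Path
from typing import List, Optional

# Illustrative subset of the Q2_K..Q8_0 range named above; llama.cpp supports more types.
QUANT_TYPES = ("Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0")


def build_commands(model_dir: str, quant: str,
                   calibration: Optional[str] = None,
                   llama_cpp_dir: str = "llama.cpp") -> List[List[str]]:
    """Return the convert -> (imatrix) -> quantize command lines without running them."""
    if quant not in QUANT_TYPES:
        raise ValueError(f"unsupported quantization type: {quant}")

    name = Path(model_dir).name
    f16 = f"{name}-f16.gguf"      # intermediate full-precision GGUF (hypothetical name)
    out = f"{name}-{quant}.gguf"  # final file, following the model-QUANT.gguf pattern

    # Step 1: convert the HuggingFace checkpoint to an f16 GGUF file.
    # Assumes the converter script sits at the root of the llama.cpp checkout.
    converter = str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py")
    cmds = [["python", converter, model_dir, "--outfile", f16, "--outtype", "f16"]]

    quantize = ["llama-quantize"]
    if calibration:
        # Step 2 (optional): compute an importance matrix from the calibration text.
        imatrix = f"{name}.imatrix"
        cmds.append(["llama-imatrix", "-m", f16, "-f", calibration, "-o", imatrix])
        quantize += ["--imatrix", imatrix]

    # Step 3: quantize the f16 file to the requested type.
    quantize += [f16, out, quant]
    cmds.append(quantize)
    return cmds
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the commands have been checked against the installed llama.cpp version.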
Requires
  • llama.cpp
  • llama-cpp-python
  • Python
Preconditions

llama.cpp built and in PATH; HuggingFace model downloaded; optional: calibration data for better imatrix
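These preconditions can be checked before any conversion starts. The sketch below is an illustration under stated assumptions: the helper name missing_preconditions is hypothetical, the default tool name llama-quantize assumes a recent llama.cpp build, and the presence of config.json is used as a rough stand-in for "HuggingFace model downloaded".

```python
import shutil
from pathlib import Path
from typing import List, Optional, Sequence


def missing_preconditions(model_dir: str,
                          calibration: Optional[str] = None,
                          tools: Sequence[str] = ("llama-quantize",)) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []

    # llama.cpp built and on PATH: look up each expected executable.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # Model downloaded: assume a HuggingFace checkpoint directory contains config.json.
    if not (Path(model_dir) / "config.json").is_file():
        problems.append(f"no config.json in {model_dir}")

    # Calibration data is optional, but if a path is given it must exist.
    if calibration is not None and not Path(calibration).is_file():
        problems.append(f"calibration file {calibration} not found")

    return problems
```

Returning a list rather than raising on the first failure lets a caller report every missing piece at once.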

Failure modes

Inference may hang if an imatrix-quantized model is run without its imatrix; Q2_K causes severe accuracy loss; .gguf files can be incompatible between llama.cpp versions
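For the version-incompatibility case, one cheap diagnostic is to read the GGUF header before loading the file. A GGUF file begins with the four bytes "GGUF" followed by a 32-bit format version. The sketch below assumes a little-endian file, which is the common case; the function name read_gguf_version is hypothetical, and which format versions a given llama.cpp build accepts is not something this check can tell you.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version stored in the file header."""
    with open(path, "rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte version

    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")

    # Version is an unsigned 32-bit integer; little-endian is assumed here.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Logging this number alongside the llama.cpp build in use makes a format mismatch easier to spot than a generic load failure.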

Trust signals
  • llama.cpp is the de facto standard for GGUF
  • Apple Silicon Metal acceleration is built in
  • K-quants (Q4_K_M, Q5_K_M) are recommended by the llama.cpp project