Quantize models for CPU and Apple Silicon
gguf-quantization · skill · setup · L2 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Convert and quantize LLMs to GGUF format for CPU/Apple Silicon inference
Best for
Deploying LLMs on consumer hardware (e.g. M1 or later MacBooks) or on servers without an NVIDIA GPU, where broad hardware support is required
Inputs
- · HuggingFace model path
- · quantization type (Q2_K to Q8_0)
- · optional calibration text for importance matrix
Outputs
- · model-QUANT.gguf file
- · executable binaries for llama.cpp
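The inputs and outputs above map onto a typical llama.cpp workflow: convert the HuggingFace checkpoint to an F16 GGUF, optionally compute an importance matrix from calibration text, then quantize. Below is a minimal Python sketch that only assembles the command lines and does not run them. The helper name `build_quantize_commands`, the `llama_cpp_dir` default, and the intermediate file names are assumptions for illustration; script and flag names may differ between llama.cpp versions.

```python
from pathlib import Path


def build_quantize_commands(model_dir, quant_type="Q4_K_M",
                            calibration_file=None, llama_cpp_dir="llama.cpp"):
    """Return argument lists for the HF -> F16 GGUF -> quantized GGUF steps.

    Hypothetical helper: all paths and file names here are illustrative.
    """
    model_dir = Path(model_dir)
    name = model_dir.name
    f16_path = f"{name}-F16.gguf"          # unquantized intermediate file
    out_path = f"{name}-{quant_type}.gguf"  # the model-QUANT.gguf output

    commands = [
        # Step 1: convert the HuggingFace checkpoint to an F16 GGUF.
        ["python", (Path(llama_cpp_dir) / "convert_hf_to_gguf.py").as_posix(),
         model_dir.as_posix(), "--outfile", f16_path, "--outtype", "f16"],
    ]

    quantize = ["llama-quantize"]
    if calibration_file is not None:
        # Step 2 (optional): compute an importance matrix from calibration text.
        imatrix_path = f"{name}.imatrix"
        commands.append(["llama-imatrix", "-m", f16_path,
                         "-f", str(calibration_file), "-o", imatrix_path])
        quantize += ["--imatrix", imatrix_path]

    # Step 3: quantize the F16 file to the requested type.
    quantize += [f16_path, out_path, quant_type]
    commands.append(quantize)
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the commands have been checked against the installed llama.cpp version.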
Requires
- · llama.cpp
- · llama-cpp-python
- · Python
Preconditions
llama.cpp built and its binaries on PATH; HuggingFace model downloaded locally; optionally, calibration text for computing an importance matrix (imatrix)
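These preconditions can be checked before starting a long conversion. The sketch below is one possible pre-flight check, not part of the skill itself: the function name `check_preconditions` is hypothetical, and treating the presence of `config.json` as evidence of a downloaded HuggingFace model is a heuristic assumption.

```python
import shutil
from pathlib import Path


def check_preconditions(model_dir, calibration_file=None, which=shutil.which):
    """Return a list of problems; an empty list means the checks passed.

    `which` is injectable so the PATH lookup can be replaced in tests.
    """
    problems = []

    # llama.cpp must be built and its quantize binary reachable via PATH.
    if which("llama-quantize") is None:
        problems.append("llama-quantize not found in PATH")

    # Heuristic (assumption): a downloaded HF model directory has config.json.
    if not (Path(model_dir) / "config.json").is_file():
        problems.append(f"no config.json under {model_dir}")

    # Calibration text is optional, but if given it must exist.
    if calibration_file is not None and not Path(calibration_file).is_file():
        problems.append(f"calibration file missing: {calibration_file}")

    return problems
```

Running `check_preconditions("path/to/model")` and printing the returned list gives a quick summary of what still needs setting up.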
Failure modes
Inference may hang if an imatrix-quantized model is run without its imatrix; Q2_K causes severe accuracy loss; .gguf files may be incompatible between llama.cpp versions
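When a .gguf file fails to load, one quick diagnostic is to read the format version from the file header. This sketch assumes the usual little-endian layout, in which a GGUF file begins with the 4-byte magic `GGUF` followed by an unsigned 32-bit version number. The function names are hypothetical, and a matching header version does not rule out other incompatibilities, such as quantization types or metadata keys that a given llama.cpp build does not support.

```python
import struct


def parse_gguf_version(header):
    """Return the GGUF format version from the first 8 bytes of a file."""
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError("not a GGUF file: missing 'GGUF' magic")
    # Assumes a little-endian file; the version is an unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def read_gguf_version(path):
    """Read only the header bytes from disk and parse the version."""
    with open(path, "rb") as f:
        return parse_gguf_version(f.read(8))
```

Comparing the result of `read_gguf_version("model-QUANT.gguf")` against what the installed llama.cpp build supports can help narrow down whether a load failure is a format-version mismatch.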
Trust signals
- · llama.cpp is de facto standard for GGUF
- · Apple Silicon Metal acceleration built-in
- · K-quants (Q4_K_M, Q5_K_M) endorsed by Llama team