Deploy LLMs on consumer hardware
llama-cppskillsetup L3★9,423
Orchestra-Research/AI-Research-SKILLs ↗What it does
Run LLM inference on CPU and non-NVIDIA hardware with quantization
Best for
Edge deployment and local inference on Apple Silicon and non-NVIDIA hardware without Docker complexity.
Inputs
- · GGUF-quantized model file
- · context window size
- · hardware accelerator flag (Metal/ROCm/CUDA)
Outputs
- · token stream (stdout or server)
- · OpenAI-compatible API responses
- · inference metrics
Requires
- · llama.cpp binary
- · GGUF model repository
- · optional: Metal (macOS) / ROCm (AMD) / CUDA (NVIDIA)
Preconditions
Model must be GGUF format; CPU fallback always available; sufficient RAM for quantized model
Failure modes
- · Quantization artifacts at very low bits (Q2_K)
- · Slow inference on CPU without GPU offload
- · Context length exceeded
- · Memory overflow on embedded devices
Trust signals
- · Pure C/C++ no dependencies
- · Metal acceleration on M1/M2/M3
- · OpenAI API compatibility