Optimize performance with SIMD and GPU kernels
perf-dev · subagent · setup · L4 · ★393
quantumaikr/quant.cpp
What it does
Optimizes CPU kernels with SIMD (ARM NEON, x86 AVX2) and GPU kernels with CUDA or Metal
Best for
Quantization and numerical compute kernels where a measured speedup with verified correctness matters more than speculative optimization
Inputs
- · Generic C implementation baseline
- · Target platform (ARM, x86, CUDA, Metal)
- · Performance bottleneck identification
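To make the first two inputs concrete, here is a minimal sketch of a generic C baseline next to a SIMD variant selected at compile time. The kernel (a float dot product) and the names `dot_generic` and `dot_simd` are hypothetical illustrations, not code from quant.cpp. The SIMD paths only compile when the compiler defines `__AVX2__` or `__ARM_NEON`; otherwise `dot_simd` falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__AVX2__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Generic baseline (hypothetical name): plain scalar dot product. */
float dot_generic(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* SIMD variant (hypothetical name): same result up to float rounding,
 * because the vector path accumulates in a different order. */
float dot_simd(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX2__)
    /* Process 8 floats per iteration with unaligned loads. */
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++) {
        sum += lanes[k];
    }
#elif defined(__ARM_NEON)
    /* Process 4 floats per iteration with multiply-accumulate. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float lanes[4];
    vst1q_f32(lanes, acc);
    for (int k = 0; k < 4; k++) {
        sum += lanes[k];
    }
#endif
    /* Scalar tail; also the whole loop when no SIMD path is compiled in. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Keeping the scalar tail inside the SIMD function means any length `n` is handled, and the same source builds on every target listed above.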
Outputs
- · Optimized SIMD/GPU kernel code
- · Benchmark speedup metrics
- · Verification that the output matches the baseline (bit-exact or within a stated tolerance)
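The verification output could be produced by a check along these lines. This is a sketch under assumptions: the function names `outputs_bit_exact` and `first_mismatch` are hypothetical, and the combined absolute/relative tolerance formula is one common choice rather than the one this skill prescribes.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when both buffers are byte-for-byte identical, else 0. */
int outputs_bit_exact(const float *ref, const float *out, size_t n) {
    return memcmp(ref, out, n * sizeof(float)) == 0;
}

/* Absolute value without pulling in <math.h>. */
static float abs_f32(float x) {
    return x < 0.0f ? -x : x;
}

/* Returns the index of the first element where
 *   |ref[i] - out[i]| > abs_tol + rel_tol * |ref[i]|
 * or -1 when every element is within tolerance.
 * A NaN difference fails the comparison, so it is reported as a mismatch. */
long first_mismatch(const float *ref, const float *out, size_t n,
                    float abs_tol, float rel_tol) {
    for (size_t i = 0; i < n; i++) {
        float diff = abs_f32(ref[i] - out[i]);
        float limit = abs_tol + rel_tol * abs_f32(ref[i]);
        if (!(diff <= limit)) {
            return (long)i;
        }
    }
    return -1;
}
```

Reporting the first failing index, rather than only pass/fail, makes it easier to tell a tail-handling bug (mismatch near the end) from a rounding difference spread across the buffer.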
Requires
- · ARM NEON
- · x86 AVX2
- · CUDA
- · Metal
- · Benchmark suite
Preconditions
Generic baseline measured; bottleneck identified; reference implementations available
Failure modes
SIMD output diverges from generic; GPU memory limits exceeded; platform-specific bugs; benchmark methodology flawed
Trust signals
- · Measure generic first, optimize after
- · Bit-exact or tolerance-validated against baseline
- · Reference implementations from llama.cpp and vLLM are cited
- · Speedup numbers from actual benchmarks
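A "measure first" workflow might use a small timing harness like the sketch below. Everything here is an assumption for illustration: the names `best_seconds`, `speedup`, and `example_kernel` are hypothetical, and `clock()` measures CPU time, which is only a reasonable proxy for single-threaded kernels. A real benchmark suite would likely use a monotonic wall-clock timer and warm-up runs.

```c
#include <stddef.h>
#include <time.h>

/* Signature shared by the baseline and the optimized kernel. */
typedef float (*kernel_fn)(const float *, const float *, size_t);

/* Stand-in kernel so the harness can be exercised on its own. */
float example_kernel(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Volatile sink: storing each result here discourages the compiler
 * from discarding the kernel calls as dead code. */
static volatile float bench_sink;

/* Runs `iters` calls per repeat and returns the best (smallest)
 * average seconds per call over `repeats` repeats. */
double best_seconds(kernel_fn fn, const float *a, const float *b, size_t n,
                    int repeats, int iters) {
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t t0 = clock();
        for (int k = 0; k < iters; k++) {
            bench_sink = fn(a, b, n);
        }
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / (double)CLOCKS_PER_SEC / (double)iters;
        if (best < 0.0 || s < best) {
            best = s;
        }
    }
    return best;
}

/* Speedup of the optimized kernel over the baseline; returns 0 when the
 * optimized time is too small to measure instead of dividing by zero. */
double speedup(double baseline_seconds, double optimized_seconds) {
    return optimized_seconds > 0.0 ? baseline_seconds / optimized_seconds : 0.0;
}
```

Taking the best of several repeats reduces noise from scheduling and frequency scaling. Timing the generic kernel with the same harness before touching any SIMD code gives the baseline that later speedup numbers are measured against.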