cyberneticlibrary

Optimize performance with SIMD and GPU kernels

perf-dev | subagent | setup | L4393
quantumaikr/quant.cpp
What it does

Optimizes CPU kernels with SIMD (ARM NEON, x86 AVX2) and GPU kernels with CUDA or Metal, starting from a generic C baseline.
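As a rough illustration of the CPU side, the sketch below pairs a generic C kernel with a SIMD variant chosen at compile time. The dot-product workload and the names `dot_generic` and `dot_simd` are assumptions made for this example, not code taken from quant.cpp. On targets with neither NEON nor AVX2, the SIMD function simply runs the scalar tail loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__AVX2__)
#include <immintrin.h>
#endif

/* Generic C baseline (hypothetical example kernel): the reference that
   every optimized path is checked against. */
float dot_generic(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}

/* SIMD variant selected at compile time. Elements that do not fill a whole
   vector are handled by the scalar tail loop at the end. */
float dot_simd(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    size_t i = 0;
    float sum = 0.0f;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    /* Horizontal sum of the four lanes. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum = vget_lane_f32(vpadd_f32(half, half), 0);
#elif defined(__AVX2__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    /* Horizontal sum of the eight lanes via a small stack buffer. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++) sum += lanes[k];
#endif
    /* Scalar tail; also the whole computation when no SIMD path is compiled. */
    for (; i < n; i++) sum += a[i] * b[i];
    return sum;
}
```

The SIMD paths add products in a different order than the baseline, so floating-point results can differ in the last bits. That is why the comparison against the baseline is either bit-exact or tolerance-based, depending on the kernel.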

Best for

Quantization and numerical compute work where a measured speedup with verified correctness matters more than speculative optimization

Inputs
  • Generic C implementation baseline
  • Target platform (ARM, x86, CUDA, Metal)
  • Performance bottleneck identification
Outputs
  • Optimized SIMD/GPU kernel code
  • Benchmark speedup metrics
  • Verification that the output matches the baseline (bit-exact or within a stated tolerance)
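The verification output can be produced by a small comparison helper like the one sketched here. The function names, the `abs_tol + rel_tol * |ref|` acceptance rule, and the choice to treat NaN as a failure are assumptions made for illustration. The source only states that results must be bit-exact or within tolerance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 when the two buffers are byte-identical (bit-exact match).
   Note that +0.0f and -0.0f differ under this test. */
int outputs_bit_exact(const float *ref, const float *out, size_t n) {
    assert(ref != NULL && out != NULL);
    return memcmp(ref, out, n * sizeof(float)) == 0;
}

/* Returns the index of the first element outside tolerance, or -1 if all
   elements pass. An element passes when
   |out - ref| <= abs_tol + rel_tol * |ref|. A NaN difference fails. */
long first_mismatch(const float *ref, const float *out, size_t n,
                    float abs_tol, float rel_tol) {
    assert(ref != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        float diff = out[i] - ref[i];
        if (diff < 0.0f) diff = -diff;
        float mag = ref[i] < 0.0f ? -ref[i] : ref[i];
        float limit = abs_tol + rel_tol * mag;
        if (!(diff <= limit)) return (long)i;  /* negated test catches NaN */
    }
    return -1;
}
```

Reporting the first failing index rather than a plain pass/fail flag tends to make platform-specific divergence easier to localize, for example to a vector tail or an alignment boundary.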
Requires
  • ARM NEON
  • x86 AVX2
  • CUDA
  • Metal
  • Benchmark suite
Preconditions

Generic baseline measured; bottleneck identified; reference implementations available

Failure modes

SIMD output diverges from generic; GPU memory limits exceeded; platform-specific bugs; benchmark methodology flawed

Trust signals
  • Measure the generic baseline first, optimize after
  • Output validated against the baseline, bit-exact or within tolerance
  • Reference implementations from llama.cpp and vLLM are cited
  • Speedup numbers come from actual benchmarks
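A minimal sketch of the measure-first workflow is given here: time the generic kernel, then time the candidate with the same harness, and report the ratio. The names `time_kernel`, `speedup`, and `dot_baseline` are hypothetical, and the stand-in kernel exists only so the example compiles on its own. `clock()` measures CPU time at coarse resolution, so real measurements would likely use larger inputs, more repetitions, or a higher-resolution platform timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef float (*kernel_fn)(const float *a, const float *b, size_t n);

/* Stand-in baseline kernel (hypothetical); replace with the real generic
   and optimized kernels under test. */
float dot_baseline(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}

/* Best-of-`reps` CPU time in seconds for a single kernel call. The last
   result is written to *result (if non-NULL) so it can be verified too. */
double time_kernel(kernel_fn fn, const float *a, const float *b, size_t n,
                   int reps, float *result) {
    assert(fn != NULL && reps > 0);
    double best = -1.0;
    volatile float sink = 0.0f;  /* keeps the call from being optimized away */
    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        sink = fn(a, b, n);
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best) best = s;
    }
    if (result != NULL) *result = sink;
    return best;
}

/* Ratio of baseline time to optimized time. Returns 0 when the optimized
   time is below clock resolution, since no meaningful ratio exists then. */
double speedup(double baseline_s, double optimized_s) {
    return optimized_s > 0.0 ? baseline_s / optimized_s : 0.0;
}
```

Taking the minimum over several repetitions is one common way to reduce noise from scheduling and cache warm-up. Whichever statistic is used, it should be the same for the baseline and the optimized kernel.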