Optimize performance with SIMD and GPU kernels

perf-dev subagent setup L4 ★393

What it does

Optimizes CPU kernels with SIMD (ARM NEON, AVX2) and GPU kernels with CUDA or Metal

Best for

Quantization and numerical compute kernels, where a measured speedup with verified correctness matters more than speculative optimization

Inputs

Outputs

Requires

Preconditions

Generic baseline measured; bottleneck identified; reference implementations available
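
One way to satisfy the "generic baseline measured" precondition is sketched below. It assumes a pure-Python stand-in for the generic kernel; the names benchmark, speedup, and generic_dot are hypothetical and not part of this card. The sketch runs warm-up iterations first and reports the median of several timed runs, so a single noisy sample does not skew the baseline.

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, repeats=11):
    """Return the median wall-clock time in seconds of fn(*args)."""
    # Warm-up runs are discarded so one-time costs do not pollute the samples.
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """Ratio above 1.0 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


def generic_dot(a, b):
    """Hypothetical generic (scalar) kernel used as the baseline."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


a = [float(i) for i in range(4096)]
b = [1.0] * 4096
baseline_seconds = benchmark(generic_dot, a, b)
print(f"generic baseline: {baseline_seconds * 1e6:.1f} us (median of 11 runs)")
```

The same benchmark helper would be called on the optimized kernel with identical inputs, and speedup(baseline_seconds, candidate_seconds) gives the figure to report.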

Failure modes

SIMD output diverges from generic; GPU memory limits exceeded; platform-specific bugs; benchmark methodology flawed
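
The first failure mode can be caught with an element-by-element comparison against the generic implementation. The sketch below is illustrative only: quantize_reference, quantize_candidate, and compare_outputs are hypothetical names, and the candidate is a pure-Python stand-in that rounds ties upward, one plausible way an optimized int8 quantizer could drift from a generic one that rounds ties to even.

```python
import math


def quantize_reference(values, scale):
    """Generic int8 quantizer: round half to even, then clamp to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def quantize_candidate(values, scale):
    """Stand-in for an optimized kernel that rounds half up instead."""
    return [max(-128, min(127, math.floor(v / scale + 0.5))) for v in values]


def compare_outputs(reference, candidate, atol=0):
    """Return (index, reference, candidate) for each element differing by more than atol."""
    if len(reference) != len(candidate):
        raise ValueError("output lengths differ")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, candidate))
        if abs(r - c) > atol
    ]


inputs = [0.1, 0.25, -0.25, 0.75, 1.0, 100.0]
scale = 0.5
mismatches = compare_outputs(
    quantize_reference(inputs, scale),
    quantize_candidate(inputs, scale),
)
for index, ref, cand in mismatches:
    print(f"index {index}: reference={ref} candidate={cand}")
```

With these inputs only the tie at index 1 (0.25 / 0.5 = 0.5) is flagged: the reference rounds it to 0 and the candidate to 1. For floating-point kernels, a nonzero atol would usually replace the exact comparison.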

Trust signals