Speed up transformer inference 2-4x

optimizing-attention-flash
Orchestra-Research/AI-Research-SKILLs
What it does

Accelerate transformer attention 2-4x with Flash Attention on NVIDIA GPUs

Best for

Long-context LLM inference and training where attention is the bottleneck and a 2-4x speedup with 10-20x memory savings matters

Inputs
  • query/key/value tensors (batch, seqlen, nheads, headdim)
  • dropout_p, causal flag, window_size
Outputs
  • attention output tensor with the same shape as the query tensor (batch, seqlen, nheads, headdim)
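
The inputs and outputs above can be sketched as a single call. This is a minimal sketch, not the skill's own implementation: the wrapper name `attention` is hypothetical, and the fallback to PyTorch's `scaled_dot_product_attention` when flash-attn is unavailable (or the tensors are not fp16/bf16 on a CUDA device) is an assumption added for illustration.

```python
import torch
import torch.nn.functional as F

try:
    # Optional dependency; only usable on a CUDA GPU with fp16/bf16 tensors.
    from flash_attn import flash_attn_func
except ImportError:
    flash_attn_func = None


def attention(q, k, v, causal=True, dropout_p=0.0):
    """Hypothetical wrapper. q, k, v: (batch, seqlen, nheads, headdim).

    Returns a tensor with the same shape as q.
    """
    use_flash = (
        flash_attn_func is not None
        and q.is_cuda
        and q.dtype in (torch.float16, torch.bfloat16)
    )
    if use_flash:
        # flash-attn consumes the (batch, seqlen, nheads, headdim) layout directly.
        return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)

    # Fallback: PyTorch SDPA expects (batch, nheads, seqlen, headdim),
    # so swap the seqlen and nheads axes on the way in and out.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2),
        k.transpose(1, 2),
        v.transpose(1, 2),
        dropout_p=dropout_p,
        is_causal=causal,
    )
    return out.transpose(1, 2)


if __name__ == "__main__":
    q = torch.randn(2, 128, 4, 64)
    k = torch.randn(2, 128, 4, 64)
    v = torch.randn(2, 128, 4, 64)
    print(attention(q, k, v).shape)
```

Keeping the layout conversion inside the wrapper means callers can stay in the (batch, seqlen, nheads, headdim) convention regardless of which backend runs.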
Requires
  • flash-attn
  • torch>=2.2
  • transformers
  • NVIDIA GPU with CUDA
Preconditions

PyTorch 2.2+; CUDA 12.0+ for the flash-attn library; sequence length >512 to see a benefit; float16 or bfloat16 inputs (FP8 only where the installed flash-attn version supports it)
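
The preconditions above can be checked before switching backends. The sketch below is illustrative only: the function names are hypothetical, the thresholds are copied from this card, and treating float16 and bfloat16 as the accepted dtypes is an assumption about the flash-attn kernels rather than something this card spells out.

```python
def parse_version(text):
    """Turn a version string such as '2.3.1+cu121' into (2, 3)."""
    parts = text.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def flash_attn_preconditions(torch_version, cuda_version, seqlen, dtype_name):
    """Return a list of unmet preconditions (empty list means all are met).

    Hypothetical helper; thresholds follow the Preconditions entry above.
    """
    problems = []
    if parse_version(torch_version) < (2, 2):
        problems.append("PyTorch 2.2+ required")
    if cuda_version is None or parse_version(cuda_version) < (12, 0):
        problems.append("CUDA 12.0+ required for flash-attn")
    # Assumption: fp16 and bf16 are the accepted input dtypes.
    if dtype_name not in ("float16", "bfloat16"):
        problems.append("use float16 or bfloat16 inputs")
    if seqlen <= 512:
        problems.append("sequence length <= 512: little benefit expected")
    return problems


if __name__ == "__main__":
    print(flash_attn_preconditions("2.3.1+cu121", "12.1", 4096, "float16"))
    print(flash_attn_preconditions("2.1.0", None, 128, "float32"))
```

In a real session the arguments would typically come from `torch.__version__`, `torch.version.cuda` (which is `None` on CPU-only builds), the sequence dimension of the query tensor, and the tensor's dtype name.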

Failure modes

Small numerical drift relative to standard attention (typically <1e-3); a misconfigured window_size breaks causal masking; FP8 on H100 is not supported in older flash-attn versions
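
To make the window_size failure mode concrete, here is a small pure-Python sketch of which key positions a query may attend to. It assumes the semantics described in the flash-attn docstring: `window_size=(left, right)` lets query `i` see keys in `[i + seqlen_k - seqlen_q - left, i + seqlen_k - seqlen_q + right]`, a value of `-1` means unbounded, and `causal=True` forces `right` to 0. The helper names are hypothetical and this is not the library's code.

```python
def allowed_keys(i, seqlen_q, seqlen_k, window_size=(-1, -1), causal=False):
    """Key indices visible to query i under the assumed window semantics."""
    left, right = window_size
    if causal:
        right = 0  # assumed: causal masking overrides the right window
    center = i + seqlen_k - seqlen_q  # key position aligned with query i
    lo = 0 if left < 0 else max(0, center - left)
    hi = seqlen_k - 1 if right < 0 else min(seqlen_k - 1, center + right)
    return list(range(lo, hi + 1))


def leaks_future(i, seqlen_q, seqlen_k, window_size=(-1, -1), causal=False):
    """True if query i can see any key after its own aligned position."""
    center = i + seqlen_k - seqlen_q
    keys = allowed_keys(i, seqlen_q, seqlen_k, window_size, causal)
    return any(j > center for j in keys)


if __name__ == "__main__":
    print(allowed_keys(3, 6, 6, window_size=(2, 0)))   # [1, 2, 3]
    print(allowed_keys(3, 6, 6, window_size=(2, 1)))   # [1, 2, 3, 4]
    print(leaks_future(3, 6, 6, window_size=(2, 1)))   # True
    print(leaks_future(3, 6, 6, window_size=(2, 1), causal=True))  # False
```

Under these assumptions, a positive right window without `causal=True` lets a query see future tokens, which is the kind of misconfiguration the entry above warns about.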

Trust signals
  • Flash-Attention papers (Dao et al.) published at top venues
  • PyTorch native SDPA backend validates algorithm
  • 10-20x memory reduction empirically shown