Speed up transformer inference 2-4x

optimizing-attention-flash · skill · setup · L2 · ★ 9,423

What it does

Accelerate transformer attention 2-4x with Flash Attention on NVIDIA GPUs

Best for

Long-context LLM inference and training where attention is the bottleneck and a 2-4x speedup with 10-20x lower memory use matters
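To see why long contexts make attention memory the bottleneck, here is a rough sketch of the arithmetic. Standard attention materializes a seq_len x seq_len score matrix per head, which Flash Attention avoids storing. The helper name `attention_scores_bytes` and the example shapes (1 sequence, 32 heads, 2-byte elements) are illustrative assumptions, not values from this card, and the sketch does not reproduce the 10-20x figure.

```python
def attention_scores_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Bytes needed to hold the full (seq_len x seq_len) score matrix.

    Hypothetical helper: counts only the attention scores that a
    standard implementation materializes, not Q, K, V or the output.
    """
    return batch * heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    # Assumed example shape: batch of 1, 32 heads, float16 (2 bytes).
    for n in (512, 4096, 32768):
        gib = attention_scores_bytes(1, 32, n) / 2**30
        print(f"seq_len={n:>6}: {gib:.3f} GiB for attention scores")
```

The quadratic term is the point: doubling the sequence length quadruples this buffer, while Q, K, and V grow only linearly.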

Inputs

Outputs

Requires

Preconditions

PyTorch 2.2+; CUDA 12.0+ for the flash-attn library; sequence length above 512 to see a benefit; inputs in float16 or a lower-precision dtype
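The preconditions above can be checked before enabling the fast path. The following is a minimal sketch that works on plain strings and integers, so it runs without a GPU. The function name `unmet_preconditions` is hypothetical, and treating bfloat16 as an acceptable "lower precision" dtype is an assumption rather than something this card states.

```python
import re

MIN_TORCH = (2, 2)    # PyTorch 2.2+
MIN_CUDA = (12, 0)    # CUDA 12.0+ for the flash-attn library
MIN_SEQ_LEN = 512     # benefit expected only above this length
# Assumption: float16, plus bfloat16 as an example of "lower precision".
ACCEPTED_DTYPES = {"float16", "bfloat16"}


def _major_minor(version):
    """Parse '2.3.1+cu121' -> (2, 3); return None if unparseable or None."""
    match = re.match(r"(\d+)\.(\d+)", version or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def unmet_preconditions(torch_version, cuda_version, seq_len, dtype_name):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    torch_v = _major_minor(torch_version)
    if torch_v is None or torch_v < MIN_TORCH:
        problems.append("PyTorch 2.2+ required")
    cuda_v = _major_minor(cuda_version)
    if cuda_v is None or cuda_v < MIN_CUDA:
        problems.append("CUDA 12.0+ required for flash-attn")
    if seq_len <= MIN_SEQ_LEN:
        problems.append("sequence length should exceed 512 to benefit")
    if dtype_name not in ACCEPTED_DTYPES:
        problems.append("inputs should be float16 or lower precision")
    return problems


if __name__ == "__main__":
    print(unmet_preconditions("2.3.1+cu121", "12.1", 4096, "float16"))
    print(unmet_preconditions("2.1.0", None, 256, "float32"))
```

In a live session the first two arguments could come from `torch.__version__` and `torch.version.cuda`; the latter is `None` on CPU-only builds, which the parser treats as a failed check.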

Failure modes

Outputs drift numerically from standard attention (differences below 1e-3); a misconfigured window_size breaks causal masking; FP8 on H100 is not supported in older flash-attn releases
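The window_size failure mode is easiest to see by listing which key positions a query can attend to. This sketch assumes the (left, right) convention in which a query at position i sees keys in [i - left, i + right] and -1 means unbounded on that side. It models the window alone for equal-length query and key sequences and ignores any separate causal flag. The helper names are hypothetical.

```python
def visible_keys(query_pos, seq_len, window_size=(-1, -1)):
    """Key positions a query may attend to under a (left, right) window.

    Assumed convention: -1 means no limit on that side.
    """
    left, right = window_size
    lo = 0 if left < 0 else max(0, query_pos - left)
    hi = seq_len - 1 if right < 0 else min(seq_len - 1, query_pos + right)
    return list(range(lo, hi + 1))


def leaks_future(seq_len, window_size):
    """True if any query can see a key at a later position than itself."""
    return any(
        key > query
        for query in range(seq_len)
        for key in visible_keys(query, seq_len, window_size)
    )


if __name__ == "__main__":
    print(visible_keys(3, 8, (2, 0)))   # keys within 2 steps back, none ahead
    print(leaks_future(8, (2, 0)))      # right extent 0: no future keys
    print(leaks_future(8, (2, 2)))      # right extent 2: future keys visible
```

Under this convention, a window whose right extent is anything other than 0 admits future positions, so a check like `leaks_future` in a unit test can catch the misconfiguration before it reaches a causal model.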

Trust signals