Speed up transformer inference 2-4x
optimizing-attention-flash · skill · setup L2 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Accelerate transformer attention 2-4x with Flash Attention on NVIDIA GPUs
Best for
Long-context LLM inference and training where attention is the bottleneck and a 2-4x speedup with 10-20x memory savings matters
Inputs
- · query/key/value tensors (batch, seqlen, nheads, headdim)
- · dropout_p, causal flag, window_size
Outputs
- · attention output tensor with the same shape as the query tensor (batch, seqlen, nheads, headdim)
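The inputs and outputs listed above map onto a call roughly like the following. This is a minimal sketch, not the skill's own code: the wrapper name `attend` is hypothetical, it assumes query and key sequences of equal length, and it falls back to PyTorch's `scaled_dot_product_attention` when flash-attn is missing or the tensors are not fp16/bf16 CUDA tensors. Imports are done inside the function so the file still loads on a machine without the GPU stack.

```python
def attend(q, k, v, causal=True, dropout_p=0.0):
    """Attention over (batch, seqlen, nheads, headdim) tensors.

    Hypothetical wrapper: uses flash_attn_func when available and applicable,
    otherwise PyTorch SDPA. Assumes q and k have the same sequence length.
    """
    import torch
    import torch.nn.functional as F

    try:
        from flash_attn import flash_attn_func
    except ImportError:
        flash_attn_func = None

    use_flash = (
        flash_attn_func is not None
        and q.is_cuda
        and q.dtype in (torch.float16, torch.bfloat16)
    )
    if use_flash:
        # flash-attn takes (batch, seqlen, nheads, headdim) directly.
        return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)

    # SDPA expects (batch, nheads, seqlen, headdim), so swap dims 1 and 2,
    # then swap back so the result matches the query layout.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2),
        k.transpose(1, 2),
        v.transpose(1, 2),
        dropout_p=dropout_p,
        is_causal=causal,
    )
    return out.transpose(1, 2)
```

In either path the returned tensor has the query's shape, so a caller can switch backends without reshaping.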
Requires
- · flash-attn
- · torch>=2.2
- · transformers
- · NVIDIA GPU with CUDA
Preconditions
PyTorch 2.2+; CUDA 12.0+ for the flash-attn library; sequence length >512 to see a benefit; float16 or bfloat16 inputs (FP8 only where the installed flash-attn supports it)
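The preconditions can be checked up front before picking a backend. The helper below is an illustrative sketch only: `unmet_preconditions` and `_major_minor` are hypothetical names, the thresholds are copied from the line above, and it assumes that float16 and bfloat16 are the accepted input dtypes.

```python
def _major_minor(version):
    """Turn a version string such as '2.2.1+cu121' into (2, 2)."""
    core = version.split("+")[0]
    parts = core.split(".") + ["0"]  # tolerate a bare major version like '12'
    return int(parts[0]), int(parts[1])


def unmet_preconditions(torch_version, cuda_version, seqlen, dtype_name):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    if _major_minor(torch_version) < (2, 2):
        problems.append("PyTorch 2.2+ required")
    if cuda_version is None or _major_minor(cuda_version) < (12, 0):
        problems.append("CUDA 12.0+ required for flash-attn")
    if seqlen <= 512:
        problems.append("sequence length <= 512: little benefit expected")
    # Assumption: only these two dtypes count as acceptable here.
    if dtype_name not in ("float16", "bfloat16"):
        problems.append("inputs should be float16 or bfloat16")
    return problems
```

In a real session the first two arguments could come from `torch.__version__` and `torch.version.cuda`; the latter is `None` on CPU-only builds, which the helper reports as an unmet CUDA requirement.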
Failure modes
Numerical drift versus standard attention, typically <1e-3; a misconfigured window_size breaks causal masking; FP8 on H100 is not supported in older flash-attn releases
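To make the window_size failure mode concrete, the sketch below builds the visibility mask in plain Python. It assumes the convention described in flash-attn's documentation: `window_size=(left, right)`, where -1 means unbounded and a query at position i sees keys in [i - left, i + right] when query and key lengths match. It also assumes that `causal=True` forces the right window to 0. The function names are hypothetical.

```python
def visible_keys(seqlen, window_size=(-1, -1), causal=False):
    """mask[i][j] is True when query i may attend to key j."""
    left, right = window_size
    if causal:
        right = 0  # assumption: causal attention never looks ahead
    mask = []
    for i in range(seqlen):
        row = []
        for j in range(seqlen):
            ok_left = left < 0 or j >= i - left
            ok_right = right < 0 or j <= i + right
            row.append(ok_left and ok_right)
        mask.append(row)
    return mask


def leaks_future(mask):
    """True if any query can see a key at a later position."""
    n = len(mask)
    return any(mask[i][j] for i in range(n) for j in range(i + 1, n))
```

Under these assumptions, a window of `(2, 1)` without the causal flag lets each token see one future token, which is the kind of silent misconfiguration the failure mode refers to; `leaks_future` flags it.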
Trust signals
- · Flash-Attention papers (Dao et al.) published at top venues
- · PyTorch native SDPA backend validates algorithm
- · 10-20x memory reduction empirically shown
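For intuition on where the memory pressure comes from, the arithmetic below sizes the full attention score matrix that standard attention materializes and Flash Attention avoids storing. The configuration (batch 1, 32 heads, 8192 tokens, 2 bytes per element) is a made-up example, and the calculation does not reproduce the 10-20x figure, which also depends on the rest of the model.

```python
def score_matrix_bytes(batch, nheads, seqlen, bytes_per_elem=2):
    """Bytes needed to hold one (seqlen x seqlen) score matrix per head."""
    return batch * nheads * seqlen * seqlen * bytes_per_elem


# Hypothetical configuration: fp16 scores for a single 8192-token sequence.
gib = score_matrix_bytes(batch=1, nheads=32, seqlen=8192) / 2**30
print(f"{gib:.1f} GiB")  # 4.0 GiB
```

Because the size grows with the square of the sequence length, doubling the context quadruples this buffer.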