cyberneticlibrary

Accelerate inference with speculative decoding

Orchestra-Research/AI-Research-SKILLs
What it does

Accelerates LLM inference by letting a small draft model propose tokens that a large verifier model then accepts or rejects.

Best for

Latency-sensitive generation where a small draft model, validated by a large verifier, can give a 2-8x inference speedup.

Inputs
  • · draft model (small)
  • · verifier model (large)
  • · prompt
Outputs
  • · generated text
  • · count of draft tokens validated by the verifier
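A minimal sketch of the draft-then-verify loop that turns these inputs into these outputs. It is a greedy toy, not the skill's actual implementation: each "model" is assumed to be a plain function from a token prefix to its next token, and the names `speculative_greedy`, `draft`, `verifier`, and `k` are hypothetical.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_greedy(draft: NextToken, verifier: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int, int]:
    """Return (tokens, drafted, accepted) for greedy draft-then-verify decoding."""
    tokens = list(prompt)
    drafted = accepted = 0
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The draft model proposes up to k tokens autoregressively.
        n = min(k, target_len - len(tokens))
        proposal: List[int] = []
        for _ in range(n):
            proposal.append(draft(tokens + proposal))
        drafted += n
        # 2. The verifier checks each proposed position. This toy calls it once
        #    per position; a real system scores all positions in one forward pass.
        for i, tok in enumerate(proposal):
            expected = verifier(tokens + proposal[:i])
            if tok != expected:
                # 3. First mismatch: keep the agreed prefix plus the verifier's token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
            accepted += 1
        else:
            # Every proposed token matched the verifier.
            tokens.extend(proposal)
    return tokens, drafted, accepted


def verifier(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def draft(prefix: List[int]) -> int:
    # Agrees with the verifier except right after a token equal to 3.
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


tokens, drafted, accepted = speculative_greedy(draft, verifier, [0], 8, k=4)
print(tokens, drafted, accepted)
```

With greedy verification the result matches what the verifier alone would have generated; the draft model only changes how many verifier passes are needed.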
Requires
  • · transformers
  • · torch
Preconditions
  • · both models loaded
  • · compatible tokenizers (draft and verifier agree on token ids)
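One way to check the tokenizer precondition before generating, sketched under the assumption that "compatible" means both tokenizers map every token string to the same id. The helper `vocab_mismatches` and the `StubTok` class are hypothetical; the only method relied on is `get_vocab()`, which Hugging Face tokenizers provide.

```python
from typing import Dict, Optional, Tuple


class StubTok:
    """Hypothetical stand-in for a tokenizer; only get_vocab() is used."""

    def __init__(self, vocab: Dict[str, int]):
        self._vocab = vocab

    def get_vocab(self) -> Dict[str, int]:
        return dict(self._vocab)


def vocab_mismatches(draft_tok, verifier_tok) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Map each token whose id differs or is missing to (draft_id, verifier_id)."""
    d, v = draft_tok.get_vocab(), verifier_tok.get_vocab()
    return {t: (d.get(t), v.get(t)) for t in d.keys() | v.keys() if d.get(t) != v.get(t)}


same = vocab_mismatches(StubTok({"a": 0, "b": 1}), StubTok({"a": 0, "b": 1}))
diff = vocab_mismatches(StubTok({"a": 0, "b": 1, "c": 2}), StubTok({"a": 0, "b": 3}))
print(same, diff)
```

An empty result suggests the two models can exchange token ids directly; any entry is a reason to stop before decoding rather than debug garbled output later.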
Failure modes
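  • · draft rejection rate too high (few proposed tokens accepted, so little speedup)
  • · draft model too slow (drafting cost cancels the gain)
  • · tokenizer mismatch between draft and verifier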
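The first two failure modes can be estimated before running anything. The sketch below uses the expected-speedup estimate commonly given in the speculative decoding literature, which assumes each draft token is accepted independently with the same probability. The parameter names (`alpha` for acceptance rate, `gamma` for draft tokens per step, `c` for draft cost relative to one verifier pass) are labels chosen here, and the estimate is for a separate draft model, not for Medusa-style heads.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    alpha: probability that a draft token is accepted (0 <= alpha < 1)
    gamma: number of draft tokens proposed per verification step
    c:     cost of one draft pass divided by the cost of one verifier pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or c < 0:
        raise ValueError("gamma must be >= 1 and c must be >= 0")
    # Expected tokens produced per verification step (geometric series in alpha).
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one step: gamma draft passes plus one verifier pass.
    return tokens_per_step / (gamma * c + 1)


good = expected_speedup(alpha=0.8, gamma=4, c=0.05)       # high acceptance, cheap draft
rejecting = expected_speedup(alpha=0.3, gamma=4, c=0.05)  # rejection rate too high
slow = expected_speedup(alpha=0.3, gamma=4, c=0.5)        # low acceptance and slow draft
print(round(good, 2), round(rejecting, 2), round(slow, 2))
```

A value below 1.0 means the setup is expected to be slower than running the verifier alone.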
Trust signals
  • · latency breakdown tracking
  • · acceptance rate monitoring
  • · Medusa (arXiv 2401.10774)
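A small sketch of how acceptance rate monitoring and latency breakdown tracking might be wired together. The `SpecStats` class and its phase names are hypothetical, not part of `transformers` or `torch`; it only uses the Python standard library.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SpecStats:
    """Hypothetical accumulator for acceptance counts and per-phase wall time."""
    drafted: int = 0
    accepted: int = 0
    seconds: Dict[str, float] = field(default_factory=dict)

    def record(self, drafted: int, accepted: int) -> None:
        # Call once per verification step with that step's counts.
        self.drafted += drafted
        self.accepted += accepted

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.drafted if self.drafted else 0.0

    @contextmanager
    def timed(self, phase: str):
        # Adds the elapsed wall time of the with-block to the named phase.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[phase] = self.seconds.get(phase, 0.0) + elapsed


stats = SpecStats()
with stats.timed("draft"):
    pass  # run the draft model here
with stats.timed("verify"):
    pass  # run the verifier here
stats.record(drafted=8, accepted=7)
print(f"acceptance={stats.acceptance_rate:.3f}", sorted(stats.seconds))
```

On a GPU, kernels run asynchronously, so these timings are only meaningful if the device is synchronized (for example with `torch.cuda.synchronize()`) before each timer stops.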