Accelerate inference with speculative decoding
speculative-decoding · skill · setup · L3 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Accelerates LLM inference by letting a small draft model propose several tokens that a larger model then verifies.
Best for
Workloads that want a 2-8x inference speedup from pairing a small draft model with a large verifier that validates its tokens.
Inputs
- draft model (small)
- verifier model (large)
- prompt
Outputs
- generated text
- validated draft tokens
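
To make the inputs and outputs above concrete, here is a minimal sketch of a greedy draft-then-verify loop. It is not the skill's implementation: the two models are stood in for by hypothetical callables that map a list of token ids to the next token id, and `gamma` (the number of draft tokens proposed per step) is an assumed parameter name. A real verifier would score all draft positions in one batched forward pass; the sketch calls it per position for readability.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: token-id context -> next token id (greedy).
NextToken = Callable[[List[int]], int]


def speculative_decode(
    draft_next: NextToken,
    verifier_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    gamma: int = 4,
) -> Tuple[List[int], int, int]:
    """Greedy draft-then-verify loop.

    Returns (generated tokens, draft tokens proposed, draft tokens accepted).
    """
    tokens = list(prompt)
    proposed = accepted = 0
    target_len = len(prompt) + max_new_tokens

    while len(tokens) < target_len:
        # 1) The draft model proposes up to `gamma` tokens autoregressively.
        k = min(gamma, target_len - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        proposed += k

        # 2) The verifier checks the proposals left to right.
        for i, tok in enumerate(draft):
            expected = verifier_next(tokens + draft[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the verifier's own token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                accepted += i
                break
        else:
            # Every draft token was accepted.
            tokens.extend(draft)
            accepted += k
            # The verifier contributes one extra token if there is room.
            if len(tokens) < target_len:
                tokens.append(verifier_next(tokens))

    return tokens[len(prompt):], proposed, accepted


# Toy models for illustration only: the verifier counts upward modulo 10,
# and the draft agrees except after tokens where token % 3 == 2.
def toy_verifier(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10
```

With greedy verification the generated text matches what the verifier would produce on its own; the draft only changes how many verifier steps are needed. In Hugging Face transformers, a comparable setup is available by passing `assistant_model=` to `generate`.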
Requires
- transformers
- torch
Preconditions
- both models loaded
- compatible tokenizers
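
One way to check the tokenizer precondition before decoding is to compare the two token-to-id maps. The helper below is a hypothetical sketch (the name `vocab_mismatches` is not from the skill), and it assumes "compatible" means identical token-to-id maps, which may be stricter than a given setup needs. It works on plain dictionaries; with transformers these could be obtained from each tokenizer's `get_vocab()`.

```python
from typing import Dict, List


def vocab_mismatches(
    draft_vocab: Dict[str, int],
    verifier_vocab: Dict[str, int],
    limit: int = 5,
) -> List[str]:
    """Return up to `limit` differences between two token -> id maps.

    An empty list means both vocabularies map every token to the same id.
    """
    problems: List[str] = []
    for token in sorted(set(draft_vocab) | set(verifier_vocab)):
        d = draft_vocab.get(token)      # None if the draft lacks the token
        v = verifier_vocab.get(token)   # None if the verifier lacks the token
        if d != v:
            problems.append(f"{token!r}: draft id={d}, verifier id={v}")
            if len(problems) >= limit:
                break
    return problems
```

Failing fast on a non-empty result is cheaper than discovering a tokenizer mismatch through a collapsed acceptance rate.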
Failure modes
- draft rejection rate too high
- draft model too slow
- tokenizer mismatch
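
The first two failure modes can be estimated before running anything. The sketch below uses the expected-speedup expression from the standard analysis of speculative decoding (Leviathan et al., 2023), which assumes each draft token is accepted independently with the same probability. The parameter names `alpha` (acceptance probability), `gamma` (draft tokens per step) and `cost_ratio` (draft step time divided by verifier step time) are chosen here for illustration.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain decoding with the verifier.

    alpha:      probability that a single draft token is accepted (0..1)
    gamma:      number of draft tokens proposed per verification step
    cost_ratio: time of one draft step / time of one verifier step
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 1 or cost_ratio < 0:
        raise ValueError("gamma must be >= 1 and cost_ratio >= 0")

    # Expected tokens produced per verification step:
    # (1 - alpha^(gamma + 1)) / (1 - alpha), which tends to gamma + 1 as alpha -> 1.
    if alpha == 1.0:
        tokens_per_step = float(gamma + 1)
    else:
        tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)

    # Each step costs gamma draft calls plus one verifier call,
    # measured in units of one verifier call.
    return tokens_per_step / (gamma * cost_ratio + 1)
```

A result below 1.0 means the configuration is expected to be slower than decoding with the verifier alone, which is what a high rejection rate or a slow draft model looks like in this model.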
Trust signals
- latency breakdown tracking
- acceptance rate monitoring
- Medusa (arXiv:2401.10774)
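
Acceptance-rate monitoring and a latency breakdown can be collected with a small bookkeeping object. The class below is a hypothetical sketch using only the standard library; `SpecDecodeStats` and its method names are illustrative, not part of the skill. When timing GPU work with torch, calling `torch.cuda.synchronize()` before leaving each timed block would be needed for the wall-clock numbers to be meaningful.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator


class SpecDecodeStats:
    """Tracks draft acceptance and time spent per phase (e.g. draft, verify)."""

    def __init__(self) -> None:
        self.proposed = 0
        self.accepted = 0
        self.seconds: Dict[str, float] = defaultdict(float)

    def record(self, proposed: int, accepted: int) -> None:
        """Record one verification step."""
        if not 0 <= accepted <= proposed:
            raise ValueError("need 0 <= accepted <= proposed")
        self.proposed += proposed
        self.accepted += accepted

    @property
    def acceptance_rate(self) -> float:
        """Fraction of proposed draft tokens that the verifier accepted."""
        return self.accepted / self.proposed if self.proposed else 0.0

    @contextmanager
    def timed(self, phase: str) -> Iterator[None]:
        """Accumulate wall-clock time for the named phase."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[phase] += time.perf_counter() - start

    def latency_breakdown(self) -> Dict[str, float]:
        """Share of total measured time spent in each phase."""
        total = sum(self.seconds.values())
        return {p: (s / total if total else 0.0) for p, s in self.seconds.items()}
```

Wrapping the draft and verify calls in `with stats.timed("draft"):` and `with stats.timed("verify"):` and calling `stats.record(...)` after each step gives both signals from the same run.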