Compress large models without retraining the teacher
knowledge-distillation · skill · setup · L3 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Compress LLMs via teacher-student distillation
Best for
Retaining 90%+ of large-model performance in a smaller, deployable student (70B→7B).
Inputs
- · teacher model (70B)
- · student model (7B)
- · training data
- · temperature (float)
- · alpha weight
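
How the temperature and alpha inputs typically enter the objective can be sketched for a single token position. This is a framework-free illustration, not the skill's actual code: the function names are hypothetical, and the convention that alpha weights the soft-target term (with 1 - alpha on the hard-label term) and that the soft term is scaled by T² are assumptions borrowed from common distillation practice.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of temperature-scaled logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).

    Assumption: alpha weights the soft-target term; some codebases
    use the opposite convention.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    # Soft-target term: forward KL between the softened distributions.
    soft = sum(math.exp(lt) * (lt - ls)
               for lt, ls in zip(log_p_teacher, log_p_student))
    # Hard term: cross-entropy against the ground-truth label at T = 1.
    hard = -log_softmax(student_logits)[label]
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

# Hypothetical logits over a 3-token vocabulary.
loss = distillation_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0], label=0)
```

A higher temperature flattens both distributions, so the student sees more of the teacher's relative preferences among non-top tokens. In a real training loop the same computation would be done on batched tensors.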
Outputs
- · distilled student model
- · checkpoints
Requires
- · transformers
- · torch
- · deepspeed (optional)
- · accelerate
Preconditions
- · teacher loaded
- · student initialized
- · training data prepared
- · VRAM available
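
A rough way to check the VRAM precondition is a weights-only estimate. The sketch below assumes 2 bytes per parameter (fp16/bf16) and decimal gigabytes. It ignores activations, optimizer state, gradients, and KV cache, so real requirements are higher, especially for the student being trained. The helper name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for model weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

teacher_gb = weight_memory_gb(70_000_000_000)  # 70B teacher, inference only
student_gb = weight_memory_gb(7_000_000_000)   # 7B student, weights only
```

Under these assumptions the teacher's weights alone exceed a single common accelerator's memory, which is why teacher OOM appears among the failure modes.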
Failure modes
- · teacher OOM
- · slow convergence
- · student divergence
Trust signals
- · Reverse KLD (MiniLLM): mode-seeking, so the student avoids overestimating low-probability regions of the teacher distribution
- · temperature scaling is a well-established way to soften teacher logits
- · combines a soft-target loss with a hard-label loss
- · arXiv 2306.08543
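
The difference between forward and reverse KLD can be seen on a toy distribution. The numbers below are made up for illustration and are not taken from the MiniLLM paper: a teacher with two modes, one student that spreads its mass evenly, and one that commits to a single mode.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.48, 0.02, 0.02, 0.48]  # two modes with a low-probability gap
spread = [0.25, 0.25, 0.25, 0.25]   # student covering every token
peaked = [0.94, 0.02, 0.02, 0.02]   # student committing to one mode

# Forward KLD, KL(teacher || student): penalizes missing teacher mass.
forward = {"spread": kl(teacher, spread), "peaked": kl(teacher, peaked)}
# Reverse KLD, KL(student || teacher): penalizes student mass placed
# where the teacher assigns low probability.
reverse = {"spread": kl(spread, teacher), "peaked": kl(peaked, teacher)}
```

Here forward KLD scores the spread student lower (better), while reverse KLD scores the peaked student lower: reverse KLD tolerates dropping a mode but punishes putting mass in the teacher's low-probability gap.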