cyberneticlibrary

Compress large models by distilling them into smaller students

knowledge-distillation · skill · setup · L39,423
Orchestra-Research/AI-Research-SKILLs
What it does

Compresses large language models via teacher-student distillation.

Best for

Retaining 90%+ of large-model performance in a smaller, deployable student (e.g., 70B→7B).

Inputs
  • · teacher model (70B)
  • · student model (7B)
  • · training data
  • · temperature (float)
  • · alpha (float; weights the soft-target loss against the hard-label loss)
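
To show how the temperature and alpha inputs might interact, here is a minimal sketch of a combined distillation loss in PyTorch. The function name `distillation_loss`, the default values, and the convention that alpha weights the soft-target term (with 1 - alpha on the hard-label term) are assumptions for illustration, not this skill's confirmed implementation.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy.

    Assumed convention: alpha weights the soft (teacher) term and
    (1 - alpha) weights the hard (ground-truth) term.
    """
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Forward KL(teacher || student). Multiplying by T^2 keeps the
    # gradient magnitude comparable across temperatures.
    soft_loss = F.kl_div(student_log_probs, teacher_probs,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss


# Toy usage: a batch of 4 examples over a 10-token vocabulary.
torch.manual_seed(0)
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In a real training loop the teacher logits would typically be computed under `torch.no_grad()` so that no gradients or activations are retained for the teacher.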
Outputs
  • · distilled student model
  • · checkpoints
Requires
  • · transformers
  • · torch
  • · deepspeed (optional)
  • · accelerate
Preconditions
  • · teacher loaded
  • · student initialized
  • · training data prepared
  • · VRAM available
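
As a rough way to check the VRAM precondition, the sketch below estimates memory from parameter counts alone. The helper name `weight_memory_gb` is hypothetical, and the byte counts are assumptions: 2 bytes per parameter for fp16/bf16 weights, and about 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights and two optimizer states). Activations, KV cache, and framework overhead are excluded, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory in GB (10^9 bytes) for a given bytes-per-parameter."""
    return num_params * bytes_per_param / 1e9


# Teacher is inference-only: weights in fp16/bf16 (assumed 2 bytes/param).
teacher_gb = weight_memory_gb(70e9)

# Student weights alone, same precision.
student_weights_gb = weight_memory_gb(7e9)

# Student under mixed-precision Adam (assumed ~16 bytes/param).
student_training_gb = weight_memory_gb(7e9, bytes_per_param=16)

print(f"teacher weights:  {teacher_gb:.0f} GB")
print(f"student weights:  {student_weights_gb:.0f} GB")
print(f"student training: {student_training_gb:.0f} GB")
```

Under these assumptions the teacher alone exceeds a single 80 GB accelerator, which is one reason sharding or offloading (for example via deepspeed or accelerate) may be needed.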
Failure modes
  • · teacher OOM
  • · slow convergence
  • · student divergence
Trust signals
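  • · Reverse KLD (MiniLLM) for mode-seeking behavior: the student concentrates on the teacher's major modes instead of overestimating its low-probability regions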
  • · temperature scaling of soft targets is a well-established distillation technique
  • · soft-target loss combined with hard-label loss
  • · arXiv:2306.08543 (MiniLLM)
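
The difference between forward and reverse KLD can be seen in a toy example. The sketch below is a token-level illustration with made-up distributions, not MiniLLM's sequence-level optimization: the teacher has two strong modes separated by a low-probability token, and two hypothetical students are compared, one focused on a single mode and one spread uniformly.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative distributions over three tokens (assumed values).
teacher = [0.495, 0.01, 0.495]   # two modes with a low-probability gap
students = {
    "focused": [0.98, 0.01, 0.01],   # commits to one teacher mode
    "spread": [1 / 3, 1 / 3, 1 / 3], # covers everything, including the gap
}

# Forward KLD: KL(teacher || student). Reverse KLD: KL(student || teacher).
forward = {name: kl(teacher, q) for name, q in students.items()}
reverse = {name: kl(q, teacher) for name, q in students.items()}

for name in students:
    print(f"{name}: forward={forward[name]:.3f} reverse={reverse[name]:.3f}")
```

Forward KLD scores the spread student better because it penalizes missing any teacher mode, while reverse KLD scores the focused student better because it penalizes placing mass where the teacher has little.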