Train sparse Mixture of Experts models

moe-training · skill · setup · L49,423
Orchestra-Research/AI-Research-SKILLs
What it does

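Trains sparse Mixture of Experts (MoE) models, in which a router activates only a few experts per token, so total parameter count can grow without a matching increase in per-token compute.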

Best for

Scaling model capacity without proportional compute via conditional expert routing.
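
To make "conditional expert routing" concrete, here is a minimal framework-free sketch of top-k routing for a single token. The names `softmax` and `route_top_k`, the choice of k=2, and the example scores are all illustrative assumptions; the source does not specify how this skill routes tokens.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-probability experts for one token.

    Returns (expert_index, weight) pairs; the weights are renormalized
    so they sum to 1 over the selected experts only.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router scores for one token over 4 experts.
selected = route_top_k([2.0, 0.5, 1.0, -1.0], k=2)
```

Only the selected experts run for that token, which is why compute per token depends on k and not on the total number of experts.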

Inputs
  • base model
  • num_experts (int)
  • expert_dim
  • training data
Outputs
  • trained MoE model
  • gate metrics
  • expert utilization
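
As a rough illustration of what an expert utilization output could contain, the sketch below summarizes per-token expert assignments. The function name `expert_utilization`, the returned keys, and the use of normalized entropy as a balance score are assumptions made for this example, not the skill's documented output format.

```python
import math
from collections import Counter

def expert_utilization(assignments, num_experts):
    """Summarize how evenly tokens are spread across experts.

    assignments: list of expert indices, one per routed token.
    Returns per-expert token fractions, the largest fraction, and a
    normalized entropy in [0, 1] (1 = perfectly even, 0 = one expert).
    """
    if not assignments:
        raise ValueError("assignments must not be empty")
    counts = Counter(assignments)
    total = len(assignments)
    fractions = [counts.get(e, 0) / total for e in range(num_experts)]
    entropy = -sum(f * math.log(f) for f in fractions if f > 0)
    normalized = entropy / math.log(num_experts) if num_experts > 1 else 0.0
    return {
        "fractions": fractions,
        "max_fraction": max(fractions),
        "normalized_entropy": normalized,
    }

balanced = expert_utilization([0, 1, 2, 3], num_experts=4)
collapsed = expert_utilization([0, 0, 0, 0], num_experts=4)
```

A normalized entropy near 0 together with a `max_fraction` near 1 is the signature of the expert collapse failure mode.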
Requires
  • transformers
  • torch
  • deepspeed
Preconditions
  • base model loaded
  • training data prepared
Failure modes
  • expert collapse (all tokens routed to a single expert)
  • load imbalance
  • routing instability
Trust signals
  • expert load balancing metrics
  • router convergence tracking
  • auxiliary loss prevents collapse
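
The source does not say which auxiliary loss this skill uses. One common formulation, in the style of the Switch Transformer load-balancing loss, multiplies the fraction of tokens dispatched to each expert by the mean router probability for that expert. The sketch below computes it in plain Python; `load_balancing_loss`, `alpha`, and the example probabilities are assumptions for illustration.

```python
def load_balancing_loss(router_probs, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    router_probs: one list of per-expert probabilities for each token.
    f_i is the fraction of tokens whose top-1 expert is i;
    P_i is the mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    dispatch = [0] * num_experts
    prob_sum = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        dispatch[top1] += 1
        for i, p in enumerate(probs):
            prob_sum[i] += p
    f = [c / num_tokens for c in dispatch]
    P = [s / num_tokens for s in prob_sum]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Hypothetical routing over 2 experts, with alpha=1.0 for readability.
balanced_loss = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]], 2, alpha=1.0)
collapsed_loss = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]], 2, alpha=1.0)
```

The loss is lowest when routing is even and grows as tokens concentrate on one expert. In actual training it would be computed on tensors (for example with torch) so that gradients flow through the router probabilities, then added to the task loss.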