Train sparse Mixture of Experts models
moe-training · skill · setup · L4 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
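Train sparse mixture-of-experts (MoE) models, in which a router activates only a few experts per token, so total parameter count can grow while per-token compute stays roughly constant.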
Best for
Scaling model capacity without proportional compute via conditional expert routing.
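To make "conditional expert routing" concrete, here is a minimal sketch of top-k gating for a single token. It is plain Python for readability; in real training these would be batched tensor operations. The names `top_k_route` and `moe_forward`, the toy scalar experts, and the choice of k=2 are illustrative assumptions, not part of the skill.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-probability experts.

    Weights are renormalized over the selected experts so they sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy experts (hypothetical): expert i multiplies its input by a fixed scale.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
y = moe_forward(1.0, logits, experts, k=2)
print(routes, y)
```

Only 2 of the 4 experts are evaluated for this token, which is where the compute saving comes from.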
Inputs
- · base model
- · num_experts (int)
- · expert_dim
- · training data
Outputs
- · trained MoE model
- · gate metrics
- · expert utilization
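The card does not say how expert utilization is reported. One plausible sketch, assuming you have logged the top-1 expert index for each token, is below; the function name `expert_utilization` and the specific metrics are assumptions for illustration.

```python
import math
from collections import Counter

def expert_utilization(assignments, num_experts):
    """Summarize how evenly tokens are spread across experts.

    assignments: list of expert indices, one per routed token.
    """
    if num_experts < 2 or not assignments:
        raise ValueError("need at least 2 experts and 1 assignment")
    counts = Counter(assignments)
    total = len(assignments)
    fractions = [counts.get(e, 0) / total for e in range(num_experts)]
    # Shannon entropy of the load distribution; zero-load experts contribute 0.
    entropy = -sum(f * math.log(f) for f in fractions if f > 0)
    return {
        "fractions": fractions,
        # 1.0 means perfectly uniform load, 0.0 means a single expert gets everything.
        "normalized_entropy": entropy / math.log(num_experts),
        # Busiest expert's load relative to the uniform share (1.0 is ideal).
        "max_over_mean": max(fractions) * num_experts,
        "unused_experts": [e for e, f in enumerate(fractions) if f == 0],
    }

balanced = expert_utilization([0, 1, 2, 3] * 4, num_experts=4)
collapsed = expert_utilization([2] * 16, num_experts=4)
print(balanced["normalized_entropy"], collapsed["unused_experts"])
```

A normalized entropy drifting toward 0, or a growing list of unused experts, is an early sign of the collapse failure mode.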
Requires
- · transformers
- · torch
- · deepspeed
Preconditions
- · base model loaded
- · training data prepared
Failure modes
- · expert collapse (all tokens routed to a single expert)
- · load imbalance
- · routing instability
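Routing instability can be made measurable in several ways. As one hedged example, the sketch below computes the fraction of tokens in a fixed probe batch whose top-1 expert changed between two checkpoints. The name `routing_churn` and the idea of a fixed probe batch are assumptions, not something the skill specifies.

```python
def routing_churn(prev_assignments, curr_assignments):
    """Fraction of tokens whose assigned expert differs between two checkpoints.

    Both lists must cover the same tokens in the same order.
    """
    if len(prev_assignments) != len(curr_assignments):
        raise ValueError("assignment lists must have equal length")
    if not prev_assignments:
        return 0.0
    changed = sum(1 for a, b in zip(prev_assignments, curr_assignments) if a != b)
    return changed / len(prev_assignments)

# One of four tokens moved from expert 2 to expert 3.
churn = routing_churn([0, 1, 2, 3], [0, 1, 3, 3])
print(churn)
```

Churn that stays high late in training suggests the router has not settled; churn near zero together with poor load balance may point to collapse instead.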
Trust signals
- · expert load balancing metrics
- · router convergence tracking
- · auxiliary load-balancing loss to discourage collapse
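The card does not state which auxiliary loss is used. One common formulation, in the style of the Switch Transformer, multiplies each expert's share of dispatched tokens by its mean router probability and sums over experts. The sketch below computes it in plain Python for top-1 routing; the function name is hypothetical, and in training the result would be scaled by a small coefficient and added to the task loss.

```python
def load_balancing_loss(router_probs, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i(f_i * P_i).

    router_probs: one list of per-expert probabilities for each token.
    f_i: fraction of tokens whose top-1 expert is i.
    P_i: mean router probability assigned to expert i.
    The value is 1.0 under perfectly uniform routing and grows with imbalance.
    """
    num_tokens = len(router_probs)
    if num_tokens == 0:
        raise ValueError("router_probs must not be empty")
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        f[top1] += 1.0 / num_tokens
        for i in range(num_experts):
            p[i] += probs[i] / num_tokens
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced_loss = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]], num_experts=2)
collapsed_loss = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]], num_experts=2)
print(balanced_loss, collapsed_loss)
```

In a tensor framework only the mean probabilities P_i carry gradient, since the token fractions f_i come from a non-differentiable argmax; the penalty therefore pushes router probability away from overloaded experts.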