Align models with SimPO
simpo-trainingskillsetup L3★9,423
Orchestra-Research/AI-Research-SKILLs ↗What it does
Train models with simple preference optimization
Best for
Quick preference optimization without reward model or RL infrastructure.
Inputs
- · Chat-format dataset
- · Preference pairs (chosen/rejected)
- · Learning rate
- · Batch size
Outputs
- · Fine-tuned model checkpoint
- · Training curves
- · Eval metrics
Requires
- · transformers
- · torch
- · datasets
Preconditions
- · Dataset with chosen/rejected fields
- · Base model specified
- · GPU memory >= 16GB
Failure modes
- · Preference labels conflicting
- · Learning rate too high → divergence
- · Batch size too small → high variance
Trust signals
- · Simpler than PPO/DPO
- · Direct preference pairs