cyberneticlibrary

Align models with SimPO

simpo-trainingskillsetup L39,423
Orchestra-Research/AI-Research-SKILLs
What it does

Train models with simple preference optimization

Best for

Quick preference optimization without reward model or RL infrastructure.

Inputs
  • · Chat-format dataset
  • · Preference pairs (chosen/rejected)
  • · Learning rate
  • · Batch size
Outputs
  • · Fine-tuned model checkpoint
  • · Training curves
  • · Eval metrics
Requires
  • · transformers
  • · torch
  • · datasets
Preconditions
  • · Dataset with chosen/rejected fields
  • · Base model specified
  • · GPU memory >= 16GB
Failure modes
  • · Preference labels conflicting
  • · Learning rate too high → divergence
  • · Batch size too small → high variance
Trust signals
  • · Simpler than PPO/DPO
  • · Direct preference pairs