cyberneticlibrary

Fine-tune models with GRPO

grpo-rl-trainingskillsetup L39,423
Orchestra-Research/AI-Research-SKILLs
What it does

Fine-tune models with group relative rewards

Best for

Teaching specific output formats (XML, JSON) and verifiable tasks without preference pairs.

Inputs
  • · Chat-format dataset
  • · System prompt
  • · Reward functions
  • · Ground truth answers (optional)
Outputs
  • · Fine-tuned model checkpoint
  • · Training logs
  • · Eval results
Requires
  • · TRL>=0.14.0
  • · transformers>=4.47.0
  • · torch
  • · PEFT
Preconditions
  • · Reward functions tested independently
  • · Chat format correct
  • · Group size 4-16
  • · GPU memory >= 40GB
Failure modes
  • · Poorly-designed reward → gaming
  • · Group size too small
  • · Chat format wrong
  • · Missing ground truth
Trust signals
  • · Orchestra Research skill
  • · TRL official GRPO
  • · More sample-efficient than PPO
  • · Tested on reasoning