cyberneticlibrary

Fine-tune models with GRPO

grpo-rl-trainingskillsetup L3★9,423

Orchestra-Research/AI-Research-SKILLs ↗

What it does

Fine-tune models with group relative rewards

Best for

Teaching specific output formats (XML, JSON) and verifiable tasks without preference pairs.

Inputs

· Chat-format dataset
· System prompt
· Reward functions
· Ground truth answers (optional)

Outputs

· Fine-tuned model checkpoint
· Training logs
· Eval results

Requires

· TRL>=0.14.0
· transformers>=4.47.0
· torch
· PEFT

Preconditions

· Reward functions tested independently
· Chat format correct
· Group size 4-16
· GPU memory >= 40GB

Failure modes

· Poorly-designed reward → gaming
· Group size too small
· Chat format wrong
· Missing ground truth

Trust signals

· Orchestra Research skill
· TRL official GRPO
· More sample-efficient than PPO
· Tested on reasoning