Fine-tune models with GRPO
grpo-rl-trainingskillsetup L3★9,423
Orchestra-Research/AI-Research-SKILLs ↗What it does
Fine-tune models with group relative rewards
Best for
Teaching specific output formats (XML, JSON) and verifiable tasks without preference pairs.
Inputs
- · Chat-format dataset
- · System prompt
- · Reward functions
- · Ground truth answers (optional)
Outputs
- · Fine-tuned model checkpoint
- · Training logs
- · Eval results
Requires
- · TRL>=0.14.0
- · transformers>=4.47.0
- · torch
- · PEFT
Preconditions
- · Reward functions tested independently
- · Chat format correct
- · Group size 4-16
- · GPU memory >= 40GB
Failure modes
- · Poorly-designed reward → gaming
- · Group size too small
- · Chat format wrong
- · Missing ground truth
Trust signals
- · Orchestra Research skill
- · TRL official GRPO
- · More sample-efficient than PPO
- · Tested on reasoning