Scale RLHF training with Ray
openrlhf-training · skill · setup · L4 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Distribute RLHF training across multi-GPU clusters
Best for
Scaling PPO/GRPO/RLOO/DPO training to 70B+ models with multi-node vLLM.
Inputs
- · Base model (7B-70B+)
- · Preference dataset
- · Reward model (optional)
- · Ray cluster config
Outputs
- · RLHF-trained checkpoint
- · Training logs
- · vLLM inference artifact
Requires
- · OpenRLHF
- · Ray
- · vLLM
- · DeepSpeed ZeRO-3
- · torch
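
As a rough sizing aid for the DeepSpeed ZeRO-3 requirement, the sketch below estimates per-GPU memory for model states under mixed-precision Adam. It uses the commonly cited figure of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state). The function name is hypothetical, and the estimate assumes perfectly even sharding; activations, vLLM engines, and any reference, reward, or critic models are not counted, so real usage will be higher.

```python
# fp16 weights + fp16 gradients + fp32 Adam state (master weights, momentum, variance)
BYTES_PER_PARAM = 2 + 2 + 12


def zero3_model_state_gb(num_params: float, num_gpus: int) -> float:
    """Per-GPU model-state memory in GB (1e9 bytes), assuming ZeRO-3 shards all states evenly."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return num_params * BYTES_PER_PARAM / num_gpus / 1e9


if __name__ == "__main__":
    # Illustrative model sizes and GPU counts, not recommendations.
    for size, gpus in [(7e9, 8), (70e9, 64)]:
        estimate = zero3_model_state_gb(size, gpus)
        print(f"{size / 1e9:.0f}B params on {gpus} GPUs: {estimate:.1f} GB per GPU")
```

Under these assumptions, a 70B model sharded over 64 GPUs needs about 17.5 GB per GPU for model states alone.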
Preconditions
- · Ray cluster running
- · vLLM accessible
- · Preference dataset ready
- · GPUs with >= 24 GB memory on every node
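
Two of the preconditions above (a running Ray cluster and the 24 GB GPU threshold) can be checked before launching a job. The sketch below is one possible preflight, not part of OpenRLHF: `preflight` and `gather_live_inputs` are hypothetical names, the threshold is interpreted as 24 × 10^9 bytes, and the memory check only sees GPUs on the machine where it runs. It does not verify vLLM reachability or dataset readiness.

```python
from typing import Dict, List, Tuple

MIN_GPU_BYTES = 24 * 10**9  # assumption: "24GB" read as decimal gigabytes


def preflight(resources: Dict[str, float], gpu_bytes: List[int], required_gpus: int) -> List[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    available = int(resources.get("GPU", 0))
    if available < required_gpus:
        problems.append(f"cluster exposes {available} GPUs, need {required_gpus}")
    for index, total in enumerate(gpu_bytes):
        if total < MIN_GPU_BYTES:
            problems.append(f"GPU {index} has {total / 1e9:.1f} GB, need at least 24 GB")
    return problems


def gather_live_inputs() -> Tuple[Dict[str, float], List[int]]:
    """Collect inputs from a live cluster. Imports are local so preflight() needs neither library."""
    import ray
    import torch

    ray.init(address="auto")  # attach to an already running Ray cluster
    resources = ray.cluster_resources()  # e.g. {"CPU": 64.0, "GPU": 8.0, ...}
    # Only the GPUs visible to this process on the local node are inspected.
    gpu_bytes = [
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ]
    return resources, gpu_bytes
```

Calling `preflight(*gather_live_inputs(), required_gpus=8)` on a cluster node returns the list of problems; the value 8 here is only an example.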
Failure modes
- · vLLM engine dies
- · ZeRO-3 CPU RAM exhausted
- · Preference format mismatch
- · OOM when models are colocated on the same GPUs
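
The preference format mismatch listed above is cheap to catch before a job is submitted. The sketch below assumes each record is a dict with string-valued `prompt`, `chosen`, and `rejected` fields; those key names and the function name are assumptions for illustration, so adjust them to whatever schema your dataset and trainer configuration actually use (chat-style records that store message lists would need a different check).

```python
from typing import Dict, Iterable, List, Tuple

REQUIRED_KEYS = ("prompt", "chosen", "rejected")  # assumed schema, not mandated by any library


def find_format_errors(records: Iterable[Dict[str, object]]) -> List[Tuple[int, str]]:
    """Return (record index, message) pairs for records that do not fit the assumed schema."""
    errors = []
    for index, record in enumerate(records):
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                errors.append((index, f"'{key}' is missing or not a non-empty string"))
        chosen = record.get("chosen")
        # A pair with identical responses carries no preference signal.
        if isinstance(chosen, str) and chosen == record.get("rejected"):
            errors.append((index, "'chosen' and 'rejected' are identical"))
    return errors


if __name__ == "__main__":
    sample = [
        {"prompt": "2+2?", "chosen": "4", "rejected": "5"},
        {"prompt": "Capital of France?", "chosen": "Paris"},  # 'rejected' is missing
    ]
    for index, message in find_format_errors(sample):
        print(f"record {index}: {message}")
```

Running the validator over a small sample of the dataset first keeps the check fast while still surfacing systematic schema problems.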
Trust signals
- · Orchestra skill
- · OpenRLHF with Ray/vLLM
- · 2× faster than DeepSpeed-Chat
- · Production-tested