cyberneticlibrary

Scale RLHF training with Ray

openrlhf-training · skill · setup · L4 · ★9,423

Orchestra-Research/AI-Research-SKILLs

What it does

Distributes RLHF training across multi-node, multi-GPU Ray clusters.

Best for

Scaling PPO/GRPO/RLOO/DPO training to 70B+ models with multi-node vLLM.

Inputs

· Base model (7B-70B+)
· Preference dataset
· Reward model (optional)
· Ray cluster config

Outputs

· RLHF-trained checkpoint
· Training logs
· vLLM inference artifact

Requires

· OpenRLHF
· Ray
· vLLM
· DeepSpeed ZeRO-3
· torch
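
A quick way to confirm the listed packages are importable before launching a job is sketched below. The module names (`openrlhf`, `ray`, `vllm`, `deepspeed`, `torch`) are assumed to match the usual import names of these packages, and `missing_modules` is a hypothetical helper, not part of any of them. ZeRO-3 is a DeepSpeed configuration stage rather than a separate package, so it is covered by the `deepspeed` entry.

```python
import importlib.util
from typing import Callable, Iterable, List

# Assumed import names for the requirements listed above.
REQUIRED_MODULES = ("openrlhf", "ray", "vllm", "deepspeed", "torch")


def missing_modules(
    names: Iterable[str] = REQUIRED_MODULES,
    finder: Callable[[str], object] = importlib.util.find_spec,
) -> List[str]:
    """Return the module names that cannot be located in this environment."""
    # find_spec returns None for a top-level module that is not installed;
    # it does not import the module, so the check stays cheap.
    return [name for name in names if finder(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that the packages can be found; it does not verify that the installed versions are compatible with each other.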

Preconditions

· Ray cluster running
· vLLM accessible
· Preference dataset ready
· At least 24 GB of GPU memory per node
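
The cluster and GPU preconditions above can be checked before a run with a small preflight script. The following is a minimal sketch, not part of OpenRLHF: `check_preconditions` and `probe_live_cluster` are hypothetical helpers, the 24 GB threshold is read as decimal gigabytes applied to each GPU visible on the local node (one interpretation of the requirement), and the probe assumes a Ray cluster is already running and reachable via `address="auto"`.

```python
from typing import Dict, List, Tuple

GB = 10 ** 9  # decimal gigabytes, so a card marketed as 24 GB passes


def check_preconditions(
    cluster_resources: Dict[str, float],
    gpu_memory_bytes: List[int],
    min_gpus: int = 1,
    min_gpu_gb: float = 24.0,
) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    gpus = int(cluster_resources.get("GPU", 0))
    if gpus < min_gpus:
        problems.append(f"Ray reports {gpus} GPU(s), need at least {min_gpus}")
    for index, total in enumerate(gpu_memory_bytes):
        if total < min_gpu_gb * GB:
            problems.append(
                f"GPU {index} has {total / GB:.1f} GB, need >= {min_gpu_gb:.0f} GB"
            )
    return problems


def probe_live_cluster() -> Tuple[Dict[str, float], List[int]]:
    """Collect real values; requires ray, torch, and a running Ray cluster."""
    import ray    # imported lazily so the pure check above has no dependencies
    import torch

    ray.init(address="auto")           # attach to the existing cluster
    resources = ray.cluster_resources()
    memory = [
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ]
    return resources, memory


if __name__ == "__main__":
    for problem in check_preconditions(*probe_live_cluster()):
        print("PRECONDITION FAILED:", problem)
```

The GPU memory probe only sees devices on the machine where the script runs; checking every node would need the same probe executed as a Ray task on each one.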

Failure modes

· vLLM engine dies
· ZeRO-3 CPU RAM exhausted
· Preference format mismatch
· Out-of-memory (OOM) errors when models are colocated on the same GPUs
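
Of the failure modes above, the preference format mismatch is the cheapest to catch before any GPU time is spent. The sketch below validates records ahead of training, assuming each record is a mapping with `chosen` and `rejected` fields holding either text or a list of chat messages. Those field names are an assumption for illustration and should be changed to whatever keys the training run is configured to read; `find_format_errors` is a hypothetical helper.

```python
from typing import Any, Iterable, List, Tuple


def find_format_errors(
    records: Iterable[Any],
    chosen_key: str = "chosen",      # assumed field name, adjust as needed
    rejected_key: str = "rejected",  # assumed field name, adjust as needed
) -> List[Tuple[int, str]]:
    """Return (record index, message) pairs for every malformed record."""
    errors: List[Tuple[int, str]] = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            errors.append((index, "record is not a mapping"))
            continue
        fields_ok = True
        for key in (chosen_key, rejected_key):
            value = record.get(key)
            if key not in record:
                errors.append((index, f"missing key '{key}'"))
                fields_ok = False
            elif not isinstance(value, (str, list)) or len(value) == 0:
                errors.append(
                    (index, f"'{key}' must be a non-empty string or message list")
                )
                fields_ok = False
        # A pair with identical responses carries no preference signal.
        if fields_ok and record[chosen_key] == record[rejected_key]:
            errors.append((index, "chosen and rejected responses are identical"))
    return errors


if __name__ == "__main__":
    sample = [
        {"prompt": "Hi", "chosen": "Hello!", "rejected": "Go away."},
        {"prompt": "Hi", "chosen": "Hello!"},
    ]
    for index, message in find_format_errors(sample):
        print(f"record {index}: {message}")
```

Running a check like this over the whole dataset before submitting the job surfaces every bad record at once instead of failing partway through data loading.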

Trust signals

· Orchestra skill
· OpenRLHF with Ray/vLLM
· 2× faster than DeepSpeed-Chat
· Production-tested