Scale RLHF training with Ray

openrlhf-training · skill · setup · L49,423
Orchestra-Research/AI-Research-SKILLs
What it does

Distributes RLHF training across multi-GPU clusters using Ray.

Best for

Scaling PPO/GRPO/RLOO/DPO training to 70B+ models with multi-node vLLM.

Inputs
  • Base model (7B-70B+)
  • Preference dataset
  • Reward model (optional)
  • Ray cluster config
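A preference dataset is commonly stored as JSONL with one prompt and a preferred/dispreferred response pair per line. The sketch below checks that shape before a run is launched. The field names `prompt`, `chosen`, and `rejected` are assumptions for illustration, not a format confirmed by this skill; pass different key names if your dataset uses them.

```python
import json


def validate_preference_record(record, prompt_key="prompt",
                               chosen_key="chosen", rejected_key="rejected"):
    """Return a list of problems in one preference record (empty if it looks OK).

    The default key names are assumptions; adjust them to your dataset.
    """
    problems = []
    for key in (prompt_key, chosen_key, rejected_key):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and record[chosen_key] == record[rejected_key]:
        problems.append("chosen and rejected are identical")
    return problems


def validate_jsonl_lines(lines):
    """Check JSONL text lines; return {line_number: [problems]} for bad lines only."""
    bad = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            bad[number] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            bad[number] = ["record is not a JSON object"]
            continue
        problems = validate_preference_record(record)
        if problems:
            bad[number] = problems
    return bad
```

Running `validate_jsonl_lines(open(path, encoding="utf-8"))` on the dataset file before submitting the job surfaces malformed lines by number, which is cheaper than discovering them after the cluster has loaded the model.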
Outputs
  • RLHF-trained checkpoint
  • Training logs
  • vLLM inference artifact
Requires
  • OpenRLHF
  • Ray
  • vLLM
  • DeepSpeed ZeRO-3
  • torch
Preconditions
  • Ray cluster running
  • vLLM accessible
  • Preference dataset ready
  • GPUs with >= 24 GB memory on every node
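The cluster and GPU preconditions above can be checked before launch with a small preflight function. This is a minimal sketch, not part of OpenRLHF: `preflight` is a hypothetical helper that takes a resource mapping shaped like the dict returned by `ray.cluster_resources()` and a list of per-device memory sizes in bytes, so it can be exercised without a live cluster.

```python
def preflight(resources, gpu_memory_bytes, min_gpus=1, min_gpu_gb=24):
    """Return a list of unmet preconditions (empty if everything passes).

    resources        -- mapping such as {"CPU": 64.0, "GPU": 8.0}, the shape
                        returned by ray.cluster_resources()
    gpu_memory_bytes -- total memory of each visible GPU, in bytes
    min_gpu_gb       -- threshold in decimal GB (1e9 bytes); devices may report
                        slightly less than their nominal size, so a decimal
                        threshold is the more lenient choice
    """
    problems = []
    gpus = resources.get("GPU", 0)
    if gpus < min_gpus:
        problems.append(f"cluster reports {gpus:g} GPU(s), need at least {min_gpus}")
    for index, total in enumerate(gpu_memory_bytes):
        total_gb = total / 1e9
        if total_gb < min_gpu_gb:
            problems.append(
                f"device {index} has {total_gb:.1f} GB, need at least {min_gpu_gb} GB"
            )
    return problems
```

On a real cluster the inputs could come from `ray.init(address="auto")` followed by `ray.cluster_resources()`, and from `torch.cuda.get_device_properties(i).total_memory` for each local device; those calls are left out of the block so it stays runnable without Ray or a GPU.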
Failure modes
  • vLLM engine dies
  • ZeRO-3 CPU RAM exhausted
  • Preference format mismatch
  • Colocate OOM
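Both memory-related failure modes above can often be anticipated with back-of-the-envelope arithmetic. The sketch below uses the standard mixed-precision Adam accounting from the ZeRO work: roughly 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 optimizer state), which ZeRO-3 partitions across all GPUs. The host-RAM figure assumes the 12-byte optimizer state is offloaded to CPU and split evenly across nodes. Both numbers are rough lower bounds that ignore activations, vLLM's KV cache, and framework overhead; the function names are hypothetical.

```python
BYTES_WEIGHTS = 2      # half-precision parameters
BYTES_GRADS = 2        # half-precision gradients
BYTES_OPTIMIZER = 12   # fp32 master weights + Adam momentum + variance


def zero3_model_state_gb_per_gpu(num_params, num_gpus):
    """Approximate model-state memory per GPU under ZeRO-3, in decimal GB."""
    total_bytes = num_params * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIMIZER)
    return total_bytes / num_gpus / 1e9


def offloaded_optimizer_gb_per_node(num_params, num_nodes):
    """Approximate host RAM per node if optimizer state is offloaded to CPU.

    Assumes an even split across nodes; real usage is higher because of
    pinned buffers and the rest of the training process.
    """
    return num_params * BYTES_OPTIMIZER / num_nodes / 1e9
```

For a 70B-parameter model on 64 GPUs, `zero3_model_state_gb_per_gpu(70e9, 64)` gives 17.5 GB per GPU for model state alone, which leaves little headroom on a 24 GB device once a colocated vLLM engine and activations are added.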
Trust signals
  • Orchestra skill
  • OpenRLHF with Ray/vLLM
  • 2× faster than DeepSpeed-Chat
  • Production-tested