cyberneticlibrary

Scale RL training with VeRL

verl-rl-trainingskillsetup L4★9,423

Orchestra-Research/AI-Research-SKILLs ↗

What it does

Train LLMs with RL at scale using verl

Best for

Production math/reasoning tasks (GSM8K, MATH) where you need proven RL algorithms at scale.

Inputs

· Base model
· Training dataset (parquet with prompts)
· Reward function or model
· Config YAML

Outputs

· Trained policy model
· Training curves/metrics

Requires

· verl>=0.3.0
· torch>=2.0.0
· ray>=2.41.0
· vllm>=0.8.2
· transformers>=4.40.0

Preconditions

8+ GPUs; H100 or A100 recommended; models must be HuggingFace compatible

Failure modes

· Memory explosion on 671B+ models
· Misaligned reward function
· Distributed training synchronization failures

Trust signals

· Powers Doubao-1.5-pro
· EuroSys 2025 HybridFlow paper
· Tested up to 671B parameters