Distribute model training across clusters
ray-trainskillsetup L3★9,423
Orchestra-Research/AI-Research-SKILLs ↗What it does
Distribute training workloads across multi-GPU/multi-node clusters with fault tolerance
Best for
Running hyperparameter searches or long training runs across unstable/cloud clusters with automatic fault recovery.
Inputs
- · training script
- · cluster config (num_workers, GPUs per worker)
Outputs
- · distributed training run
- · fault-recovered checkpoint
Requires
- · ray
- · ray-tune
- · torch/tensorflow
Preconditions
Ray cluster started, training script compatible with distributed backend
Failure modes
- · worker crash without checkpoint
- · straggler nodes slow overall time
- · communication overhead dominates
Trust signals
- · fault tolerance mechanism explained
- · multi-backend support (torch/tf/jax)
- · cluster orchestration shown