Pretrain LLMs at scale with 4D parallelism
distributed-llm-pretraining-torchtitan · skill · setup · L4 · ★ 9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Scale LLM pretraining to thousands of GPUs using the TorchTitan distributed training framework.
Best for
Large-scale pretraining where single-node limits are exceeded and distributed coordination is unavoidable.
Inputs
- · dataset path
- · model config
- · cluster spec (nodes/GPUs)
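The cluster spec constrains how a job can be split across GPUs. The following is a minimal sketch, assuming the four dimensions behind "4D parallelism" are data, tensor, pipeline, and context parallelism (the card does not name them). Under that assumption the product of the four degrees has to equal the total GPU count. `ClusterSpec`, `ParallelPlan`, and `check_plan` are hypothetical names used for illustration, not TorchTitan APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical cluster description: node count and GPUs per node."""
    nodes: int
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        return self.nodes * self.gpus_per_node


@dataclass(frozen=True)
class ParallelPlan:
    """Hypothetical degrees (assumed: data, tensor, pipeline, context)."""
    dp: int = 1
    tp: int = 1
    pp: int = 1
    cp: int = 1

    @property
    def total(self) -> int:
        return self.dp * self.tp * self.pp * self.cp


def check_plan(cluster: ClusterSpec, plan: ParallelPlan) -> None:
    """Raise ValueError if the plan does not use exactly the cluster's GPUs."""
    if min(plan.dp, plan.tp, plan.pp, plan.cp) < 1:
        raise ValueError("every parallel degree must be >= 1")
    if plan.total != cluster.world_size:
        raise ValueError(
            f"plan uses {plan.total} GPUs but cluster has {cluster.world_size}"
        )
    # Assumption: tensor parallelism stays inside one node, since it is
    # usually the most communication-heavy dimension.
    if plan.tp > cluster.gpus_per_node:
        raise ValueError("tensor-parallel degree exceeds GPUs per node")


cluster = ClusterSpec(nodes=16, gpus_per_node=8)  # 16 * 8 = 128 GPUs
plan = ParallelPlan(dp=8, tp=8, pp=2, cp=1)       # 8 * 8 * 2 * 1 = 128
check_plan(cluster, plan)                         # passes silently
```

Running a check like this before submitting a job catches mismatched degrees on a laptop rather than after the cluster allocation has started.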
Outputs
- · trained model checkpoint
- · training logs
- · convergence curves
Requires
- · torchtitan
- · torch
- · transformers
- · ray/torchrun
Preconditions
- · multi-GPU cluster
- · NVIDIA GPUs
- · dataset on shared storage
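These preconditions can be checked before a job is launched. The sketch below is one possible preflight check, not part of TorchTitan: `detect_gpu_count` and `preflight` are hypothetical helpers, and the two-GPU minimum is an assumed default. It only inspects the node it runs on, so confirming that the dataset really sits on shared storage would mean running it on every node.

```python
import os


def detect_gpu_count() -> int:
    """Return the number of visible CUDA devices, or 0 if torch is missing."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def preflight(dataset_path: str, gpu_count: int, min_gpus: int = 2) -> list[str]:
    """Collect readable problems; an empty list means the checks passed."""
    problems = []
    if gpu_count < min_gpus:
        problems.append(f"need at least {min_gpus} GPUs, found {gpu_count}")
    if not os.path.exists(dataset_path):
        problems.append(f"dataset path not found: {dataset_path}")
    elif not os.access(dataset_path, os.R_OK):
        problems.append(f"dataset path not readable: {dataset_path}")
    return problems
```

A typical call would be `preflight("/path/to/dataset", detect_gpu_count())`, where the path is a placeholder for the shared-storage mount.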
Failure modes
- · communication overhead when the network is saturated
- · stragglers when nodes are heterogeneous
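In synchronous training every step waits for the slowest rank, so a single slow node sets the pace for the whole job. As a rough sketch of how stragglers might be flagged, the function below compares each rank's step time with the median. `find_stragglers`, the 1.2 tolerance, and the sample timings are all illustrative assumptions; in practice the timings would come from the training logs.

```python
from statistics import median


def find_stragglers(step_times: dict[int, float], tolerance: float = 1.2) -> list[int]:
    """Return ranks whose step time exceeds tolerance * median, sorted by rank."""
    if not step_times:
        return []
    cutoff = tolerance * median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > cutoff)


# Hypothetical per-rank step times in seconds.
times = {0: 1.00, 1: 1.02, 2: 0.99, 3: 1.65}
print(find_stragglers(times))  # prints [3]
```

The median is used rather than the mean so that one very slow rank does not pull the cutoff upward and hide itself.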
Trust signals
- · Meta-tested on 2-4B parameter models
- · FSDP + pipeline parallelism
- · overlaps compute and communication