Pretrain LLMs at scale with 4D parallelism

distributed-llm-pretraining-torchtitan · skill · setup · L49,423
Orchestra-Research/AI-Research-SKILLs
What it does

Scales LLM pretraining to thousands of GPUs using the TorchTitan distributed training framework.

Best for

Large-scale pretraining where single-node limits are exceeded and distributed coordination is unavoidable.

Inputs
  • · dataset path
  • · model config
  • · cluster spec (nodes/GPUs)
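A minimal sketch of how a cluster spec could map onto 4D parallelism, assuming the four dimensions are data, tensor, pipeline, and context parallelism and that their degrees multiply to the total GPU count. `ClusterSpec` and `data_parallel_degree` are hypothetical names for illustration, not TorchTitan APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical cluster description: node count and GPUs per node."""
    nodes: int
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        # Total number of GPUs (one training process per GPU).
        return self.nodes * self.gpus_per_node


def data_parallel_degree(spec: ClusterSpec, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel degree left over once tp, pp, and cp are fixed.

    Raises ValueError if the chosen degrees do not evenly divide the world size.
    """
    if min(tp, pp, cp) < 1:
        raise ValueError("parallel degrees must be >= 1")
    model_parallel = tp * pp * cp
    if spec.world_size % model_parallel != 0:
        raise ValueError(
            f"tp*pp*cp={model_parallel} does not divide world size {spec.world_size}"
        )
    return spec.world_size // model_parallel


spec = ClusterSpec(nodes=16, gpus_per_node=8)       # 128 GPUs in total
print(data_parallel_degree(spec, tp=8, pp=2, cp=1))  # 128 / (8 * 2 * 1) = 8
```

Checking this product before launch catches a mismatched config early, before any GPU time is spent.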
Outputs
  • · trained model checkpoint
  • · training logs
  • · convergence curves
Requires
  • · torchtitan
  • · torch
  • · transformers
  • · ray/torchrun
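A sketch of assembling a multi-node `torchrun` launch command from a cluster spec. The flags shown are standard `torchrun` options; the script name `train.py`, the endpoint `head-node:29500`, and the helper `build_torchrun_cmd` are placeholders for illustration, not values taken from this skill.

```python
import shlex
from typing import List, Sequence


def build_torchrun_cmd(
    nodes: int,
    gpus_per_node: int,
    rdzv_endpoint: str,
    script: str,
    script_args: Sequence[str] = (),
) -> List[str]:
    """Build the argv list for a torchrun launch (run once on every node)."""
    return [
        "torchrun",
        f"--nnodes={nodes}",
        f"--nproc_per_node={gpus_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
        *script_args,
    ]


# Placeholder values: 4 nodes with 8 GPUs each, rendezvous on a hypothetical head node.
cmd = build_torchrun_cmd(4, 8, "head-node:29500", "train.py")
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.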
Preconditions
  • · multi-GPU cluster
  • · NVIDIA GPUs
  • · dataset on shared storage
Failure modes
  • · communication overhead when the network is saturated
  • · stragglers when nodes are heterogeneous
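In synchronous training every step waits for the slowest rank, so a single slow node sets the pace for the whole job. A minimal sketch of flagging stragglers from per-rank step times, assuming those timings have already been collected; `find_stragglers` and the 1.2x tolerance are illustrative choices, not part of TorchTitan.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(step_times: Dict[int, float], tolerance: float = 1.2) -> List[int]:
    """Return ranks whose step time exceeds tolerance * median step time.

    step_times maps rank -> seconds for the same training step.
    """
    if not step_times:
        return []
    cutoff = tolerance * median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > cutoff)


# Hypothetical timings: rank 3 is noticeably slower than its peers.
times = {0: 1.00, 1: 1.02, 2: 0.99, 3: 1.45}
print(find_stragglers(times))  # [3]
```

Comparing against the median rather than the mean keeps one very slow rank from shifting the cutoff.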
Trust signals
  • · Meta-tested on 2-4B parameter models
  • · FSDP + pipeline parallelism
  • · overlaps compute and communication
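A back-of-the-envelope sketch of why overlap matters: communication that runs concurrently with compute no longer adds to step time. The model below is a simplifying assumption (a single compute phase and a single communication phase per step), and `step_time` is a hypothetical helper.

```python
def step_time(compute_s: float, comm_s: float, overlap_fraction: float) -> float:
    """Estimate step time when a fraction of communication is hidden behind compute.

    overlap_fraction = 0.0 means fully serialized; 1.0 means fully overlapped.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    # Communication can only be hidden for as long as compute is running.
    hidden = min(comm_s * overlap_fraction, compute_s)
    return compute_s + (comm_s - hidden)


# Hypothetical step: 2.0 s of compute and 0.5 s of communication.
print(step_time(2.0, 0.5, 0.0))  # 2.5  (no overlap)
print(step_time(2.0, 0.5, 0.5))  # 2.25 (half of the communication hidden)
print(step_time(2.0, 0.5, 1.0))  # 2.0  (communication fully hidden)
```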