Distribute large model training across GPUs

pytorch-fsdp2 | skill | setup | L39,423
Orchestra-Research/AI-Research-SKILLs
What it does

Add fully_shard parameter distribution to PyTorch training

Best for

Sharding large models across GPUs using DTensor-based parameters, with simpler checkpoint semantics than FSDP1.

Inputs
  • PyTorch model
  • target GPU count
  • mixed precision policy config
Outputs
  • FSDP2-wrapped model
  • DTensor-aware optimizer
  • DCP checkpoint helper
Requires
  • torch (2.4+)
  • Distributed Checkpoint (DCP)
Preconditions

PyTorch 2.4+, torchrun available, model can be constructed on the meta device

Failure modes
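  • sharding top-down only (fully_shard must be applied bottom-up: submodules first, then the root)
  • optimizer created before sharding (it must be built after fully_shard so it holds the DTensor parameters)
  • naïve state_dict() handling loses DTensor structure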
Trust signals
  • User contract formalized (5 rules)
  • bottom-up sharding pattern documented
  • DeviceMesh integration shown