Distribute large model training across GPUs
pytorch-fsdp2 · skill · setup · L3 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Adds fully_shard (FSDP2) parameter sharding to a PyTorch training script.
Best for
Sharding large models across GPUs using DTensor-based parameter sharding, with simpler checkpoint semantics than FSDP1.
Inputs
- PyTorch model
- target GPU count
- mixed precision policy config
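The mixed precision input could be a small config object that is converted into FSDP2's MixedPrecisionPolicy at wrap time. The sketch below assumes that shape: PrecisionConfig, its field names, and the short dtype aliases are hypothetical. Only MixedPrecisionPolicy and its param_dtype and reduce_dtype arguments come from PyTorch.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical short aliases mapped to torch dtype attribute names.
_DTYPE_NAMES = {"bf16": "bfloat16", "fp16": "float16", "fp32": "float32"}


@dataclass(frozen=True)
class PrecisionConfig:
    """Hypothetical config: dtype for compute parameters and for gradient reduction."""
    param_dtype: str = "bf16"
    reduce_dtype: str = "fp32"

    def dtype_names(self) -> Tuple[str, str]:
        try:
            return _DTYPE_NAMES[self.param_dtype], _DTYPE_NAMES[self.reduce_dtype]
        except KeyError as err:
            raise ValueError(f"unsupported dtype alias: {err.args[0]}") from None


def to_mixed_precision_policy(cfg: PrecisionConfig):
    # torch is imported lazily so the config can be validated without it.
    import torch
    try:
        from torch.distributed.fsdp import MixedPrecisionPolicy  # PyTorch 2.6+
    except ImportError:
        from torch.distributed._composable.fsdp import MixedPrecisionPolicy  # 2.4 / 2.5
    param_name, reduce_name = cfg.dtype_names()
    return MixedPrecisionPolicy(
        param_dtype=getattr(torch, param_name),
        reduce_dtype=getattr(torch, reduce_name),
    )
```

The resulting policy would be passed as mp_policy= to each fully_shard call. Keeping the reduction dtype at fp32 while computing in bf16 is a common choice, not a requirement.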
Outputs
- FSDP2-wrapped model
- DTensor-aware optimizer
- DCP checkpoint helper
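A DCP checkpoint helper of the kind listed above might look like the following sketch. It assumes one checkpoint directory per step. The checkpoint_dir layout and the function names are hypothetical, while get_state_dict, set_state_dict, dcp.save, and dcp.load are the PyTorch Distributed Checkpoint calls.

```python
import os


def checkpoint_dir(base: str, step: int) -> str:
    """Hypothetical layout: one DCP directory per training step."""
    return os.path.join(base, f"step_{step:08d}")


def save_checkpoint(model, optimizer, base: str, step: int) -> str:
    # Imported lazily so checkpoint_dir stays usable without torch installed.
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import get_state_dict

    # Returns model and optimizer state dicts that preserve sharding information.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    path = checkpoint_dir(base, step)
    dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id=path)
    return path


def load_checkpoint(model, optimizer, base: str, step: int) -> None:
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

    # DCP loads in place, so it needs state dicts with the target layout first.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    dcp.load(state, checkpoint_id=checkpoint_dir(base, step))
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state["model"],
        optim_state_dict=state["optim"],
    )
```

Both functions are meant to be called on every rank, since each rank reads and writes only its own shards.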
Requires
- torch (2.4+)
- Distributed Checkpoint (DCP, torch.distributed.checkpoint)
Preconditions
PyTorch 2.4+, torchrun available, model can be constructed on the meta device
Failure modes
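- sharding top-down (root module first) instead of bottom-up (submodules first, root last)
- optimizer created before sharding, so it holds the original parameters rather than the sharded DTensors
- naïve state_dict() loses DTensor structure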
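The first two failure modes are about ordering, and a minimal sketch of the intended order is shown below. The helper names shard_bottom_up and build_model_and_optimizer and the toy nn.Sequential model are hypothetical. The sketch also assumes a single node with CUDA and NCCL, launched with torchrun.

```python
import os
from typing import Callable, Iterable


def shard_bottom_up(root, blocks: Iterable, shard_fn: Callable, **shard_kwargs):
    """Call shard_fn on each block first, then on the root module last."""
    for block in blocks:
        shard_fn(block, **shard_kwargs)
    shard_fn(root, **shard_kwargs)  # root last: groups any remaining parameters
    return root


def build_model_and_optimizer(lr: float = 1e-3):
    """Intended to run on every rank under torchrun (needs CUDA and NCCL)."""
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    try:
        from torch.distributed.fsdp import fully_shard  # PyTorch 2.6+
    except ImportError:
        from torch.distributed._composable.fsdp import fully_shard  # 2.4 / 2.5

    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun
    dist.init_process_group("nccl")
    # Toy stand-in model; a real model would pass its transformer blocks.
    model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).cuda()
    shard_bottom_up(model, list(model.children()), fully_shard)
    # Created after sharding so the optimizer tracks the DTensor parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    return model, optimizer
```

Passing the sharding function in as an argument keeps the ordering logic separate from the distributed setup, so it can be checked with a stub.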
Trust signals
- User contract formalized (5 rules)
- bottom-up sharding pattern documented
- DeviceMesh integration shown
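DeviceMesh integration, together with the target GPU count input, could be sketched as follows. This assumes a hybrid layout in which parameters are sharded within a group of GPUs and replicated across groups. hsdp_mesh_shape, shard_with_mesh, and the mesh dimension names are hypothetical, while init_device_mesh and the mesh= argument of fully_shard are PyTorch APIs.

```python
from typing import Tuple


def hsdp_mesh_shape(world_size: int, shard_size: int) -> Tuple[int, int]:
    """Return (replicate, shard) sizes; shard_size is the GPUs per sharding group."""
    if shard_size <= 0 or world_size % shard_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by shard_size={shard_size}"
        )
    return world_size // shard_size, shard_size


def shard_with_mesh(model, world_size: int, shard_size: int):
    """Intended to run on every rank under torchrun."""
    from torch.distributed.device_mesh import init_device_mesh
    try:
        from torch.distributed.fsdp import fully_shard  # PyTorch 2.6+
    except ImportError:
        from torch.distributed._composable.fsdp import fully_shard  # 2.4 / 2.5

    # With a 2D mesh, fully_shard replicates over dim 0 and shards over dim 1.
    mesh = init_device_mesh(
        "cuda",
        hsdp_mesh_shape(world_size, shard_size),
        mesh_dim_names=("replicate", "shard"),
    )
    for block in model.children():  # submodules first
        fully_shard(block, mesh=mesh)
    fully_shard(model, mesh=mesh)  # root last
    return mesh
```

For plain sharding across all GPUs, a 1D mesh of shape (world_size,) would be passed instead.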