Distribute large model training across GPUs

pytorch-fsdp2 · skill · setup · L3 · ★ 9,423

Orchestra-Research/AI-Research-SKILLs

What it does

Adds fully_shard (FSDP2) parameter sharding to a PyTorch training script.

Best for

Sharding large models across GPUs with DTensor-based per-parameter sharding and simpler checkpoint semantics than FSDP1.

Inputs

· PyTorch model
· target GPU count
· mixed precision policy config

Outputs

· FSDP2-wrapped model
· DTensor-aware optimizer
· DCP checkpoint helper
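
A DCP checkpoint helper of the kind listed above might look like the following sketch. It assumes the process group is already initialized and the model is already wrapped with fully_shard. The names step_dir, save_checkpoint and load_checkpoint, and the one-directory-per-step layout, are hypothetical and not taken from the skill itself.

```python
import os


def step_dir(root: str, step: int) -> str:
    """Hypothetical layout: one checkpoint directory per training step."""
    return os.path.join(root, f"step-{step:08d}")


def save_checkpoint(model, optimizer, root: str, step: int) -> str:
    # torch is imported lazily so step_dir stays usable without it.
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import get_state_dict

    # get_state_dict returns state dicts that keep the sharded (DTensor) layout.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    path = step_dir(root, step)
    dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id=path)
    return path


def load_checkpoint(model, optimizer, root: str, step: int) -> None:
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import (
        get_state_dict,
        set_state_dict,
    )

    # DCP loads in place, so start from the current sharded state dicts.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    dcp.load(state, checkpoint_id=step_dir(root, step))
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state["model"],
        optim_state_dict=state["optim"],
    )
```

Every rank is expected to call save_checkpoint and load_checkpoint with the same root and step, since DCP saves and loads collectively.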

Requires

· torch (2.4+)
· Distributed Checkpoint (DCP)

Preconditions

PyTorch 2.4+, torchrun available, model can be constructed on the meta device

Failure modes

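· sharding only the root module (top-down) instead of sharding submodules first (bottom-up)
· optimizer created before sharding (it must be built from the sharded DTensor parameters)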
· naïve state_dict() loses DTensor structure
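
The first two failure modes come down to ordering, which can be sketched as follows. This is a minimal illustration, not the skill's own code: shard_bottom_up and build_sharded are hypothetical helpers, blocks is assumed to be the list of submodules to shard individually (for example transformer layers), and the import fallback assumes fully_shard is exposed under torch.distributed.fsdp in newer releases and under torch.distributed._composable.fsdp in PyTorch 2.4 and 2.5.

```python
def shard_bottom_up(root, blocks, shard_fn, **shard_kwargs):
    """Apply shard_fn to every block first, then to the root module."""
    for block in blocks:
        shard_fn(block, **shard_kwargs)
    # The root call picks up any parameters not owned by a sharded block.
    shard_fn(root, **shard_kwargs)
    return root


def build_sharded(model, blocks, lr=1e-3, **shard_kwargs):
    """Shard bottom-up, then create the optimizer (hypothetical helper)."""
    import torch

    try:
        from torch.distributed.fsdp import fully_shard  # newer releases
    except ImportError:
        # Assumed location in PyTorch 2.4 and 2.5.
        from torch.distributed._composable.fsdp import fully_shard

    shard_bottom_up(model, blocks, fully_shard, **shard_kwargs)

    # Built after sharding so it references the DTensor parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    return model, optimizer
```

In a script launched with torchrun, and after the process group is initialized, a call could look like build_sharded(model, list(model.layers)), where model.layers is a hypothetical attribute holding the blocks. Keyword arguments such as mesh or mp_policy are forwarded unchanged to fully_shard.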

Trust signals

· User contract formalized (5 rules)
· bottom-up sharding pattern documented
· DeviceMesh integration shown