cyberneticlibrary

Scale neural network training

pytorch-lightning · skill · setup · L327,559
K-Dense-AI/scientific-agent-skills
What it does

Trains PyTorch neural networks with built-in distributed training and experiment tracking.

Best for

Scaling PyTorch training across multiple GPUs without rewriting the training-loop boilerplate.

Inputs
  • PyTorch model
  • DataLoader
  • loss function
  • optimizer
Outputs
  • trained model checkpoint
  • metrics logs
  • inference-ready weights
Requires
  • Python
  • PyTorch Lightning
  • PyTorch
  • optional: TensorBoard or Weights & Biases for experiment tracking
Preconditions
  • data wrapped in a PyTorch DataLoader (or a LightningDataModule)
  • model subclasses LightningModule and implements training_step and configure_optimizers
Failure modes
  • out-of-memory (OOM) errors during distributed training
  • learning rate set too high (divergence) or too low (slow convergence)
Trust signals
  • automatic mixed precision (AMP)
  • checkpoint/resume
  • gradient accumulation