Train giant language models efficiently

training-llms-megatron  skill  setup  L49,423
Orchestra-Research/AI-Research-SKILLs
What it does

Trains LLMs from 2B to 462B parameters using distributed parallelism strategies (tensor, pipeline, and expert parallelism).

Best for

Training LLMs with more than 1B parameters when a single GPU is insufficient; the cited benchmark reaches 47% MFU (Model FLOPs Utilization) on H100.

Inputs
  • · model size (parameters)
  • · GPU count
  • · training hyperparameters (LR, batch, seq_len)
Outputs
  • · training script with tensor/pipeline/expert parallelism (TP/PP/EP) config
  • · throughput metrics (tokens/sec/GPU)
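
The throughput metric above can be converted into an MFU figure for comparison against the 47% benchmark. The following is a minimal sketch, not the skill's own calculation. It assumes the common 6 × N FLOPs-per-token approximation for dense transformers (forward plus backward, attention FLOPs ignored) and an assumed H100 dense BF16 peak of about 989 TFLOPS. The function name `estimate_mfu` and the constant are illustrative.

```python
# Rough MFU estimate from measured throughput.
# Assumptions (not from the skill itself): dense model, 6 * N FLOPs per token,
# and an H100 SXM dense BF16 peak of ~989 TFLOPS.

H100_BF16_DENSE_PEAK_FLOPS = 989e12  # assumed hardware peak, FLOPs/sec per GPU


def estimate_mfu(tokens_per_sec_per_gpu: float,
                 n_params: float,
                 peak_flops_per_gpu: float = H100_BF16_DENSE_PEAK_FLOPS) -> float:
    """Return achieved model FLOPs divided by peak FLOPs (a value in 0..1)."""
    flops_per_token = 6.0 * n_params  # forward + backward approximation
    achieved_flops = tokens_per_sec_per_gpu * flops_per_token
    return achieved_flops / peak_flops_per_gpu


if __name__ == "__main__":
    # Hypothetical measurement: a 70B model at 1000 tokens/sec/GPU.
    print(f"MFU ~ {estimate_mfu(1000, 70e9):.1%}")
```

For MoE models the parameter count should be the number of active parameters per token, so this sketch would overestimate MFU if given the total parameter count.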
Requires
  • · megatron-core
  • · torch
  • · apex
  • · transformer-engine
  • · docker/SLURM
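
Before launching a job, it can help to confirm that the Python packages listed above are importable in the target environment. The sketch below is one hedged way to do that. It assumes the usual import names (`megatron.core`, `torch`, `apex`, `transformer_engine`) and does not check versions or CUDA compatibility. The helper `missing_modules` is illustrative, not part of any of these libraries.

```python
import importlib.util
from typing import Iterable, List

# Assumed import names for the packages in the "Requires" list.
REQUIRED_MODULES = ["megatron.core", "torch", "apex", "transformer_engine"]


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED_MODULES)
    print("missing:", absent if absent else "none")
```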
Preconditions

NVIDIA GPUs (H100+), PyTorch installed, familiarity with distributed training

Failure modes
  • · OOM from an incorrect parallelism split
  • · low MFU when the configuration is left untuned
  • · inconsistent results when reduce_dtype is set incorrectly
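
Many bad parallelism splits can be caught with simple arithmetic before any GPU time is spent. The sketch below is a simplified pre-flight check, not Megatron's own validation. It assumes data-parallel size = GPU count / (TP × PP), that EP must divide the data-parallel size, and roughly 18 bytes per parameter for weights, gradients and Adam state in mixed precision with no optimizer sharding. Activation memory is ignored, and `ParallelPlan`, `check_plan` and `weight_state_gib_per_gpu` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParallelPlan:
    world_size: int  # total GPU count
    tp: int = 1      # tensor-parallel size
    pp: int = 1      # pipeline-parallel size
    ep: int = 1      # expert-parallel size


def check_plan(plan: ParallelPlan, num_layers: int, num_heads: int,
               global_batch: int, micro_batch: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    model_parallel = plan.tp * plan.pp
    if plan.world_size % model_parallel != 0:
        problems.append("world_size is not divisible by TP * PP")
        return problems  # data-parallel size is undefined, stop here
    dp = plan.world_size // model_parallel
    if num_heads % plan.tp != 0:
        problems.append("attention heads are not divisible by TP")
    if num_layers % plan.pp != 0:
        problems.append("layers are not divisible by PP")
    if dp % plan.ep != 0:
        problems.append("EP does not divide the data-parallel size (simplified rule)")
    if global_batch % (micro_batch * dp) != 0:
        problems.append("global batch is not divisible by micro_batch * DP")
    return problems


def weight_state_gib_per_gpu(n_params: float, plan: ParallelPlan,
                             bytes_per_param: float = 18.0) -> float:
    """Rough per-GPU memory for weights, grads and optimizer state (no activations)."""
    return n_params / (plan.tp * plan.pp) * bytes_per_param / 2**30


if __name__ == "__main__":
    plan = ParallelPlan(world_size=64, tp=8, pp=4, ep=2)
    print(check_plan(plan, num_layers=80, num_heads=64, global_batch=512, micro_batch=1))
    print(f"{weight_state_gib_per_gpu(70e9, plan):.1f} GiB per GPU before activations")
```

If the estimate is already close to the GPU's memory capacity, the split is likely to run out of memory once activations are added, so a larger TP or PP (or a sharded optimizer) would be worth trying first.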
Trust signals
  • · 47% MFU benchmark stated
  • · Nemotron/LLaMA/DeepSeek production use cited
  • · tensor/pipeline/expert parallelism examples