Train giant language models efficiently
training-llms-megatron · skill · setup · L4 · ★ 9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Train LLMs from 2B to 462B parameters using distributed parallelism strategies (tensor, pipeline, and expert parallelism).
Best for
Training LLMs with more than 1B parameters when a single GPU is insufficient; the skill reports 47% MFU (model FLOPs utilization) on H100.
Inputs
- model size (parameters)
- GPU count
- training hyperparameters (LR, batch, seq_len)
Outputs
- training script with TP/PP/EP config
- throughput metrics (tokens/sec/GPU)
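The inputs (GPU count) and the TP/PP part of the output config are tied together by a simple divisibility rule: the GPUs left over after tensor and pipeline parallelism form the data-parallel replicas. The sketch below illustrates that arithmetic only. The function name `data_parallel_size` is hypothetical and not part of megatron-core, and it ignores other dimensions such as expert or context parallelism.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Return the data-parallel size implied by a TP x PP layout.

    Hypothetical helper for illustration; assumes only tensor (tp) and
    pipeline (pp) parallelism are in play.
    """
    if world_size < 1 or tp < 1 or pp < 1:
        raise ValueError("world_size, tp and pp must all be positive")
    model_parallel = tp * pp  # GPUs needed to hold one model replica
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 64 GPUs with 8-way tensor and 4-way pipeline parallelism.
    print(data_parallel_size(64, tp=8, pp=4))
```

Checking the layout this way before launching a job catches splits that cannot be mapped onto the available GPUs. It does not predict whether the chosen split fits in memory.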
Requires
- megatron-core
- torch
- apex
- transformer-engine
- docker/SLURM
Preconditions
NVIDIA GPUs (H100+), PyTorch installed, familiarity with distributed training
Failure modes
- OOM from incorrect parallelism split
- low MFU if not tuned
- inconsistent results with wrong reduce_dtype
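To judge whether a run suffers from low MFU, the tokens/sec/GPU figure can be converted into a rough utilization estimate. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer (forward plus backward, ignoring attention FLOPs). The function name `estimate_mfu` and the peak figure of 989 TFLOPS (roughly the dense BF16 peak of an H100 SXM) are assumptions made for illustration, not values taken from the skill.

```python
def estimate_mfu(
    params: float,
    tokens_per_sec_per_gpu: float,
    peak_flops_per_gpu: float = 989e12,  # assumed H100 SXM dense BF16 peak
) -> float:
    """Rough model FLOPs utilization for a dense transformer.

    Hypothetical helper: uses the 6 * params FLOPs-per-token approximation,
    which ignores attention FLOPs and activation recomputation.
    """
    if params <= 0 or tokens_per_sec_per_gpu < 0 or peak_flops_per_gpu <= 0:
        raise ValueError("inputs must be positive")
    achieved_flops = 6.0 * params * tokens_per_sec_per_gpu
    return achieved_flops / peak_flops_per_gpu


if __name__ == "__main__":
    # Example: a 70B-parameter model at 1,000 tokens/sec/GPU.
    print(f"MFU ~ {estimate_mfu(70e9, 1000):.1%}")
```

For mixture-of-experts models the parameter count should be the number of parameters active per token, otherwise the estimate overstates utilization.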
Trust signals
- 47% MFU benchmark stated
- Nemotron/LLaMA/DeepSeek production use cited
- tensor/pipeline/expert parallelism examples