Scale LLM evaluation across backends
nemo-evaluator-sdkskillsetup L4★9,423
Orchestra-Research/AI-Research-SKILLs ↗What it does
Evaluate LLMs across 100+ benchmarks using Slurm/Docker/cloud infrastructure
Best for
Enterprise benchmarking of multiple models at scale when reproducible containerized evaluation is required.
Inputs
- · OpenAI-compatible model endpoint
- · task configuration list
- · Slurm account/partition (HPC only)
Outputs
- · benchmark results YAML
- · aggregated metrics
- · MLflow-exportable results
- · comparison tables
Requires
- · NGC API key
- · Docker runtime
- · Slurm scheduler (HPC)
- · NVIDIA cloud or self-hosted
Preconditions
Model must serve OpenAI-compatible API; HPC cluster access for large-scale; GPU nodes configured
Failure modes
- · Invalid NGC API key
- · Model endpoint timeout
- · Slurm job allocation failure
- · Insufficient GPU VRAM per node
Trust signals
- · NVIDIA official product
- · 100+ benchmarks from 18+ harnesses
- · Container-first reproducibility
- · Slurm/HPC support built-in