Scale LLM evaluation across backends

nemo-evaluator-sdkskillsetup L49,423
Orchestra-Research/AI-Research-SKILLs
What it does

Evaluate LLMs across 100+ benchmarks using Slurm/Docker/cloud infrastructure

Best for

Enterprise benchmarking of multiple models at scale when reproducible containerized evaluation is required.

Inputs
  • · OpenAI-compatible model endpoint
  • · task configuration list
  • · Slurm account/partition (HPC only)
Outputs
  • · benchmark results YAML
  • · aggregated metrics
  • · MLflow-exportable results
  • · comparison tables
Requires
  • · NGC API key
  • · Docker runtime
  • · Slurm scheduler (HPC)
  • · NVIDIA cloud or self-hosted
Preconditions

Model must serve OpenAI-compatible API; HPC cluster access for large-scale; GPU nodes configured

Failure modes
  • · Invalid NGC API key
  • · Model endpoint timeout
  • · Slurm job allocation failure
  • · Insufficient GPU VRAM per node
Trust signals
  • · NVIDIA official product
  • · 100+ benchmarks from 18+ harnesses
  • · Container-first reproducibility
  • · Slurm/HPC support built-in