cyberneticlibrary

Scale LLM evaluation across backends

nemo-evaluator-sdkskillsetup L4★9,423

Orchestra-Research/AI-Research-SKILLs ↗

What it does

Evaluate LLMs across 100+ benchmarks using Slurm/Docker/cloud infrastructure

Best for

Enterprise benchmarking of multiple models at scale when reproducible containerized evaluation is required.

Inputs

· OpenAI-compatible model endpoint
· task configuration list
· Slurm account/partition (HPC only)

Outputs

· benchmark results YAML
· aggregated metrics
· MLflow-exportable results
· comparison tables

Requires

· NGC API key
· Docker runtime
· Slurm scheduler (HPC)
· NVIDIA cloud or self-hosted

Preconditions

Model must serve OpenAI-compatible API; HPC cluster access for large-scale; GPU nodes configured

Failure modes

· Invalid NGC API key
· Model endpoint timeout
· Slurm job allocation failure
· Insufficient GPU VRAM per node

Trust signals

· NVIDIA official product
· 100+ benchmarks from 18+ harnesses
· Container-first reproducibility
· Slurm/HPC support built-in