Benchmark code generation models

evaluating-code-models · skill · setup · L3 · ★9,423

What it does

Benchmarks code generation models on HumanEval, MBPP, and 15+ other benchmarks, reporting pass@k metrics

Best for

Comparing code model performance across standard benchmarks when a new architecture or training method is being evaluated

Inputs

Outputs

Requires

Preconditions

Model weights accessible or API token valid; benchmark data downloaded; Python environment set up
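These preconditions can be checked before a run starts. The following is a minimal sketch, not part of the skill itself: the environment variable name MODEL_API_TOKEN, the function name, and the minimum Python version are all assumptions made for illustration. It only checks that a token is present, not that the API accepts it.

```python
import os
import sys
from pathlib import Path


def check_preconditions(weights_dir=None, token_env="MODEL_API_TOKEN",
                        data_files=(), min_python=(3, 8), env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # Either local weights or an API token must be available.
    has_weights = weights_dir is not None and Path(weights_dir).is_dir()
    has_token = bool(env.get(token_env))
    if not (has_weights or has_token):
        problems.append(f"no model weights directory and {token_env} is not set")

    # Each benchmark file must exist and be non-empty.
    for name in data_files:
        path = Path(name)
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(f"benchmark data missing or empty: {name}")

    # The interpreter must be recent enough (the threshold is an assumption).
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")

    return problems
```

Calling `check_preconditions(data_files=["HumanEval.jsonl"])` (a hypothetical file name) returns every problem found rather than stopping at the first, so all of them can be fixed in one pass.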

Failure modes

API rate limiting when k is high (pass@k needs at least k samples per problem); timeouts when model inference is slow; corrupted benchmark data
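Rate limiting is usually handled by retrying with exponential backoff. The sketch below assumes the API client signals rate limiting with an exception; RateLimitError is a hypothetical stand-in for whatever the real client raises, and the retry counts and delays are illustrative defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit (HTTP 429) error."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with jittered exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            # Out of retries: let the caller see the error.
            if attempt == max_retries:
                raise
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out when several workers share a limit.
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists so the waiting behaviour can be replaced in tests. Slow inference needs a separate per-request timeout, which this sketch does not cover.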

Trust signals