cyberneticlibrary

Benchmark code generation models

evaluating-code-models · skill · setup · L39,423
Orchestra-Research/AI-Research-SKILLs
What it does

Benchmarks code generation models on HumanEval, MBPP, and 15+ other benchmarks, reporting pass@k metrics.

Best for

Comparing code model performance across standard benchmarks when a new architecture or training method is being evaluated.

Inputs
  • · model identifier (HF model or API)
  • · benchmark suite (HumanEval, MBPP, MultiPL-E, etc.)
  • · number of samples k
Outputs
  • · pass@k score
  • · pass@1, pass@10, and pass@100
  • · per-problem failure analysis
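
The pass@k outputs listed above are commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The following is a minimal sketch of that estimator, not necessarily the code this skill runs; the function names `pass_at_k` and `mean_pass_at_k` are illustrative assumptions.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Because the estimator needs n >= k, reporting pass@100 requires at least 100 samples per problem, which is where most of the inference cost comes from.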
Requires
  • · evalplus
  • · humaneval
  • · transformers
Preconditions

Model weights are accessible or the API token is valid; benchmark data has been downloaded; the Python environment is set up.
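
As a rough illustration of the environment precondition, the sketch below checks whether a list of Python modules can be imported before an evaluation run starts. The helper name `missing_modules` is hypothetical, and the import names passed to it (for example `evalplus` and `transformers`) are assumptions that may differ from the package names used at install time.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust to the packages actually installed.
    missing = missing_modules(["evalplus", "transformers"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules found.")
```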

Failure modes

API rate limits are hit if k is too high; requests time out if model inference is slow; benchmark data is corrupted.
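
One common way to soften the rate-limit and timeout failures above is to retry each request with exponential backoff. The sketch below assumes a hypothetical zero-argument `call` that wraps a single generation request and raises `TimeoutError` or `ConnectionError` on transient failure; real API clients usually raise their own exception types, which would need to be passed in `retry_on`.

```python
import random
import time


def retry_with_backoff(
    call,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep=time.sleep,
):
    """Run call(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Double the delay each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in sync.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.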

Trust signals
  • · HumanEval and MBPP are standard code benchmarks
  • · pass@k fixes the sample budget at k, so scores cannot be inflated simply by spending more on inference
  • · MultiPL-E enables cross-language comparison