Benchmark code generation models
evaluating-code-models · skill · setup L3 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Benchmark code generation models across HumanEval, MBPP, and 15+ benchmarks with pass@k metrics
Best for
Comparing code model performance across standard benchmarks when evaluating a new architecture or training method
Inputs
- · model identifier (HF model or API)
- · benchmark suite (HumanEval, MBPP, MultiPL-E, etc.)
- · number of samples k
Outputs
- · pass@k score
- · pass@1/10/100
- · per-problem failure analysis
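As a rough illustration of how these outputs relate, the sketch below computes the standard unbiased pass@k estimate, 1 - C(n-c, k) / C(n, k), for a problem with n samples of which c pass the tests. It then averages over problems and lists the problems with no passing sample. The function names `pass_at_k` and `summarize`, the `(n_samples, n_correct)` input layout, and the problem ids are assumptions made for this example. They are not the interface of evalplus or any other listed package.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(results: dict[str, tuple[int, int]], ks=(1, 10, 100)):
    """results maps a (hypothetical) problem id to (n_samples, n_correct)."""
    scores = {}
    for k in ks:
        # Report pass@k only when every problem has at least k samples.
        if results and all(n >= k for n, _ in results.values()):
            scores[f"pass@{k}"] = mean(pass_at_k(n, c, k) for n, c in results.values())
    # Minimal failure analysis: problems where no sample passed.
    unsolved = sorted(pid for pid, (_, c) in results.items() if c == 0)
    return scores, unsolved


if __name__ == "__main__":
    demo = {"task/0": (10, 5), "task/1": (10, 0)}
    print(summarize(demo))
```

With only 10 samples per problem, pass@100 is left out of the report. Estimating it would need at least 100 samples per problem.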
Requires
- · evalplus
- · humaneval
- · transformers
Preconditions
Model weights accessible or API token valid; benchmark data downloaded; Python environment set up
Failure modes
API rate limits when k is too high; timeouts when model inference is slow; corrupted benchmark data
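One common way to soften the rate-limit and transient-timeout cases is to retry with exponential backoff. The helper below is a minimal sketch of that idea. `call_with_backoff` and its parameters are hypothetical names introduced here, and the exception types worth retrying depend on the API client in use.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt, then retry.

    Re-raises the last error once max_retries retries have been used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_request():
        # Stand-in for a sampling request that is rate limited twice.
        calls["count"] += 1
        if calls["count"] < 3:
            raise RuntimeError("rate limited")
        return "completion"

    print(call_with_backoff(flaky_request, base_delay=0.01, retry_on=(RuntimeError,)))
```

Passing a narrow `retry_on` tuple keeps real failures, such as an invalid token, from being retried silently.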
Trust signals
- · HumanEval and MBPP are standard code benchmarks
- · pass@k reports success at a fixed sample budget k, which keeps comparisons at equal inference cost
- · MultiPL-E enables cross-language comparison