Run ablation study on hardest problems
mhpp-10-ablation · workflow · setup · L3 · ★4
ejentum/benchmarks
What it does
Runs an ablation over the 10 hardest MHPP tasks under 3 conditions, using agentic harness calls.
Best for
Ablation studies where solver agents invoke external tools themselves (agentic-tool pattern, not pre-generation).
Inputs
- HuggingFace MHPP dataset
- 10 selected tasks
- 3 conditions: B/D/A
Outputs
- 30 solution codes
- per-condition pass rates
- results committed to GitHub
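The arithmetic behind these outputs (10 tasks × 3 conditions = 30 solutions, then one pass rate per condition) can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function names `build_runs` and `pass_rates` and the `(task_id, condition, passed)` tuple layout are assumptions made for the example. Only the condition labels B, D, and A come from the card.

```python
from collections import defaultdict
from itertools import product

# Condition labels as listed under Inputs; their meaning is not defined here.
CONDITIONS = ["B", "D", "A"]

def build_runs(task_ids, conditions=CONDITIONS):
    """Return one (task_id, condition) pair per solve agent (hypothetical helper)."""
    return list(product(task_ids, conditions))

def pass_rates(results):
    """Aggregate (task_id, condition, passed) tuples into {condition: pass rate}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for _task_id, condition, ok in results:
        total[condition] += 1
        passed[condition] += int(bool(ok))
    return {c: passed[c] / total[c] for c in total}
```

With 10 task IDs, `build_runs` yields 30 pairs, matching the 30 solution codes listed above.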
Requires
- HuggingFace datasets
- Ejentum /harness/ API
- gh CLI
- hidden test harness
Preconditions
The MHPP dataset is fetchable, the pre-registration is committed, and an Ejentum API key is available.
Failure modes
- Dataset fetch fails (the workflow tries 2 sources plus a web search)
- Harness call times out → the agent continues
- Hidden test fails → the task is quarantined
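The three failure modes above could be handled roughly as in the sketch below. It deliberately avoids any real dataset or Ejentum API call: the loaders, the harness call, and the hidden-test runner are passed in as plain callables, and all three function names are hypothetical. It also assumes one reading of the last item, namely that a hidden-test run which errors out triggers the quarantine; the card does not say whether an ordinary failing solution is treated the same way.

```python
def fetch_with_fallback(loaders):
    """Try each zero-argument loader in order and return the first result."""
    errors = []
    for load in loaders:
        try:
            return load()
        except Exception as exc:  # any failure moves on to the next source
            errors.append(repr(exc))
    raise RuntimeError("all dataset sources failed: " + "; ".join(errors))

def call_harness_or_continue(call, timeout_s=30.0):
    """Return the harness output, or None on timeout so the agent keeps going."""
    try:
        return call(timeout_s)
    except TimeoutError:
        return None

def score_or_quarantine(task_id, run_hidden_test, quarantined):
    """Return True/False from the hidden test, or None after quarantining the task.

    Assumption: quarantine is triggered when the hidden-test run itself raises.
    """
    try:
        return bool(run_hidden_test(task_id))
    except Exception:
        quarantined.add(task_id)
        return None
```

Returning `None` instead of raising keeps a single timed-out or quarantined run from stopping the remaining runs.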
Trust signals
- PRE_REGISTRATION.md is committed to the repo before the solve agents run
- Hidden tests enforce the protocol
- 30 parallel agents amortize harness cost
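One way to check the first trust signal is to compare the time PRE_REGISTRATION.md was first committed against the time the solve agents started. The sketch below assumes a local git checkout and that the start time is recorded somewhere as a Unix timestamp; both function names are hypothetical, and the card does not describe how the workflow itself verifies the ordering. Commit timestamps can be rewritten, so this is a consistency check rather than proof.

```python
import subprocess

def first_commit_time(path, repo_dir="."):
    """Unix time of the commit that first added `path`, or None if never committed."""
    out = subprocess.run(
        ["git", "log", "--diff-filter=A", "--format=%ct", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.split()
    # git log prints newest first, so the last entry is the earliest addition.
    return int(out[-1]) if out else None

def preregistered_before(prereg_time, solve_start_time):
    """True only if the pre-registration commit exists and predates the solve run."""
    return prereg_time is not None and prereg_time < solve_start_time
```

A caller would pass `first_commit_time("PRE_REGISTRATION.md")` as `prereg_time`.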