cyberneticlibrary

Run ablation study on hardest problems

mhpp-10-ablation · workflow · setup · L34
ejentum/benchmarks
What it does

Runs an ablation over the 10 hardest MHPP tasks under 3 conditions, with solver agents making the harness calls themselves.

Best for

Ablation studies where solver agents invoke external tools themselves (agentic-tool pattern, not pre-generation).

Inputs
  • Hugging Face MHPP dataset
  • 10 selected tasks
  • 3 conditions: B/D/A
Outputs
  • 30 solution programs (10 tasks × 3 conditions)
  • Per-condition pass rates
  • Results committed to GitHub
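
The per-condition pass rates can be derived from the 30 graded solutions with a simple tally. The following is a minimal sketch, not the workflow's actual scoring code: the `(task_id, condition, passed)` record shape and the function name `pass_rates` are assumptions made for illustration.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {condition: fraction of tasks passed}.

    `results` is an iterable of (task_id, condition, passed) tuples.
    This record shape is hypothetical; adapt it to the real harness output.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for _task_id, condition, passed in results:
        totals[condition] += 1
        passes[condition] += int(bool(passed))
    return {cond: passes[cond] / totals[cond] for cond in totals}


# Illustrative records only, not real benchmark results.
example = [
    ("task_1", "B", True),
    ("task_1", "D", False),
    ("task_2", "B", False),
    ("task_2", "D", False),
]
rates = pass_rates(example)
```

With 10 tasks per condition, each rate is a multiple of 0.1, so small differences between conditions should be read with that granularity in mind.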
Requires
  • Hugging Face datasets
  • Ejentum /harness/ API
  • gh CLI
  • Hidden test harness
Preconditions

MHPP dataset is fetchable; pre-registration is committed; Ejentum API key is available

Failure modes
  • Dataset fetch fails → tries 2 sources plus web search
  • Harness call times out → agent continues
  • Hidden test fails → quarantine task
Trust signals
  • PRE_REGISTRATION.md committed to repo before solve agents run
  • Hidden tests enforce protocol
  • 30 parallel agents amortize harness cost
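
The figure of 30 agents follows from crossing 10 tasks with 3 conditions. A minimal sketch of that fan-out is shown below; the `solve` callable, the task identifiers, and the thread-pool approach are assumptions for illustration, since the source does not say how the agents are actually launched.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

CONDITIONS = ("B", "D", "A")  # condition labels as listed under Inputs


def build_jobs(task_ids, conditions=CONDITIONS):
    """Cross every task with every condition: 10 tasks x 3 = 30 jobs."""
    return list(product(task_ids, conditions))


def run_all(jobs, solve, max_workers=30):
    """Run solve(task_id, condition) for each job in parallel.

    `solve` is a hypothetical stand-in for launching one solver agent.
    Returns {(task_id, condition): solver output}.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda job: solve(*job), jobs))
    return dict(zip(jobs, outputs))


# Placeholder task IDs and a dummy solver, for demonstration only.
jobs = build_jobs([f"task_{i}" for i in range(10)])
outputs = run_all(jobs, lambda task_id, cond: f"{task_id}:{cond}")
```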