This is a collection of datasets for evaluating language models on the task of ablation planning in empirical AI research.
- ai-coscientist/researcher-ablation-bench
- ai-coscientist/reviewer-ablation-bench
- ai-coscientist/researcher-ablation-judge-eval
- ai-coscientist/reviewer-ablation-judge-eval