AIME2024 has 30 tests - can't score 80.96
#36 opened by fblgit
Hi there. I'm not sure how this AIME2024 evaluation was done, but AIME 2024 has 15 + 15 problems (AIME I and AIME II), so 30 in total.
R1 scored 79.8 (roughly 80), meaning 24 out of 30 problems were answered correctly.
If this model outperformed R1, it must have answered at least 25 questions correctly, which would score 83.3.
What kind of AIME2024 was used that can produce fractions of a question?
I refer to:
- https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
- https://artofproblemsolving.com/wiki/index.php/2024_AIME_I
- https://artofproblemsolving.com/wiki/index.php/2024_AIME_II
It also feels very strange to improve the result on this test while degrading on the majority of the known benchmarks.
How many answers did this model get right on AIME to score 80.96?
R1 got 24 answers right to reach its score.
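To make the arithmetic concrete, here is a quick sketch. It assumes each of the 30 problems is graded exactly once as right or wrong; the function names are my own, not taken from any evaluation harness.

```python
NUM_PROBLEMS = 30  # 15 from AIME I + 15 from AIME II


def achievable_scores(n=NUM_PROBLEMS):
    """Map each possible count of correct answers to its percentage score."""
    return {k: round(100 * k / n, 2) for k in range(n + 1)}


def correct_count_for(score, n=NUM_PROBLEMS, tol=0.05):
    """Return the number of correct answers that yields `score`
    (within `tol` percentage points), or None if no whole count does."""
    for k, s in achievable_scores(n).items():
        if abs(s - score) <= tol:
            return k
    return None


for reported in (80.0, 83.3, 80.96):
    print(reported, "->", correct_count_for(reported))
```

Under that assumption, each question is worth about 3.33 points, so 80.0 maps to 24 correct and 83.3 maps to 25, while 80.96 does not correspond to any whole number of correct answers.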