Possible future contamination problem
If there are only a handful of example questions which can easily be answered by humans, and you have released them to the public, what is stopping ranking seekers from contaminating their models by manually writing answers to them and then training their models on that data?
Nothing.
However, manually answering the questions is (i) conceptually easy but extremely tedious; (ii) difficult to hide (we ask model owners to provide reasoning traces, scores might look suspicious, etc.); and (iii) not robust, since we plan to renew the test set in case of contamination.
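As a rough illustration of point (iii), here is a minimal sketch (not the leaderboard's actual code, all names and the threshold are made up) of how a renewed, never-released subset could be used to flag suspicious submissions: a model trained on hand-written answers to the public questions should not generalise to the renewed ones, so a large accuracy gap between the two is a signal worth investigating.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    model_name: str
    # question_id -> whether the model answered it correctly
    results: dict[str, bool]


def accuracy(results: dict[str, bool], question_ids: set[str]) -> float:
    scored = [results[q] for q in question_ids if q in results]
    return sum(scored) / len(scored) if scored else 0.0


def contamination_flag(
    submission: Submission,
    public_ids: set[str],    # questions whose answers have been public for a while
    renewed_ids: set[str],   # freshly written, never-released questions
    max_gap: float = 0.25,   # illustrative threshold, not a real policy value
) -> bool:
    """Return True if the public/renewed accuracy gap looks suspicious."""
    public_acc = accuracy(submission.results, public_ids)
    renewed_acc = accuracy(submission.results, renewed_ids)
    return (public_acc - renewed_acc) > max_gap


if __name__ == "__main__":
    sub = Submission(
        model_name="example-model",
        results={"q1": True, "q2": True, "q3": True, "q4": False, "q5": False},
    )
    # Perfect on the public questions, zero on the renewed ones -> flagged
    print(contamination_flag(sub, public_ids={"q1", "q2", "q3"}, renewed_ids={"q4", "q5"}))
```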
I was thinking that you would have a higher number of questions: you could mention that you only have 300 and even "leak" some questions, but strictly guard the other questions and answers and not even mention how many there are.
To complement @gregmialz 's very good answer: we actually need people to know what the questions from the test set are, so they can run their models on them and give us their answers :)