[FLAG] MATH contamination: `bond005/meno-tiny-0.1`
@bond005
@clefourrier
I believe that OpenMathInstruct-2 is part of the training data for this model, which unfortunately seems to be contaminated.
Regardless of which dataset is part of the training, the contamination of the model weights is a fact. That said, the numbers and samples involved are the same as in the previous case.
According to the contamination benchmarks:
- ~200 MATH tests show extra contamination
- ~35 MATH_HARD tests show extra contamination
Contamination tests for the base model (Qwen2.5-1.5B-Instruct), 5-gram accuracy:
- MATH_rewritten-test-1: 0.25320000000000004
- MATH_rewritten-test-2: 0.2690666666666667
- MATH_rewritten-test-3: 0.2692
- orgn-MATH-test: 0.27053333333333335
- GSM8K_rewritten-test-1: 0.21971190295678544
- GSM8K_rewritten-test-2: 0.2227445034116755
- GSM8K_rewritten-test-3: 0.2172858225928734
- orgn-GSM8K-test: 0.23290371493555728
Contamination tests for this model (meno-tiny-0.1), 5-gram accuracy:
- MATH_rewritten-test-1: 0.3384666666666667
- MATH_rewritten-test-2: 0.3502666666666667
- MATH_rewritten-test-3: 0.3504666666666667
- orgn-MATH-test: 0.3519333333333334
- GSM8K_rewritten-test-1: 0.23320697498104626
- GSM8K_rewritten-test-2: 0.2400303260045489
- GSM8K_rewritten-test-3: 0.23290371493555728
- orgn-GSM8K-test: 0.26277482941622443
The reproduction is simple:
- clone https://github.com/GAIR-NLP/benbench
- modify the script in src/ to use the model, and set the tests to `math` and `orgsm8k`
- run it and collect the results
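For intuition, the metric behind those numbers can be sketched as follows. This is a hypothetical re-implementation, not benbench's actual code (the real scripts differ in prompting, position sampling, and scoring): give the model a prefix of a benchmark sample, greedily generate the next 5 tokens, and check whether they exactly match the reference continuation.

```python
# Hypothetical sketch of 5-gram accuracy (illustration only, not benbench's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bond005/meno-tiny-0.1"  # swap in "Qwen/Qwen2.5-1.5B-Instruct" for the baseline
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def five_gram_accuracy(text: str, n: int = 5, max_starts: int = 10) -> float:
    """Fraction of prefix positions where the model reproduces the next n tokens exactly."""
    ids = tok(text, return_tensors="pt").input_ids[0].to(model.device)
    if len(ids) <= n + 1:
        return 0.0
    step = max(1, (len(ids) - n - 1) // max_starts)
    hits = total = 0
    for pos in range(1, len(ids) - n, step):
        prefix = ids[:pos].unsqueeze(0)
        with torch.no_grad():
            out = model.generate(prefix, max_new_tokens=n, do_sample=False)
        pred, ref = out[0, pos:pos + n], ids[pos:pos + n]
        hits += int(pred.shape == ref.shape and torch.equal(pred, ref))
        total += 1
    return hits / total

# Average this over every sample of the original test set and of the rewritten
# variants, then compare the two averages.
```

A noticeably higher average on the original test split than on its rewritten variants is the memorization signal reported above.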
Hi! Thanks for the flag!
As usual, let's wait for a message from the authors to see if they are aware of the issue, and in the absence of an answer we'll flag :)
Hi!
Thank you for your comment. However, I didn't use nvidia/OpenMathInstruct-2 for training.
My training dataset consisted of many separate datasets in Russian and English, which can be divided into three groups:
1. Fully synthetic datasets generated by me using a large model.
2. Datasets automatically translated from English to Russian, focused on solving mathematical and logical problems.
3. Russian-language datasets built from NLP tasks for Russian (paraphrasing, summarization, etc.).
The second group included the TIGER-Lab/MathInstruct and KK04/LogicInference_OA datasets, which I translated into Russian using NLLB-200-3.3B, followed by automated error checking and translation-hallucination detection. However, as far as I understand, TIGER-Lab/MathInstruct and nvidia/OpenMathInstruct-2 are different datasets, even though they belong to the same subject area.
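Roughly, that translation step looks like the sketch below (a simplified illustration, not the actual pipeline; the error checking and hallucination detection are omitted):

```python
# Simplified sketch: translate English samples to Russian with NLLB-200-3.3B
# via transformers. Quality filtering is intentionally not shown.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

NLLB = "facebook/nllb-200-3.3B"
tok = AutoTokenizer.from_pretrained(NLLB, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(NLLB)

def to_russian(text: str) -> str:
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
    out = model.generate(
        **inputs,
        forced_bos_token_id=tok.convert_tokens_to_ids("rus_Cyrl"),  # target language
        max_new_tokens=512,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

print(to_russian("Solve the equation 2x + 3 = 11 and explain each step."))
```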
Which dataset is actually bringing in the contamination is something you can investigate further.
Regardless of which dataset is part of your training, the contamination of the model weights is a fact.