contamination...OpenMathInstruct-2

#1
by fblgit

@bond005
I believe that OpenMathInstruct-2 is part of the training data for this model, and that dataset unfortunately seems to be contaminated.

Regardless of which dataset is part of your training, the contamination of the model weights is a fact. TBH, the numbers and samples involved are the same as in the previous case.

According to the contamination benchmarks:

  • ~200 MATH test samples show extra contamination (beyond the base model)
  • ~35 MATH_HARD test samples show extra contamination (see the counting sketch below)
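
As a rough illustration of how such per-sample counts could be derived (the exact procedure isn't spelled out in this thread): a minimal sketch, assuming per-sample 5-gram accuracies have already been computed for both models. The function name and the 0.05 threshold are hypothetical.

```python
# Hypothetical sketch: count test samples whose 5-gram accuracy rose after
# fine-tuning. `base_scores` and `tuned_scores` are assumed to hold
# per-sample 5-gram accuracies for the base and the fine-tuned model;
# the 0.05 threshold is illustrative, not taken from the thread.

def count_extra_contaminated(base_scores: list[float],
                             tuned_scores: list[float],
                             threshold: float = 0.05) -> int:
    """Number of samples whose score increased by more than `threshold`."""
    return sum(
        1
        for base, tuned in zip(base_scores, tuned_scores, strict=True)
        if tuned - base > threshold
    )
```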

Contamination tests for the base model (Qwen2.5-1.5B-Instruct):

MATH_rewritten-test-1 5_gram_accuracy: 0.25320000000000004
MATH_rewritten-test-2 5_gram_accuracy: 0.2690666666666667
MATH_rewritten-test-3 5_gram_accuracy: 0.2692
orgn-MATH-test 5_gram_accuracy: 0.27053333333333335
GSM8K_rewritten-test-1 5_gram_accuracy: 0.21971190295678544
GSM8K_rewritten-test-2 5_gram_accuracy: 0.2227445034116755
GSM8K_rewritten-test-3 5_gram_accuracy: 0.2172858225928734
orgn-GSM8K-test 5_gram_accuracy: 0.23290371493555728

Contamination tests for this model (meno-tiny-0.1):

MATH_rewritten-test-1 5_gram_accuracy: 0.3384666666666667
MATH_rewritten-test-2 5_gram_accuracy: 0.3502666666666667
MATH_rewritten-test-3 5_gram_accuracy: 0.3504666666666667
orgn-MATH-test 5_gram_accuracy: 0.3519333333333334
GSM8K_rewritten-test-1 5_gram_accuracy: 0.23320697498104626
GSM8K_rewritten-test-2 5_gram_accuracy: 0.2400303260045489
GSM8K_rewritten-test-3 5_gram_accuracy: 0.23290371493555728
orgn-GSM8K-test 5_gram_accuracy: 0.26277482941622443

All of these scores are noticeably higher than the base model's on both the original and the rewritten test sets.

The reproduction is simple:
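
The exact reproduction steps are not included here. What follows is a minimal sketch of the kind of 5-gram accuracy probe whose output is shown above (illustrative, not the exact tool that produced those logs): the model is given a token prefix from each test sample and must greedily reproduce the next five reference tokens. The model choice, probe count, and function name are assumptions.

```python
# Illustrative 5-gram accuracy probe. For each probed position the model
# receives the preceding tokens and must greedily reproduce the next five
# reference tokens exactly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # swap in the model under test
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def five_gram_accuracy(text: str, n: int = 5, probes: int = 10) -> float:
    """Fraction of probed positions where the next `n` reference tokens
    are reproduced exactly under greedy decoding."""
    ids = tok(text, return_tensors="pt").input_ids[0].to(model.device)
    hits = total = 0
    step = max(1, (len(ids) - n - 1) // probes)  # evenly spaced prefixes
    for cut in range(1, len(ids) - n, step):
        out = model.generate(
            ids[:cut].unsqueeze(0), max_new_tokens=n, do_sample=False
        )
        hits += int(torch.equal(out[0, cut:cut + n], ids[cut:cut + n]))
        total += 1
    return hits / max(total, 1)
```

Averaging this over a test file, for the original split and for paraphrased rewrites of it, yields numbers like those above; a score that is much higher on the original split than on the rewrites points to memorization rather than ability.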

@fblgit

Hi!

Thank you for your comment. However, I didn't use nvidia/OpenMathInstruct-2 for training.

My training dataset consisted of many separate datasets in Russian and English, which can be divided into three groups:

  1. Fully synthetic datasets generated by me using a large model.

  2. Datasets automatically translated from English to Russian, focused on solving mathematical and logical problems.

  3. Russian-language datasets built around NLP tasks for Russian (paraphrasing, summarization, etc.).

In the second group were the TIGER-Lab/MathInstruct and KK04/LogicInference_OA datasets, which I translated into Russian using NLLB-200-3.3B, followed by automated error checking and translation-hallucination detection. However, as far as I understand, TIGER-Lab/MathInstruct and nvidia/OpenMathInstruct-2 are different datasets, even though they cover the same subject area.
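
For reference, a minimal sketch of the translation step described above, using the standard transformers seq2seq API for facebook/nllb-200-3.3B; the error checking and hallucination detection are omitted, and the function name and generation settings are assumptions.

```python
# Illustrative sketch of the English-to-Russian translation step with
# NLLB-200-3.3B. The downstream error checking and translation-hallucination
# detection mentioned above are omitted.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

def translate_en_ru(text: str) -> str:
    """Translate one English string into Russian."""
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
    generated = model.generate(
        **inputs,
        # Force the decoder to start with the Russian (Cyrillic) language code.
        forced_bos_token_id=tok.convert_tokens_to_ids("rus_Cyrl"),
        max_length=512,
        num_beams=4,
    )
    return tok.batch_decode(generated, skip_special_tokens=True)[0]

print(translate_en_ru("Solve for x: 2x + 3 = 11."))
```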

Regardless of which dataset is part of your training, the contamination of the model weights is a fact.
