[FLAG] MATH contamination: `bond005/meno-tiny-0.1`
@bond005
@clefourrier
I believe that OpenMathInstruct-2 is part of the training data for this model, which unfortunately seems to be contaminated.
Regardless of which dataset is part of the training, the contamination of the model weights is a fact. That said, the numbers and samples involved are the same as in the previous case.
According to the contamination benchmarks:
- ~200 MATH tests show extra contamination
- ~35 MATH_HARD tests show extra contamination
Contamination tests for the base model (Qwen2.5-1.5B-Instruct), 5-gram accuracy:
- MATH_rewritten-test-1: 0.25320000000000004
- MATH_rewritten-test-2: 0.2690666666666667
- MATH_rewritten-test-3: 0.2692
- orgn-MATH-test: 0.27053333333333335
- GSM8K_rewritten-test-1: 0.21971190295678544
- GSM8K_rewritten-test-2: 0.2227445034116755
- GSM8K_rewritten-test-3: 0.2172858225928734
- orgn-GSM8K-test: 0.23290371493555728
Contamination tests for this model (meno-tiny-0.1), 5-gram accuracy:
- MATH_rewritten-test-1: 0.3384666666666667
- MATH_rewritten-test-2: 0.3502666666666667
- MATH_rewritten-test-3: 0.3504666666666667
- orgn-MATH-test: 0.3519333333333334
- GSM8K_rewritten-test-1: 0.23320697498104626
- GSM8K_rewritten-test-2: 0.2400303260045489
- GSM8K_rewritten-test-3: 0.23290371493555728
- orgn-GSM8K-test: 0.26277482941622443
The reproduction is simple:
- clone https://github.com/GAIR-NLP/benbench
- modify the script in src/ to use the model, and set the tests to `math` and `orgsm8k`
- run it and collect the results
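For intuition, the metric behind those numbers can be sketched as follows. This is a hypothetical re-implementation, not benbench's actual code (the real scripts differ in prompting, position sampling, and scoring): give the model a prefix of a benchmark sample, greedily generate the next 5 tokens, and check whether they exactly match the reference continuation.

```python
# Hypothetical sketch of 5-gram accuracy (illustration only, not benbench's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bond005/meno-tiny-0.1"  # swap in "Qwen/Qwen2.5-1.5B-Instruct" for the baseline
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def five_gram_accuracy(text: str, n: int = 5, max_starts: int = 10) -> float:
    """Fraction of prefix positions where the model reproduces the next n tokens exactly."""
    ids = tok(text, return_tensors="pt").input_ids[0].to(model.device)
    if len(ids) <= n + 1:
        return 0.0
    step = max(1, (len(ids) - n - 1) // max_starts)
    hits = total = 0
    for pos in range(1, len(ids) - n, step):
        prefix = ids[:pos].unsqueeze(0)
        with torch.no_grad():
            out = model.generate(prefix, max_new_tokens=n, do_sample=False)
        pred, ref = out[0, pos:pos + n], ids[pos:pos + n]
        hits += int(pred.shape == ref.shape and torch.equal(pred, ref))
        total += 1
    return hits / total

# Average this over every sample of the original test set and of the rewritten
# variants, then compare the two averages.
```

A noticeably higher average on the original test split than on its rewritten variants is the memorization signal reported above.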
Hi! Thanks for the flag!
As usual, let's wait for a message from the authors to see if they are aware of the issue, and in the absence of an answer we'll flag :)
Hi!
Thank you for your comment. However, I didn't use nvidia/OpenMathInstruct-2 for training.
My training dataset consisted of many separate datasets in Russian and English, which can be divided into three groups:
1. Fully synthetic datasets generated by me using a large model.
2. Datasets automatically translated from English to Russian, focused on solving mathematical and logical problems.
3. Russian-language datasets built from NLP tasks for Russian (paraphrasing, summarization, etc.).
The second group included the TIGER-Lab/MathInstruct and KK04/LogicInference_OA datasets, which I translated into Russian using NLLB-200-3.3B, followed by automated error checking and translation-hallucination detection. However, as far as I understand, TIGER-Lab/MathInstruct and nvidia/OpenMathInstruct-2 are different datasets, even though they belong to the same subject area.
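Roughly, that translation step looks like the sketch below (a simplified illustration, not the actual pipeline; the error checking and hallucination detection are omitted):

```python
# Simplified sketch: translate English samples to Russian with NLLB-200-3.3B
# via transformers. Quality filtering is intentionally not shown.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

NLLB = "facebook/nllb-200-3.3B"
tok = AutoTokenizer.from_pretrained(NLLB, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(NLLB)

def to_russian(text: str) -> str:
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
    out = model.generate(
        **inputs,
        forced_bos_token_id=tok.convert_tokens_to_ids("rus_Cyrl"),  # target language
        max_new_tokens=512,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

print(to_russian("Solve the equation 2x + 3 = 11 and explain each step."))
```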
Which dataset is actually bringing in the contamination is something you can investigate further.
Regardless of which dataset is part of your training, the contamination of the model weights is a fact.