Incorrect IFEval benchmark

#879
by DavidGF - opened

Hello everyone,

Apparently the IFEval evaluation went wrong for our model, and unfortunately I can't explain how this could have happened. As you can see in the IFEval result dataset, most of the responses are simply empty. In our internal tests (run according to the HF leaderboard documentation) everything worked correctly.
You can also see from the remaining benchmarks that the IFEval score cannot have been calculated correctly, as all other values are similar to our internal results (we have also included diagrams in the model card where you can check this).

https://huggingface.co/datasets/open-llm-leaderboard/VAGOsolutions__SauerkrautLM-gemma-2-2b-it-details/viewer/VAGOsolutions__SauerkrautLM-gemma-2-2b-it__leaderboard_ifeval?row=16
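In case it helps, this is roughly how I checked the details dataset programmatically (just a sketch: the config name comes from the viewer URL above, but the split name and the response column names are assumptions about the usual layout of the details repos):

```python
from datasets import load_dataset

# Sketch only: the "latest" split and the response column names are assumptions
# about the usual layout of the leaderboard details repositories.
ds = load_dataset(
    "open-llm-leaderboard/VAGOsolutions__SauerkrautLM-gemma-2-2b-it-details",
    name="VAGOsolutions__SauerkrautLM-gemma-2-2b-it__leaderboard_ifeval",
    split="latest",
)
print(ds.column_names)

def is_empty(example):
    # the generated text usually sits in "filtered_resps" or "resps" as a nested list
    resp = example.get("filtered_resps") or example.get("resps") or [""]
    text = resp[0] if isinstance(resp[0], str) else resp[0][0]
    return {"empty": text.strip() == ""}

checked = ds.map(is_empty)
print(sum(checked["empty"]), "of", len(ds), "responses are empty")
```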

Do you have an idea or even a solution?

Thanks in advance,
David

Open LLM Leaderboard org

Hi @DavidGF ,

Thank you for reporting this issue! We will need some time to check it; I will get back to you as soon as I have more info.

Open LLM Leaderboard org

Hi @DavidGF ,

It looks like the issue with your model’s responses to the IFEval benchmark is challenging to pinpoint. Sometimes the model responds as expected, but other times it doesn't, particularly with more complex prompts.

From what I can see, everything seems set up correctly, like the BOS token and the chat template, so the problem might be related to how the model handles the specific generation settings. These settings might be making the model too rigid, which could explain why it occasionally fails to generate a response. I've also tried re-evaluating your model and got the same results. Have you also tried evaluating your model with the added BOS token?
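If it helps, here is a minimal sketch of the kind of local check I mean (only a sketch; the prompt is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("VAGOsolutions/SauerkrautLM-gemma-2-2b-it")

messages = [{"role": "user", "content": "Write exactly three bullet points about autumn."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Does the rendered chat template already start with <bos>?
print(repr(prompt[:40]))

# If it does, tokenizing again with add_special_tokens=True prepends a second
# BOS token, which is a known source of degraded generations with Gemma models.
ids = tok(prompt, add_special_tokens=True).input_ids
print(ids[:3], "bos_token_id =", tok.bos_token_id)
```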

Hello @alozowski ,
First of all, thank you very much for your efforts!
We have also evaluated the model several times with the lm-evaluation-harness and have not run into any problems.
If the issue were specific to our model, the behavior should not also appear in the other models I mentioned; a lot of Gemma 2 fine-tunes are affected by it.
The results on the remaining benchmarks also show that the model performs well, so I don't think it is simply overwhelmed by the complexity of certain IFEval prompts.

Open LLM Leaderboard org

Hi @DavidGF ,

Thank you for the additional context! I need more time to investigate this issue, but I'll get back to you as soon as I have more info or a potential solution.

Open LLM Leaderboard org

Hi @DavidGF ,

After manually inspecting the different outputs (and re-running the model locally), we have not been able to pinpoint where this failure comes from, as we consistently get the same results.

Could you share:

  1. the command you are using to run the harness locally? (Notably, are you using the same command as the one indicated in the Reproducibility section of our docs, including `fewshot_as_multiturn`? A rough sketch of that command is included below for reference.)
  2. a detailed result file of one of your runs?
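For reference, here is a rough sketch (via the harness's Python API) of the kind of run we compare against; the argument names mirror the CLI flags from the Reproducibility section and may differ slightly depending on your harness version:

```python
import lm_eval

# Sketch only: assumes a recent lm-evaluation-harness (>= 0.4.3) with the
# leaderboard task group installed; exact argument names may vary by version.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=VAGOsolutions/SauerkrautLM-gemma-2-2b-it,dtype=bfloat16",
    tasks=["leaderboard_ifeval"],
    batch_size="auto",
    apply_chat_template=True,   # render prompts through the model's chat template
    fewshot_as_multiturn=True,  # present few-shot examples as a multi-turn dialogue
)
print(results["results"].get("leaderboard_ifeval"))
```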

Thanks a lot for your answer; it would help us debug this much faster!

Open LLM Leaderboard org

Closing for inactivity

clefourrier changed discussion status to closed
