Incorrect IFEval benchmark

#879
by DavidGF - opened

Hello everyone,

Apparently the IFEval evaluation went wrong for our model, and unfortunately I can't explain how this could have happened. As you can see in the IFEval result dataset, most of the responses are simply empty. In our internal tests (run according to the HF leaderboard documentation) everything worked correctly.
You can also see from the remaining benchmarks that the IFEval score cannot have been calculated correctly, as all other values are similar to our internal results (we have also included diagrams in the model card where you can check this).

https://huggingface.co/datasets/open-llm-leaderboard/VAGOsolutions__SauerkrautLM-gemma-2-2b-it-details/viewer/VAGOsolutions__SauerkrautLM-gemma-2-2b-it__leaderboard_ifeval?row=16
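In case it helps, this is roughly how I checked the details dataset programmatically (just a sketch: the config name comes from the viewer URL above, but the split name and the response column names are assumptions about the usual layout of the details repos):

```python
from datasets import load_dataset

# Sketch only: the "latest" split and the response column names are assumptions
# about the usual layout of the leaderboard details repositories.
ds = load_dataset(
    "open-llm-leaderboard/VAGOsolutions__SauerkrautLM-gemma-2-2b-it-details",
    name="VAGOsolutions__SauerkrautLM-gemma-2-2b-it__leaderboard_ifeval",
    split="latest",
)
print(ds.column_names)

def is_empty(example):
    # the generated text usually sits in "filtered_resps" or "resps" as a nested list
    resp = example.get("filtered_resps") or example.get("resps") or [""]
    text = resp[0] if isinstance(resp[0], str) else resp[0][0]
    return {"empty": text.strip() == ""}

checked = ds.map(is_empty)
print(sum(checked["empty"]), "of", len(ds), "responses are empty")
```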

Do you have an idea or even a solution?

Thanks in advance,
David

Open LLM Leaderboard org

Hi @DavidGF ,

Thank you for reporting this issue! We will need some time to check it; I will get back to you as soon as I have more info.

Open LLM Leaderboard org

Hi @DavidGF ,

It looks like the issue with your model’s responses to the IFEval benchmark is challenging to pinpoint. Sometimes the model responds as expected, but other times it doesn't, particularly with more complex prompts.

From what I can see, everything seems set up correctly, like the BOS token and the chat template, so the problem might be related to how the model handles the specific generation settings. These settings might be making the model too rigid, which could explain why it occasionally fails to generate a response. I've also tried re-evaluating your model and got the same results. Have you also tried evaluating your model with the added BOS token?
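If it helps, here is a minimal sketch of the kind of local check I mean (only a sketch; the prompt is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("VAGOsolutions/SauerkrautLM-gemma-2-2b-it")

messages = [{"role": "user", "content": "Write exactly three bullet points about autumn."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Does the rendered chat template already start with <bos>?
print(repr(prompt[:40]))

# If it does, tokenizing again with add_special_tokens=True prepends a second
# BOS token, which is a known source of degraded generations with Gemma models.
ids = tok(prompt, add_special_tokens=True).input_ids
print(ids[:3], "bos_token_id =", tok.bos_token_id)
```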

Hello @alozowski ,
First of all, thank you very much for your efforts!
We have also evaluated the model several times with the lm-evaluation-harness and have not run into any problems.
If the issue were specific to our model, the behavior should not also appear in the other models I mentioned; a lot of Gemma 2 fine-tunes are affected by it.
The results on the remaining benchmarks also show that the model performs well, so I don't think it is simply overwhelmed by the complexity of certain IFEval prompts.

Open LLM Leaderboard org

Hi @DavidGF ,

Thank you for the additional context! I need more time to investigate this issue, but I'll get back to you as soon as I have more info or a potential solution.

Open LLM Leaderboard org

Hi @DavidGF ,

After manually inspecting the different outputs (and re-running the model locally), we have not been able to pinpoint where this failure comes from, as we consistently get the same results.

Could you share:

  1. the command you are using to run the harness locally? (Notably, are you using the same command as the one indicated in the Reproducibility section of our docs, including `fewshot_as_multiturn`? A rough sketch of that command is included below for reference.)
  2. a detailed result file of one of your runs?
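For reference, here is a rough sketch (via the harness's Python API) of the kind of run we compare against; the argument names mirror the CLI flags from the Reproducibility section and may differ slightly depending on your harness version:

```python
import lm_eval

# Sketch only: assumes a recent lm-evaluation-harness (>= 0.4.3) with the
# leaderboard task group installed; exact argument names may vary by version.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=VAGOsolutions/SauerkrautLM-gemma-2-2b-it,dtype=bfloat16",
    tasks=["leaderboard_ifeval"],
    batch_size="auto",
    apply_chat_template=True,   # render prompts through the model's chat template
    fewshot_as_multiturn=True,  # present few-shot examples as a multi-turn dialogue
)
print(results["results"].get("leaderboard_ifeval"))
```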

Thanks a lot for your answer; it would help us debug this much faster!

Open LLM Leaderboard org

Closing for inactivity

clefourrier changed discussion status to closed
