Errors in eval
Hi, I'm guijin, the author of KMMLU.
KMMLU is designed as a four-option MCQA benchmark, which implies that the minimum score of a model, no matter how weak, is 25% (random chance).
While the README acknowledges there may be errors in the Qwen3 scores, we find it problematic to report such a score and would advise fixing it if possible.
If the error persists, please contact us so that we can try to resolve it together.
Thank you for your comment.
You're right: KMMLU is a four-option MCQA benchmark, so scores shouldn't fall below 25%.
After checking the logs and the model card, I found that we evaluated it only in generative mode, using the "kmmlu_direct" task in lm-evaluation-harness, and this wasn't clearly stated. In that mode, any output that can't be parsed into one of the four options is scored as wrong, which is how accuracy can drop below the 25% floor.
MMLU was run with the same setup (copied from "kmmlu_direct"), so we didn’t use the "mmlu_generative" task there either.
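For anyone reproducing the numbers, here is a minimal sketch of the two setups, assuming lm-evaluation-harness's `simple_evaluate` Python API and a placeholder model name:

```python
import lm_eval

# Generative setup we actually used: the model writes an answer as free
# text, which is then parsed; unparseable outputs count as wrong, so
# accuracy can fall below the 25% random-chance floor.
generative = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-model",  # placeholder
    tasks=["kmmlu_direct"],
)

# Standard multiple-choice setup: the harness ranks the log-likelihoods
# of the four options, so expected accuracy never drops below chance.
# "kmmlu" is the assumed name of the log-likelihood task group.
multiple_choice = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-model",  # placeholder
    tasks=["kmmlu"],
)
```

The generative task is still useful for chat-style models, but its scores are only comparable to the standard ones once answer extraction works reliably.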
We’ll update the model card as soon as possible. In the meantime, we’ll add a note to avoid confusion.
We would like to inform you that the reevaluation of the model has been completed.
Upon review, we identified the following issues in the previous evaluation:
- The instruct model was evaluated under a 5-shot setting.
- The model's answers were not properly preprocessed before scoring (see the sketch below).
These oversights led to inaccuracies in the KMMLU scores. We sincerely apologize for any confusion this may have caused.
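For transparency, the missing preprocessing was essentially answer extraction before string matching. Here is a minimal sketch of the idea; the regex and the example output are illustrative, not the harness's actual filter chain:

```python
import re

def extract_choice(generation: str) -> str | None:
    """Pull the first standalone A-D choice letter out of a free-form generation.

    Illustrative only: the real evaluation applies its own answer filters,
    and option formats may differ.
    """
    match = re.search(r"\b([A-D])\b", generation.strip().upper())
    return match.group(1) if match else None

# Without this step, an output like "The answer is (C)." fails a raw
# string comparison against the gold label "C" and is scored as wrong.
assert extract_choice("The answer is (C).") == "C"
```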