errors in eval

#1
by amphora - opened

Hi, I'm Guijin, the author of KMMLU.

KMMLU is, by design, a four-option MCQA benchmark, which implies that the minimum performance of any model, no matter how bad, is 25% (random chance).

While the README acknowledges there may be errors in the Qwen3 scores, we find it problematic to report such scores and would advise fixing them if possible.

If the error persists, please contact us so we can try to resolve it together.

Konan Technology org

Thank you for your comment.

You’re right—KMMLU is a four-option MCQA benchmark, so scores shouldn’t fall below 25%.

After checking the logs and the model card, I found that we evaluated the model only in generative mode, using the "kmmlu_direct" task in lm-evaluation-harness. This wasn't clearly stated.

MMLU was run with the same setup (copied from "kmmlu_direct"), so we didn’t use the "mmlu_generative" task there either.
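To illustrate why generative-mode evaluation can fall below the 25% floor: log-likelihood choice selection always picks one of the four options, but exact-match scoring of free-form generations can score zero whenever the output isn't a bare letter. This is a hypothetical sketch of the two scoring modes, not the actual harness code:

```python
import random

CHOICES = ["A", "B", "C", "D"]

def score_choice_selection(gold):
    """Log-likelihood-style: the model must pick one of the four
    options. Even the worst case (random choice, simulated here)
    averages 25% accuracy."""
    return int(random.choice(CHOICES) == gold)

def score_exact_match(generation, gold):
    """Generative-style: compare the raw generation to the gold
    letter verbatim. Any extra text makes the answer count as wrong."""
    return int(generation == gold)

# A model that answers in a sentence gets 0 under exact match,
# even though it names the correct option.
print(score_exact_match("The answer is B.", "B"))  # 0
print(score_exact_match("B", "B"))                 # 1
```

Under this kind of scoring, a chat-tuned model's verbose answers can drive accuracy well below 25% without the model actually being worse than random.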

We’ll update the model card as soon as possible. In the meantime, we’ll add a note to avoid confusion.

momo changed discussion status to closed
momo changed discussion status to open
Konan Technology org

We would like to inform you that the reevaluation of the model has been completed.

Upon review, we identified the following issues in the previous evaluation:

  1. The instruct model was evaluated under a 5-shot setting.
  2. The answers were not properly preprocessed prior to evaluation.

These oversights led to inaccuracies in the KMMLU scores. We sincerely apologize for any confusion this may have caused.
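For readers wondering what "preprocessing the answers" might look like: one common approach is to extract the option letter from the free-form generation before comparing it to the gold answer. This is a hypothetical sketch of that idea, not Konan's actual pipeline:

```python
import re

def extract_choice(generation):
    """Return the first standalone A-D letter in the generation,
    or None if no option letter can be found."""
    m = re.search(r"\b([ABCD])\b", generation.strip())
    return m.group(1) if m else None

def score(generation, gold):
    """Compare the extracted letter (rather than the raw string)
    against the gold answer."""
    return int(extract_choice(generation) == gold)

print(score("The answer is (C).", "C"))  # 1
print(score("C", "C"))                   # 1
print(score("I think it is 3.", "C"))    # 0
```

Without a step like this, verbose answers such as "The answer is (C)." are marked wrong under exact match, which is one way scores can sink below the 25% floor.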
