Possible Error in HallusionBench Score Reporting

#1
by jaycha - opened

It seems that the HallusionBench score may have been reported as aAcc.
IMO it would be better either to report the average of aAcc, qAcc, and fAcc (as done in OpenCompass) or to explicitly state that the reported score represents aAcc.
Thanks :)

SK Telecom org

Thanks a lot for your interest and feedback!

Our reported score reflected aAcc, and we have now updated the results to show the average of aAcc, qAcc, and fAcc as you suggested.

Please refer to the detailed results below.

Before

| Benchmark | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B | Qwen2.5-VL-32B |
|---|---|---|---|---|---|
| HallusionBench | 69.6 | 70.2 | 66.3 | 70.4 | 72.0 |

After

| Benchmark | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B | Qwen2.5-VL-32B |
|---|---|---|---|---|---|
| HallusionBench | 54.2 | 52.7 | 49.6 | 53.8 | 58.0 |
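
For readers comparing the two rows, here is a minimal sketch of the aggregation discussed in this thread: the unweighted mean of aAcc, qAcc, and fAcc. The function name and the input values are hypothetical and for illustration only. The per-metric numbers behind the tables above are not given in this thread, and this is not the evaluation code used by SK Telecom or OpenCompass.

```python
def hallusionbench_overall(a_acc: float, q_acc: float, f_acc: float) -> float:
    """Return the unweighted mean of the three accuracies (all in percent)."""
    return (a_acc + q_acc + f_acc) / 3


# Hypothetical per-metric scores, NOT taken from any model in this thread.
scores = {"aAcc": 60.0, "qAcc": 30.0, "fAcc": 45.0}

overall = hallusionbench_overall(scores["aAcc"], scores["qAcc"], scores["fAcc"])

print(f"aAcc only: {scores['aAcc']:.1f}")
print(f"Average of aAcc, qAcc, fAcc: {overall:.1f}")
```

With made-up inputs like these, the averaged score lands below aAcc alone. The tables above show the same direction of change, with every model's score dropping once the average is reported.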
liveseongho changed discussion status to closed
