Possible Error in HallusionBench Score Reporting
#1 by jaycha - opened
It seems that the HallusionBench score may have been reported as aAcc.
IMO it would be better either to report the average of aAcc, qAcc, and fAcc (as done in OpenCompass) or to explicitly state that the reported score represents aAcc.
Thanks :)
Thanks a lot for your interest and feedback!
Our reported score reflected aAcc, and we have now updated the results to show the average of aAcc, qAcc, and fAcc as you suggested.
Please refer to the detailed results below.
Before (aAcc only)
| Benchmark | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B | Qwen2.5-VL-32B |
|---|---|---|---|---|---|
| HallusionBench | 69.6 | 70.2 | 66.3 | 70.4 | 72.0 |
After (average of aAcc, qAcc, and fAcc)
| Benchmark | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B | Qwen2.5-VL-32B |
|---|---|---|---|---|---|
| HallusionBench | 54.2 | 52.7 | 49.6 | 53.8 | 58.0 |
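
For readers comparing the two tables, the averaged score can be sketched as below. This is only a minimal illustration, assuming the overall score is the unweighted mean of aAcc, qAcc, and fAcc on a 0-100 scale. The function name `hallusion_overall` and the example inputs are hypothetical; they are not taken from the evaluation code or from the results in this thread.

```python
def hallusion_overall(a_acc: float, q_acc: float, f_acc: float) -> float:
    """Return the unweighted mean of the three HallusionBench accuracies.

    Assumption: all three inputs are percentages on a 0-100 scale.
    """
    for name, value in (("aAcc", a_acc), ("qAcc", q_acc), ("fAcc", f_acc)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be within [0, 100], got {value}")
    return (a_acc + q_acc + f_acc) / 3.0


if __name__ == "__main__":
    # Illustrative values only, not real benchmark results.
    score = hallusion_overall(a_acc=60.0, q_acc=30.0, f_acc=45.0)
    print(f"Overall: {score:.1f}")
```

Because qAcc and fAcc are typically lower than aAcc, the averaged number comes out lower than aAcc alone, which matches the drop between the two tables.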
liveseongho changed discussion status to closed