Clarification on the Anyscale accuracy score benchmark
Hi Team,
We are working on publishing a comparison between different hallucination evaluation techniques (including our own). We have been able to reproduce your results on the TRUE and SummaC datasets. However, we were unable to reproduce the 86% accuracy results using the Anyscale Ranking test. I am looking for clarification on how you performed this test. My understanding is as follows:
- Start with the sentence-pairs dataset as used in the Anyscale notebook.
- Use the `article_sent` column as the source of truth.
- Check the `correct_sent` column against the `article_sent` column for consistency.
- Repeat the same for the `incorrect_sent` column.
- Calculate accuracy based on the number of right answers. The dataset has 373 rows, which translates to 746 items (since we use both the `correct_sent` and the `incorrect_sent` from each row).
Based on the method above, we are seeing an accuracy of 66.35% with the existing Vectara model. Let me know if I am missing something here, or if you used a different dataset/methodology to calculate the accuracy.
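For concreteness, this is roughly what we ran (a minimal sketch: the CSV filename is a placeholder for our local copy of the Anyscale sentence-pairs data, the model is loaded as a CrossEncoder as in the model card example, and the 0.5 decision threshold is our own choice, not an official one):

```python
import pandas as pd
from sentence_transformers import CrossEncoder

# Placeholder path to a local copy of the Anyscale sentence-pairs dataset,
# assumed to have article_sent, correct_sent and incorrect_sent columns.
df = pd.read_csv("anyscale_sentence_pairs.csv")

# Load HHEM as a cross-encoder (as in the model card's example usage).
model = CrossEncoder("vectara/hallucination_evaluation_model")

pairs, labels = [], []
for _, row in df.iterrows():
    pairs.append([row["article_sent"], row["correct_sent"]])
    labels.append(1)  # expected to be judged consistent
    pairs.append([row["article_sent"], row["incorrect_sent"]])
    labels.append(0)  # expected to be judged inconsistent

scores = model.predict(pairs)  # consistency scores in [0, 1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is our choice
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"Absolute accuracy over {len(labels)} items: {accuracy:.2%}")
```

This treats each of the 746 items as an independent consistent/inconsistent judgement, i.e. the absolute reading of the task.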
Thank you!
Hello Preetam,
I was able to replicate the results Simon achieved. The key is the following sentence in the prompt used by Anyscale:
> Decide which of the following summary is more consistent with the article sentence.
> Note that consistency means all information in the summary is supported by the article.
Crucially, the LLMs are asked to make a relative comparison about consistency, not an absolute judgement. The former is an easier problem than the latter.
In other words, given two summaries, `summary_good` and `summary_bad`, the accuracy is the percentage of cases where `hhem(summary_good) > hhem(summary_bad)`.
While investigating the issue you raised, I also realized that the following rows of the dataset are invalid because the correct and incorrect sentences are identical: 44, 180, 328.
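To make the distinction concrete, here is a minimal sketch of the ranking-style computation (the CSV filename is a placeholder for a local copy of the sentence-pairs data, the CrossEncoder loading mirrors the model card example, and I am assuming the row numbers above correspond to the default pandas index):

```python
import pandas as pd
from sentence_transformers import CrossEncoder

# Placeholder path; same assumed columns as in the sketch above.
df = pd.read_csv("anyscale_sentence_pairs.csv")

# Drop the rows where correct_sent == incorrect_sent (assuming they map
# onto the default integer index).
df = df.drop(index=[44, 180, 328])

model = CrossEncoder("vectara/hallucination_evaluation_model")
good_scores = model.predict(df[["article_sent", "correct_sent"]].values.tolist())
bad_scores = model.predict(df[["article_sent", "incorrect_sent"]].values.tolist())

# A row counts as correct when the correct sentence scores strictly higher.
wins = sum(g > b for g, b in zip(good_scores, bad_scores))
print(f"Ranking accuracy over {len(df)} rows: {wins / len(df):.2%}")
```

Scored this way, each of the 370 remaining rows contributes one pairwise decision rather than two independent absolute judgements.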
Hi Amin,
Thank you for looking into this. This makes sense. It would be great to clarify in the model card that this accuracy is for a relative comparison (closer to a ranking task).
Also, I think it would be great to open-source the code for these benchmarks (either as a notebook or as a library). We are happy to do that - let me know if you are interested in collaborating on this.
Thanks,
Preetam
> Also, I think it would be great to open-source the code for these benchmarks (either as a notebook or as a library). We are happy to do that - let me know if you are interested in collaborating on this.
That sounds like an interesting idea. Which benchmarks were you specifically referring to? The three mentioned in the model card, or the Hallucination Leaderboard?
For context, Simon, the research lead, sadly passed away unexpectedly last November. Without him, our team has been stretched thin, and we haven't been able to actively push forward all his work.
> The three mentioned in the model card, or the Hallucination Leaderboard?
Yes, the three mentioned in the model card.
I did hear the unfortunate news about Simon; it was very sad. I understand that your team is short on resources. Our team can take a first pass at this - we can sync offline to figure out how to collaborate, if possible.
That sounds good. You can reach me directly at [email protected]. I look forward to hearing from you there.