flowers-team/StickToYourRoleLeaderboard
grg committed 7 days ago

Commit 846ce90 (1 parent: bce438f)

update methodology with CoT

Files changed (1)

templates/about.html CHANGED

@@ -369,6 +369,7 @@ their expression of that value).
                 These changes were made to keep up with the newly released model and to make the evaluation more detailed.
                 We describe additions made in the leaderboard here for clarity:
                 <ol>
+                    <li>Chain-of-Thought (CoT) evaluation was used</li>
                     <li>a new population was created and was balanced with respect to gender</li>
                     <li>context chunks - instead of evaluating the stability of a population between pairs of contexts, where all personas are given the same topic (e.g. chess), we evaluate it between pairs of context chunks, where each participant is given a different random context</li>
                     <li>more diverse and longer contexts (up to 6k tokens) were created with reddit posts from the <a target="_blank" href="https://webis.de/data/webis-tldr-17.html">webis dataset</a> (the dataset was cleaned to exclude posts from NSFW subreddits)</li>