Update README.md
README.md
CHANGED
@@ -108,7 +108,7 @@ Below, we show the scores for the Greek version of Arena-Hard-Auto for various o
 Llama-Krikri-8B Instruct exhibits very strong chat capabilities by scoring **higher than models over 8 times its size** (such as Llama-3.1-70B Instruct) and is also **competitive with closed-source** (e.g., GPT-4o-Mini) and **highly-performant open-source models** (e.g., Gemma 2 27B IT & Aya Expanse 32B).
 
 
-**Please note** that [recent research](https://arxiv.org/pdf/2502.01534?) has shown that judge models are biased towards student models, i.e., models finetuned on distilled data from
+**Please note** that [recent research](https://arxiv.org/pdf/2502.01534?) has shown that judge models are biased towards student models, i.e., models finetuned on distilled data from the stronger & larger teacher model which also acts as a judge. While details on the post-training data of GPT-4o-Mini are undisclosed, it would be very reasonable to assume that it has been trained -at least partly- with GPT-4o serving as the teacher model and therefore that the **judge is biased towards the baseline model**.
 
 Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology of using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.
 