Add images and update README for ArenaHard evals

README.md CHANGED

@@ -99,15 +99,21 @@ We can observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance
| Llama-Krikri-8B Instruct | **67.5%** | **82.4%** |

We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek). We report two scores for Arena-Hard-Auto:
- No Style Control: The original version of the benchmark.
- With Style Control: The benchmark with style control methods for Markdown elements. You can read more about the methodology and technical background in this [blog post](https://lmsys.org/blog/2024-08-28-style-control/).
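
For intuition, here is a minimal sketch of the style-control idea described in that post: a Bradley-Terry logistic regression over pairwise battles is augmented with style-difference covariates (e.g., response length and Markdown element counts), so the fitted model strengths are adjusted for style. This is an illustrative toy under our own assumptions, not the official arena-hard-auto implementation; the battle data and feature values below are made up.

```python
# Toy Bradley-Terry regression with style-control covariates (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

n_models = 3          # toy setting: models 0..2
battles = [           # (model_a, model_b, a_won, style diff of A minus B)
    (0, 1, 1, [+0.4, +0.2]),   # style features: [norm. length diff, norm. markdown diff]
    (0, 2, 0, [-0.1, +0.3]),
    (1, 2, 1, [+0.2, -0.5]),
    (2, 0, 1, [+0.6, +0.1]),
]

X, y = [], []
for a, b, a_won, style in battles:
    row = np.zeros(n_models + len(style))
    row[a], row[b] = 1.0, -1.0          # Bradley-Terry design: +1 / -1 indicators
    row[n_models:] = style              # style-difference covariates
    X.append(row)
    y.append(a_won)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strengths = clf.coef_[0][:n_models]     # style-adjusted model strengths
style_coefs = clf.coef_[0][n_models:]   # how much style alone sways outcomes
print("style-adjusted strengths:", strengths)
print("style effects:", style_coefs)
```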

Below, we show the scores for the Greek version of Arena-Hard-Auto for various open and closed chat models, determined using **gpt-4o-2024-08-06 as the judge model** and **gpt-4o-mini-2024-07-18 as the baseline model** (which, as the baseline, scores 50% by default).
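
To make the 50% default concrete: the judge compares each candidate answer against the baseline's answer and the verdicts are aggregated, so a model indistinguishable from the baseline lands at 50% by construction. The sketch below is a simplification under stated assumptions: `judge_compare` is a hypothetical stand-in for the judge-model call, the verdict-to-value mapping is ours, and the real harness judges each pair twice with swapped positions and aggregates with Bradley-Terry bootstrapping.

```python
# Simplified sketch of scoring a model against a fixed baseline via an LLM judge.
from typing import Callable

# Map the judge's five-way verdict to a win value for the candidate model.
VERDICT_VALUE = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}

def arena_score(prompts: list[str],
                candidate_answer: Callable[[str], str],
                baseline_answer: Callable[[str], str],
                judge_compare: Callable[[str, str, str], str]) -> float:
    """Average win value of the candidate vs. the baseline (0.5 = parity)."""
    total = 0.0
    for p in prompts:
        verdict = judge_compare(p, candidate_answer(p), baseline_answer(p))
        total += VERDICT_VALUE[verdict]
    return total / len(prompts)

# Sanity check: a model judged identical to the baseline scores 50% by design.
same = lambda p: "answer"
assert arena_score(["q1", "q2"], same, same, lambda p, a, b: "A=B") == 0.5
```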

Llama-Krikri-8B Instruct exhibits very strong chat capabilities, scoring **higher than models over 8 times its size** (such as Llama-3.1-70B Instruct), and is also **competitive with closed-source models** (e.g., GPT-4o-Mini) and **highly performant open-source models** (e.g., Gemma 2 27B IT & Aya Expanse 32B).

*(Image: ArenaHardGR, Greek Arena-Hard-Auto scores plot)*

**Please note** that [recent research](https://arxiv.org/pdf/2502.01534) has shown that judge models are biased towards student models, i.e., models fine-tuned on data distilled from a stronger/larger teacher model. While the post-training data of GPT-4o-Mini are undisclosed, it is reasonable to assume that it was trained, at least partly, with GPT-4o serving as the teacher model, and therefore that the **judge is biased towards the baseline model**.

Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology of using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.

Llama-Krikri-8B Instruct also performs very well on the English variant of Arena-Hard-Auto: it is **competitive with significantly larger previous-generation LLMs** (such as Qwen 2 72B Instruct) and it **improves upon Llama-3.1-8B Instruct by +24.5% / +16%** (No Style Control / With Style Control).

*(Image: ArenaHardEN, English Arena-Hard-Auto scores plot)*

🚨 **More information on post-training, methodology, and evaluation coming soon.** 🚨