Add images and update README for ArenaHard evals

README.md CHANGED

@@ -99,15 +99,21 @@ We can observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance
| Llama-Krikri-8B Instruct | **67.5%** | **82.4%** |

We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek). We report two scores for Arena-Hard-Auto:
- No Style Control: The original version of the benchmark.
- With Style Control: The benchmark with style control methods for Markdown elements. You can read more about the methodology and technical background in this [blog post](https://lmsys.org/blog/2024-08-28-style-control/).
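
For intuition, here is a minimal sketch of the style-control idea described in that post: a Bradley-Terry logistic regression over pairwise battles is augmented with style-difference covariates (e.g., response length and Markdown element counts), so the fitted model strengths are adjusted for style. This is an illustrative toy under our own assumptions, not the official arena-hard-auto implementation; the battle data and feature values below are made up.

```python
# Toy Bradley-Terry regression with style-control covariates (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

n_models = 3          # toy setting: models 0..2
battles = [           # (model_a, model_b, a_won, style diff of A minus B)
    (0, 1, 1, [+0.4, +0.2]),   # style features: [norm. length diff, norm. markdown diff]
    (0, 2, 0, [-0.1, +0.3]),
    (1, 2, 1, [+0.2, -0.5]),
    (2, 0, 1, [+0.6, +0.1]),
]

X, y = [], []
for a, b, a_won, style in battles:
    row = np.zeros(n_models + len(style))
    row[a], row[b] = 1.0, -1.0          # Bradley-Terry design: +1 / -1 indicators
    row[n_models:] = style              # style-difference covariates
    X.append(row)
    y.append(a_won)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strengths = clf.coef_[0][:n_models]     # style-adjusted model strengths
style_coefs = clf.coef_[0][n_models:]   # how much style alone sways outcomes
print("style-adjusted strengths:", strengths)
print("style effects:", style_coefs)
```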

Below, we show the scores for the Greek version of Arena-Hard-Auto for various open and closed chat models, determined using **gpt-4o-2024-08-06 as the judge model** and **gpt-4o-mini-2024-07-18 as the baseline model** (which, as the baseline, scores 50% by default).
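
To make the 50% default concrete: the judge compares each candidate answer against the baseline's answer and the verdicts are aggregated, so a model indistinguishable from the baseline lands at 50% by construction. The sketch below is a simplification under stated assumptions: `judge_compare` is a hypothetical stand-in for the judge-model call, the verdict-to-value mapping is ours, and the real harness judges each pair twice with swapped positions and aggregates with Bradley-Terry bootstrapping.

```python
# Simplified sketch of scoring a model against a fixed baseline via an LLM judge.
from typing import Callable

# Map the judge's five-way verdict to a win value for the candidate model.
VERDICT_VALUE = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}

def arena_score(prompts: list[str],
                candidate_answer: Callable[[str], str],
                baseline_answer: Callable[[str], str],
                judge_compare: Callable[[str, str, str], str]) -> float:
    """Average win value of the candidate vs. the baseline (0.5 = parity)."""
    total = 0.0
    for p in prompts:
        verdict = judge_compare(p, candidate_answer(p), baseline_answer(p))
        total += VERDICT_VALUE[verdict]
    return total / len(prompts)

# Sanity check: a model judged identical to the baseline scores 50% by design.
same = lambda p: "answer"
assert arena_score(["q1", "q2"], same, same, lambda p, a, b: "A=B") == 0.5
```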

Llama-Krikri-8B Instruct exhibits very strong chat capabilities, scoring **higher than models over 8 times its size** (such as Llama-3.1-70B Instruct), and is also **competitive with closed-source models** (e.g., GPT-4o-Mini) and **highly performant open-source models** (e.g., Gemma 2 27B IT & Aya Expanse 32B).

*(Image: ArenaHardGR, Greek Arena-Hard-Auto scores plot)*

**Please note** that [recent research](https://arxiv.org/pdf/2502.01534) has shown that judge models are biased towards student models, i.e., models fine-tuned on data distilled from a stronger/larger teacher model. While the post-training data of GPT-4o-Mini are undisclosed, it is reasonable to assume that it was trained, at least partly, with GPT-4o serving as the teacher model, and therefore that the **judge is biased towards the baseline model**.

Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology of using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.

Llama-Krikri-8B Instruct also performs very well on the English variant of Arena-Hard-Auto: it is **competitive with significantly larger previous-generation LLMs** (such as Qwen 2 72B Instruct) and it **improves upon Llama-3.1-8B Instruct by +24.5% / +16%** (No Style Control / With Style Control).

*(Image: ArenaHardEN, English Arena-Hard-Auto scores plot)*

🚨 **More information on post-training, methodology, and evaluation coming soon.** 🚨