droussis committed · verified
Commit 5d4863b · 1 Parent(s): 763d0e1

Add MT-Bench evals

Files changed (1): README.md (+17 −12)
README.md CHANGED
@@ -85,18 +85,23 @@ To build the SFT & DPO data, we utilized various methodologies including:

# Evaluation

- In the table below, we report the scores for [Greek IFEval](https://huggingface.co/datasets/ilsp/ifeval_greek) (strict) and [English IFEval](https://huggingface.co/datasets/google/IFEval) (strict) for various chat models that exhibit strong performance.
+ In the table below, we report the scores for our chat evaluation suite (scoring sketches follow the results table), which includes:
+ - [Greek IFEval](https://huggingface.co/datasets/ilsp/ifeval_greek) (strict)
+ - [English IFEval](https://huggingface.co/datasets/google/IFEval) (strict)
+ - [Greek MT-Bench](https://huggingface.co/datasets/ilsp/mt-bench-greek), using gpt-4o-2024-08-06 as the judge model
+ - [English MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts), using gpt-4o-2024-08-06 as the judge model

We can observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance* in instruction following for both Greek and English across all the models we tested. In particular, it surpasses Llama-3.1-8B-Instruct by **+21.7%** and **+7.3%** on the Greek and English IFEval respectively.
+ It also exhibits **the strongest chat capabilities on the Greek MT-Bench** (+0.28 compared to Aya Expanse 8B), while remaining very competitive on the English MT-Bench.
 
- | | IFEval EL (strict) | IFEval EN (strict) |
- |------------------------------|--------------------|--------------------|
- | Qwen 2.5 7B Instruct | 46.2% | 74.8% |
- | EuroLLM 9B Instruct | 51.3% | 64.5% |
- | Aya Expanse 8B | 50.4% | 62.2% |
- | Meltemi 7B v1.5 Instruct | 32.7% | 41.2% |
- | Llama-3.1-8B Instruct | 45.8% | 75.1% |
- | **Llama-Krikri-8B Instruct** | **67.5%** | **82.4%** |
+ | | IFEval EL (strict) | IFEval EN (strict) | MT-Bench EL | MT-Bench EN |
+ |------------------------------|--------------------|--------------------|-------------|-------------|
+ | Qwen 2.5 7B Instruct | 46.2% | 74.8% | 5.83 | **7.87** |
+ | EuroLLM 9B Instruct | 51.3% | 64.5% | 5.98 | 6.27 |
+ | Aya Expanse 8B | 50.4% | 62.2% | 7.68 | 6.92 |
+ | Meltemi 7B v1.5 Instruct | 32.7% | 41.2% | 6.25 | 5.46 |
+ | Llama-3.1-8B Instruct | 45.8% | 75.1% | 6.46 | 7.25 |
+ | **Llama-Krikri-8B Instruct** | **67.5%** | **82.4%** | **7.96** | 7.21 |
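For context on the "(strict)" qualifier: IFEval checks each verifiable instruction in a prompt with a deterministic verifier, and the strict variant applies no response normalization before checking. A minimal sketch of prompt-level strict accuracy (illustrative only, not the official IFEval implementation, which ships a dedicated verifier per instruction type and also reports loose and instruction-level variants):

```python
# Simplified sketch of IFEval-style strict, prompt-level scoring; not the
# official code. Each prompt carries verifiable instructions (e.g., "answer
# in at least 300 words"), each checkable by a deterministic function.
from typing import Callable

Verifier = Callable[[str], bool]  # response -> did it follow this instruction?

def prompt_level_strict_accuracy(
    results: list[tuple[str, list[Verifier]]],  # (model response, its verifiers)
) -> float:
    # Strict + prompt-level: a prompt counts only if *every* instruction
    # passes, with no normalization of the response beforehand.
    passed = [all(check(resp) for check in checks) for resp, checks in results]
    return 100.0 * sum(passed) / len(passed)

# Hypothetical usage with one toy verifier:
min_words = lambda n: (lambda resp: len(resp.split()) >= n)
print(prompt_level_strict_accuracy([("word " * 300, [min_words(300)])]))  # 100.0
```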
 
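The MT-Bench scores above come from single-answer grading: the judge model reads each question-answer pair and assigns a 1-10 rating. A minimal sketch of one such judging call, assuming the openai Python client (the prompt here is simplified; real MT-Bench uses fixed judge templates and grades two-turn conversations):

```python
# Minimal sketch of MT-Bench-style single-answer grading; illustrative only.
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """Rate the assistant's answer to the question on a scale of 1 to 10.
Reply with a verdict in the format: Rating: [[score]]

[Question]
{question}

[Answer]
{answer}"""

def judge_score(question: str, answer: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # the judge model used for the scores above
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
        temperature=0,  # keep the judge as deterministic as possible
    )
    verdict = response.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)  # parse "Rating: [[7]]"
    return float(match.group(1)) if match else float("nan")
```

A model's benchmark score is then the mean of these ratings over all prompts and turns.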
We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek). We report 2 scores for Arena-Hard-Auto:
@@ -108,13 +113,13 @@ Below, we show the scores for the Greek version of Arena-Hard-Auto for various o
Llama-Krikri-8B Instruct exhibits very strong chat capabilities by scoring **higher than models over 8 times its size** (such as Llama-3.1-70B Instruct) and is also **competitive with closed-source models** (e.g., GPT-4o-Mini) and **highly performant open-source models** (e.g., Gemma 2 27B IT & Aya Expanse 32B).
![image/png](arena_hard_el.png)

- **Please note** that [recent research](https://arxiv.org/pdf/2502.01534) has shown that judge models are biased towards student models, i.e., models fine-tuned on data distilled from the stronger and larger teacher model that also acts as the judge. While details on the post-training data of GPT-4o-Mini are undisclosed, it is reasonable to assume that it was trained, at least in part, with GPT-4o serving as the teacher model, and therefore that the **judge is biased towards the baseline model**.
-
- Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology of using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.
+ Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology by using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.
Llama-Krikri-8B Instruct also performs very well on the English variant of Arena-Hard-Auto: it is **competitive with significantly larger previous-generation LLMs** (such as Qwen 2 72B Instruct) and it **improves upon Llama-3.1-8B Instruct by +24.5% / +16%** (no style control / with style control).
![image/png](arena_hard_en.png)
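For intuition on how both Arena-Hard-Auto scores are produced: the judge compares each model answer against the baseline's answer on the same prompt, and the per-prompt verdicts are aggregated into a score against the baseline. A simplified sketch (illustrative only; the real pipeline judges both answer orderings, reports bootstrapped confidence intervals, and, with style control, additionally adjusts for answer length and markdown formatting):

```python
# Simplified sketch: turn pairwise judge verdicts into a score vs. the baseline.
# Illustrative only; see the lmarena-ai Arena-Hard-Auto tooling for the real
# scoring, which judges both answer orderings and bootstraps confidence bounds.

# One verdict per prompt: "model" = the evaluated model beat the baseline
# (gpt-4-0314), "baseline" = the baseline won, "tie" = judged comparable.
def score_vs_baseline(verdicts: list[str]) -> float:
    wins = sum(v == "model" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return 100.0 * (wins + 0.5 * ties) / len(verdicts)  # a tie counts as half a win

print(score_vs_baseline(["model", "baseline", "tie", "model"]))  # 62.5
```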

+ **Please note** that judge models are biased towards student models trained on data distilled from them; you can read more [here](https://arxiv.org/pdf/2502.01534).
+
🚨 **More information on post-training, methodology, and evaluation coming soon.** 🚨
 