Update README.md
Add info on evaluation
README.md
CHANGED
@@ -63,25 +63,53 @@ Llama-Krikri-8B-Instruct is the result of post-training Llama-Krikri-8B-Base and
- Conversion or structured extraction (e.g., XML, JSON) in data-to-text & text-to-data settings.
- Analytical thinking and Chain-of-Thought (CoT) reasoning for problem-solving.

## Post-training Methodology

We used a multi-stage process to build Llama-Krikri-8B-Instruct, which includes:
- 2-stage Supervised Fine-Tuning with a combination of Greek & English instruction-response pairs (& multi-turn conversations)
  - **Stage 1**: **856,946** instruction-response pairs (371,379 Greek + 485,567 English)
  - **Stage 2**: **638,408** instruction-response pairs (279,948 Greek + 358,460 English)
- Alignment with a combination of Greek & English preference triplets (Instruction - Chosen Response - Rejected Response)
  - **Length Normalized DPO**: **92,394** preference triplets (47,132 Greek + 45,262 English); a brief sketch of this objective appears after this list
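
The exact training recipe has not been published yet; as a rough illustration of the alignment stage, here is a minimal sketch of a length-normalized DPO loss in PyTorch. It assumes per-sequence summed log-probabilities and token counts have already been computed for the chosen and rejected responses; all tensor names and the `beta` value are illustrative, not the actual training configuration:

```python
import torch
import torch.nn.functional as F

def length_normalized_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # summed log-probs of chosen responses under the policy
    policy_rejected_logps: torch.Tensor,  # summed log-probs of rejected responses under the policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,         # response lengths in tokens
    rejected_lengths: torch.Tensor,
    beta: float = 0.1,                    # illustrative value, not the actual hyperparameter
) -> torch.Tensor:
    # Vanilla DPO uses summed log-probabilities, which implicitly favors
    # longer responses; length normalization averages them per token.
    chosen_logratio = (policy_chosen_logps - ref_chosen_logps) / chosen_lengths
    rejected_logratio = (policy_rejected_logps - ref_rejected_logps) / rejected_lengths
    # Bradley-Terry style logistic loss on the length-normalized margin.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```
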
## Post-training Data Construction

To build the SFT & DPO data, we utilized various methodologies, including:
- Collecting existing high-quality datasets such as [Tulu 3](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture), [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk), [MAGPIE Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0), [Orca Agent Instruct](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1), [IFEval Like Data](https://huggingface.co/datasets/argilla/ifeval-like-data), [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), [NVIDIA HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), [Intel Orca](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs), [UltraMedical](https://huggingface.co/datasets/TsinghuaC3I/UltraMedical-Preference), and other datasets focused on safety, truthfulness, and instruction-following.
- Translating various data into Greek using an in-house translation tool.
- Regenerating translated data and contrasting the translated with the regenerated responses (i.e., for creating preference triplets).
- Distilling (with the MAGPIE methodology) models which exhibit strong performance in Greek, such as [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it).
- Scoring data with the [Skywork Reward Gemma 2 27B v0.2](https://huggingface.co/Skywork/Skywork-Reward-Gemma-2-27B-v0.2) Reward Model and filtering using rule-based filters (a short sketch of this scoring step follows the list).
- Creating data for sentence and document translation using high-quality parallel corpora mainly from [ELRC-SHARE](https://elrc-share.eu/).
- Synthetically extracting question-answer pairs and multi-turn dialogues from diverse sources such as Wikipedia, EUR-LEX, Greek School Books, and Kallipos.
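
To make the reward-scoring step concrete, below is a minimal sketch of scoring a preference pair with the Skywork reward model via `transformers`, following the usage pattern from its model card. The example prompt and the 0.5 margin threshold are hypothetical illustrations, not our actual filtering rules:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Skywork/Skywork-Reward-Gemma-2-27B-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)

def reward(prompt: str, response: str) -> float:
    # The reward model assigns a scalar score to a full conversation.
    conv = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(
        conv, tokenize=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()

prompt = "Ποια είναι η πρωτεύουσα της Ελλάδας;"
chosen, rejected = "Η πρωτεύουσα της Ελλάδας είναι η Αθήνα.", "Δεν γνωρίζω."
# Hypothetical rule-based filter: keep the preference triplet only if
# the chosen response clearly outscores the rejected one.
keep = reward(prompt, chosen) - reward(prompt, rejected) > 0.5
```
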
# Evaluation

In the table below, we report the scores for [Greek IFEval](https://huggingface.co/datasets/ilsp/ifeval_greek) (strict) and [English IFEval](https://huggingface.co/datasets/google/IFEval) (strict) for various chat models that exhibit strong performance; a short sketch after the table illustrates what the strict metric measures.

We can observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance* in instruction following for both Greek and English across all the models we tested. In particular, it surpasses Llama-3.1-8B-Instruct by **+21.7** and **+7.3** percentage points on the Greek and English IFEval, respectively.

| | IFEval EL (strict) | IFEval EN (strict) |
|----------------|----------------|-----------------|
| Qwen 2.5 7B Instruct | 46.2% | 74.8% |
| EuroLLM 9B Instruct | 51.3% | 64.5% |
| Aya Expanse 8B | 50.4% | 62.2% |
| Meltemi 7B v1.5 Instruct | 32.7% | 41.2% |
| Llama-3.1-8B Instruct | 45.8% | 75.1% |
| Llama-Krikri-8B Instruct | **67.5%** | **82.4%** |
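
For context on what these numbers measure, the sketch below illustrates IFEval's prompt-level strict accuracy as we understand it: a response counts only if every verifiable instruction attached to its prompt passes, with no relaxations such as stripping markdown. The `checks` callables stand in for IFEval's instruction verifiers and are purely illustrative:

```python
from typing import Callable, Iterable

def prompt_level_strict_accuracy(
    examples: Iterable[tuple[str, list[Callable[[str], bool]]]],
) -> float:
    """Each example pairs a model response with the verifiable checks for
    its prompt (e.g., 'contains at least 3 bullet points', 'is entirely
    lowercase'). Strict scoring applies the checks to the raw response,
    without any normalization, and all of them must pass."""
    examples = list(examples)
    passed = sum(all(check(response) for check in checks)
                 for response, checks in examples)
    return passed / len(examples)

# Toy usage: one prompt requiring a lowercase answer with no commas.
demo = [("all good here", [str.islower, lambda r: "," not in r])]
print(prompt_level_strict_accuracy(demo))  # 1.0
```
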
We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek).

Below, we show the scores on the Greek version of Arena-Hard-Auto for various open and closed chat models, determined using **gpt-4o-2024-08-06 as the judge model** and **gpt-4o-mini-2024-07-18 as the baseline model** (which scores 50% by construction).

![image/png]()
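
As a simplified picture of how such scores arise (the real Arena-Hard-Auto pipeline also swaps answer order to reduce position bias, uses a graded verdict scale, and fits a Bradley-Terry model with bootstrapped confidence intervals), the sketch below shows why the baseline model lands at 50% by construction:

```python
def win_rate(verdicts: list[str]) -> float:
    """verdicts: the judge's outcome for each (model vs. baseline)
    comparison, from the evaluated model's perspective. Ties are worth
    half a win, so judging the baseline against itself yields 50%."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100 * sum(points[v] for v in verdicts) / len(verdicts)

print(win_rate(["tie"] * 10))                   # 50.0 -> baseline vs. itself
print(win_rate(["win", "win", "tie", "loss"]))  # 62.5
```
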

**Please note** that [recent research](https://arxiv.org/pdf/2502.01534) has shown that judge models are biased towards student models, i.e., models fine-tuned on data distilled from a stronger/larger teacher model. While the post-training data of GPT-4o-Mini are undisclosed, it is reasonable to assume that it was trained, at least in part, with GPT-4o serving as the teacher model, and therefore that the **judge is biased towards the baseline model**.

Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology of using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.

![image/png]()

🚨 **More information on post-training, methodology, and evaluation coming soon.** 🚨

# How to use