---
license: llama3.1
language:
- el
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
base_model:
- ilsp/Llama-Krikri-8B-Base
---
🚨 **PLEASE USE THE OFFICIAL QUANTIZED VERSIONS: [GGUF](https://huggingface.co/ilsp/Llama-Krikri-8B-Instruct-GGUF) OR REQUEST A SPECIFIC ONE** 🚨
🚨 *There is no guarantee that you are using the latest improved versions from 3rd party quantizations as we have updated the model's weights* 🚨
# Llama-Krikri-8B-Instruct: An Instruction-tuned Large Language Model for the Greek language
<div align="center">
<img src="https://huggingface.co/ilsp/Llama-Krikri-8B-Instruct/resolve/main/KriKri_Logo-eng_54307d80-ee25-49f9-9204-0ce774499fbc.svg?raw=true" width="60%" alt="Krikri" />
</div>
Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present **Llama-Krikri-8B-Instruct**, along with the base model, [Llama-Krikri-8B-Base](https://huggingface.co/ilsp/Llama-Krikri-8B-Base).
<!--  -->
# Model Information
## Base Model
- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (approximately 80,000 Greek words)
- We continue the pretraining of Llama-3.1-8B on a large training corpus to add proficiency in the Greek language.
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:
| Sub-corpus | # Tokens | Percentage |
|-----------|------------------|------------|
| Greek | 56.7 B | 62.3 % |
| English | 21.0 B | 23.1 % |
| Parallel | 5.5 B | 6.0 % |
| Math/Code | 7.8 B | 8.6 % |
| **Total** | 91 B | **100%** |
Selected subsets of the 91-billion-token corpus were upsampled, resulting in a final training size of **110 billion tokens**.
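As a quick sanity check on the figures above, the following sketch recomputes each sub-corpus share from the token counts in the table and the overall ratio between the upsampled size and the deduplicated corpus. The variable names are illustrative only, and the ratio is an average over the whole corpus: the card does not say which subsets were upsampled or by how much.

```python
# Token counts (in billions) copied from the table above.
tokens_b = {"Greek": 56.7, "English": 21.0, "Parallel": 5.5, "Math/Code": 7.8}

total_b = sum(tokens_b.values())
shares = {name: round(100 * count / total_b, 1) for name, count in tokens_b.items()}

# Ratio of the upsampled training size (110 B) to the deduplicated corpus size.
upsampling_ratio = 110 / total_b

for name, pct in shares.items():
    print(f"{name:<10} {tokens_b[name]:>5.1f} B  {pct:>5.1f} %")
print(f"{'Total':<10} {total_b:>5.1f} B")
print(f"Average upsampling ratio: {upsampling_ratio:.2f}x")
```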
## Instruct Model
Llama-Krikri-8B-Instruct is the result of post-training Llama-Krikri-8B-Base and features:
- Enhanced chat capabilities and instruction-following in both Greek and English.
- Document translation from Greek to English, French, German, Italian, Portuguese, Spanish and vice versa.
- Great performance on generation, comprehension, and editing tasks, such as summarization, creative content creation, text modification, entity recognition, sentiment analysis, etc.
- Domain-specific expertise for specialized legal, financial, medical, and scientific applications.
- Retrieval-Augmented Generation (RAG) utilizing multiple documents with 128k context length.
- Improved coding and agentic capabilities with correct formatting and tool use.
- Conversion or structured extraction (e.g., XML, JSON) in data-to-text & text-to-data settings.
- Analytical thinking and Chain-of-Thought (CoT) reasoning for problem-solving.
## Post-training Methodology
We built Llama-Krikri-8B-Instruct with a multi-stage process that includes:
- 2-stage Supervised Fine-Tuning with a combination of Greek & English instruction-response pairs (& multi-turn conversations)
  - **Stage 1**: **856,946** instruction-response pairs (371,379 Greek + 485,567 English)
  - **Stage 2**: **638,408** instruction-response pairs (279,948 Greek + 358,460 English)
- Alignment with a combination of Greek & English preference triplets (Instruction - Chosen Response - Rejected Response)
  - **Length Normalized DPO**: **92,394** preference triplets (47,132 Greek + 45,262 English)
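The card names Length Normalized DPO but gives neither the exact objective nor its hyperparameters. The sketch below assumes the commonly used formulation, in which each response's policy-versus-reference log-probability difference is divided by its token count before entering the usual DPO logistic loss. The function name, argument names, and `beta=1.0` are placeholders, not values taken from the Krikri training setup.

```python
import math


def ln_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                ref_chosen_logp, ref_rejected_logp,
                chosen_len, rejected_len, beta=1.0):
    """Length-normalized DPO loss for one preference triplet (assumed form).

    The *_logp arguments are summed token log-probabilities of a response;
    the *_len arguments are the corresponding response lengths in tokens.
    """
    # Per-token log-ratio between the policy and the frozen reference model.
    chosen_ratio = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_ratio = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy numbers: the policy prefers the chosen response more than the reference does.
print(ln_dpo_loss(-40.0, -90.0, -50.0, -80.0, chosen_len=20, rejected_len=40))
```

Dividing by length keeps long responses from dominating the margin simply because they accumulate more log-probability mass.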
## Post-training Data Construction
To build the SFT & DPO data, we utilized various methodologies including:
- Collecting existing high-quality datasets such as [Tulu 3](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture), [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk), [MAGPIE Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0), [Orca Agent Instruct](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1), [IFEval Like Data](https://huggingface.co/datasets/argilla/ifeval-like-data), [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), [NVIDIA HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), [Intel Orca](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs), [UltraMedical](https://huggingface.co/datasets/TsinghuaC3I/UltraMedical-Preference), and other datasets focused on safety, truthfulness, and instruction-following.
- Translating various data into Greek using an in-house translation tool.
- Regenerating translated data and contrasting the translated with the regenerated responses (i.e., for creating preference triplets).
- Distilling (with the MAGPIE methodology) models which exhibit strong performance in Greek, such as [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it).
- Scoring data with the [Skywork Reward Gemma 2 27B v0.2](https://huggingface.co/Skywork/Skywork-Reward-Gemma-2-27B-v0.2) Reward Model and filtering using rule-based filters.
- Creating data for sentence and document translation using high-quality parallel corpora mainly from [ELRC-SHARE](https://elrc-share.eu/).
- Synthetically extracting question-answer pairs and multi-turn dialogues from diverse sources such as Wikipedia, EUR-LEX, Greek School Books, and Kallipos.
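To illustrate how contrasting a translated response with a regenerated one could yield a preference triplet, here is a minimal sketch. It assumes that both responses are scored by a reward model and that near-ties are dropped; the card does not say how the two steps were actually combined. `Triplet`, `build_triplet`, `score_fn`, and `min_margin` are hypothetical names, and `score_fn` stands in for any reward model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Triplet:
    instruction: str
    chosen: str
    rejected: str


def build_triplet(instruction: str, translated: str, regenerated: str,
                  score_fn: Callable[[str, str], float],
                  min_margin: float = 0.0) -> Optional[Triplet]:
    """Return a preference triplet, or None when the scores are too close."""
    translated_score = score_fn(instruction, translated)
    regenerated_score = score_fn(instruction, regenerated)
    # Rule-based filter (assumed): skip pairs without a clear winner.
    if abs(translated_score - regenerated_score) <= min_margin:
        return None
    if translated_score > regenerated_score:
        return Triplet(instruction, chosen=translated, rejected=regenerated)
    return Triplet(instruction, chosen=regenerated, rejected=translated)


# Toy scorer used only for demonstration: longer answers score higher.
def toy_score(instruction: str, response: str) -> float:
    return float(len(response))


print(build_triplet("Τι είναι το κρικρί;", "Ένα ζώο.",
                    "Το κρικρί είναι αγριοκάτσικο της Κρήτης.", toy_score))
```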
# Evaluation
In the table below, we report the scores for [Greek IFEval](https://huggingface.co/datasets/ilsp/ifeval_greek) (strict) and [English IFEval](https://huggingface.co/datasets/google/IFEval) (strict) for various chat models that exhibit strong performance.
We can observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance* in instruction following for both Greek and English across all the models we tested. In particular, it surpasses Llama-3.1-8B-Instruct by **+21.7** and **+7.3** percentage points on the Greek and English IFEval respectively.
| | IFEval EL (strict) | IFEval EN (strict) |
|---------------- |---------------- |-----------------|
| Qwen 2.5 7B Instruct | 46.2% | 74.8% |
| EuroLLM 9B Instruct | 51.3% | 64.5% |
| Aya Expanse 8B | 50.4% | 62.2% |
| Meltemi 7B v1.5 Instruct | 32.7% | 41.2% |
| Llama-3.1-8B Instruct | 45.8% | 75.1% |
| Llama-Krikri-8B Instruct | **67.5%** | **82.4%** |
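The margins quoted above are absolute differences in percentage points. The short sketch below reproduces them from the table; the dictionary layout and variable names are illustrative only.

```python
# (IFEval EL strict, IFEval EN strict) scores in percent, copied from the table.
scores = {
    "Qwen 2.5 7B Instruct": (46.2, 74.8),
    "EuroLLM 9B Instruct": (51.3, 64.5),
    "Aya Expanse 8B": (50.4, 62.2),
    "Meltemi 7B v1.5 Instruct": (32.7, 41.2),
    "Llama-3.1-8B Instruct": (45.8, 75.1),
    "Llama-Krikri-8B Instruct": (67.5, 82.4),
}

krikri = scores["Llama-Krikri-8B Instruct"]
llama = scores["Llama-3.1-8B Instruct"]

# Absolute gains over Llama-3.1-8B Instruct, in percentage points.
gain_el = round(krikri[0] - llama[0], 1)
gain_en = round(krikri[1] - llama[1], 1)

# Best model per language according to the table.
best_el = max(scores, key=lambda name: scores[name][0])
best_en = max(scores, key=lambda name: scores[name][1])

print(f"Gain on Greek IFEval:   +{gain_el} points (best: {best_el})")
print(f"Gain on English IFEval: +{gain_en} points (best: {best_en})")
```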
We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek).
Below, we show the scores for the Greek version of Arena-Hard-Auto for various open and closed chat models. The scores were determined using **gpt-4o-2024-08-06 as the judge model** and **gpt-4o-mini-2024-07-18 as the baseline model** (i.e., the baseline scores 50% by definition).
![Arena-Hard-Auto scores, Greek version (image not available)]()
**Please note** that [recent research](https://arxiv.org/pdf/2502.01534?) has shown that judge models are biased towards student models, i.e., models finetuned on distilled data from a stronger/larger teacher model. While the post-training data of GPT-4o-Mini are undisclosed, it is reasonable to assume that it was trained, at least partly, with GPT-4o serving as the teacher model, and therefore that the **judge is biased towards the baseline model**.
Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology of using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.
![Arena-Hard-Auto scores, original English version (image not available)]()
🚨 **More information on post-training, methodology, and evaluation coming soon.** 🚨
# How to use
## With Transformers
```python
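from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # assumes a CUDA-capable GPU is available

# Load the instruction-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
model.to(device)

# English: 'You are Krikri, a highly advanced AI model for Greek, trained by ILSP of the "Athena" Research Center.'
system_prompt = "Είσαι το Κρικρί, ένα εξαιρετικά ανεπτυγμένο μοντέλο Τεχνητής Νοημοσύνης για τα ελληνικα και εκπαιδεύτηκες από το ΙΕΛ του Ε.Κ. \"Αθηνά\"."
# English: "How does a kri-kri differ from a llama?"
user_prompt = "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# Render the conversation with the model's chat template, then tokenize it.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors='pt').to(device)

# Unpacking the encoding passes the attention mask along with the input ids.
outputs = model.generate(**input_prompt, max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])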
```
## With OpenAI compatible server via vLLM
```bash
vllm serve ilsp/Llama-Krikri-8B-Instruct \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```
Then, the model can be used through Python using:
```python
from openai import OpenAI

# Must match the --api-key and address used when starting the vLLM server.
api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# English: "You are an advanced translation system that answers with Python lists.
# You write nothing in your answers other than the translated lists."
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python. Δεν γράφεις τίποτα άλλο στις απαντήσεις σου πέρα από τις μεταφρασμένες λίστες."
# English: "Give me the following list with each of its strings translated into Greek: [...]"
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
    temperature=0.0,
    top_p=0.95,
    max_tokens=8192,
    stream=False,
)

print(response.choices[0].message.content)
# ['Ηθική καθήκοντος', 'Μεταμοντέρνα ηθική', 'Συνεπειοκρατική ηθική', 'Ωφελιμιστική ηθική', 'Δεοντολογική ηθική', 'Ηθική αρετών', 'Σχετικιστική ηθική']
```
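Because the system prompt in this example asks for a Python list as the reply, the returned string can be turned into an actual list on the client side. The sketch below uses `ast.literal_eval`, which only evaluates literals and is therefore safer than `eval` on model output. `parse_list_reply` is a hypothetical helper name, and the sketch assumes the reply contains nothing but the list; a real pipeline may need extra cleanup if the model adds surrounding text.

```python
import ast


def parse_list_reply(reply: str) -> list[str]:
    """Parse a model reply that should be a Python list of strings."""
    try:
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Reply is not a Python literal: {reply!r}") from exc
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        raise ValueError("Reply is not a list of strings")
    return value


# Reply text copied from the example output above.
reply = "['Ηθική καθήκοντος', 'Μεταμοντέρνα ηθική', 'Συνεπειοκρατική ηθική', 'Ωφελιμιστική ηθική', 'Δεοντολογική ηθική', 'Ηθική αρετών', 'Σχετικιστική ηθική']"
translations = parse_list_reply(reply)
print(len(translations), translations[0])
```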
# Acknowledgements
The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.