|
|
--- |
|
|
license: llama3.1 |
|
|
language: |
|
|
- el |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-generation-inference |
|
|
base_model: |
|
|
- ilsp/Llama-Krikri-8B-Base |
|
|
--- |
|
|
|
|
|
# Llama-Krikri-8B-Instruct: An Instruction-tuned Large Language Model for the Greek language |
|
|
|
|
|
Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on March 26th, 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
|
|
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present **Llama-Krikri-8B-Instruct**, along with the base model, [Llama-Krikri-8B-Base](https://huggingface.co/ilsp/Llama-Krikri-8B-Base). |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
# Model Information |
|
|
|
|
|
- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (see the short tokenization sketch after the corpus table below)
|
|
- 128k context length (approximately 80,000 Greek words) |
|
|
- We extended the pretraining of Llama-3.1-8B to add proficiency in the Greek language by utilizing a large training corpus.
|
|
* This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources. |
|
|
* Additionally, to mitigate catastrophic forgetting and ensure that the model retains bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
|
|
* The training corpus also contains 7.8 billion math and code tokens. |
|
|
* This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below: |
|
|
|
|
|
|
|
|
| Sub-corpus | # Tokens | Percentage | |
|
|
|-----------|------------------|------------| |
|
|
| Greek | 56.7 B | 62.3 % | |
|
|
| English | 21.0 B | 23.1 % | |
|
|
| Parallel | 5.5 B | 6.0 % | |
|
|
| Math/Code | 7.8 B | 8.6 % | |
|
|
| **Total** | 91 B | **100%** | |
|
|
|
|
|
|
|
|
Chosen subsets of the 91 billion token corpus were upsampled, resulting in a total size of **110 billion tokens**.
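The effect of the vocabulary extension is easiest to see by tokenizing the same Greek sentence with both the original and the extended tokenizer. The snippet below is an illustrative sketch, not part of the official documentation; it uses an arbitrary example sentence and assumes you have been granted access to the gated `meta-llama/Llama-3.1-8B` repository.

```python
from transformers import AutoTokenizer

greek_text = "Η τεχνητή νοημοσύνη αλλάζει τον τρόπο που δουλεύουμε και επικοινωνούμε."

# Extended Krikri tokenizer (Llama-3.1 vocabulary plus added Greek tokens)
krikri_tok = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
# Original Llama-3.1 tokenizer (gated repository; access must be granted)
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Fewer tokens per Greek word are expected with the extended vocabulary
print("Krikri tokens:   ", len(krikri_tok(greek_text)["input_ids"]))
print("Llama-3.1 tokens:", len(llama_tok(greek_text)["input_ids"]))
```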
|
|
|
|
|
🚨 **More information on the post-training corpus and methodology coming soon.** 🚨
|
|
|
|
|
|
|
|
# How to use |
|
|
|
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the instruction-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")

model.to(device)

# System prompt (in Greek): "You are Krikri, a highly advanced Artificial Intelligence model
# for Greek, and you were trained by ILSP of Athena Research Center."
system_prompt = "Είσαι το Κρικρί, ένα εξαιρετικά ανεπτυγμένο μοντέλο Τεχνητής Νοημοσύνης για τα ελληνικα και εκπαιδεύτηκες από το ΙΕΛ του Ε.Κ. \"Αθηνά\"."
# User prompt (in Greek): "How does a kri-kri differ from a llama?"
user_prompt = "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
# Format the conversation with the model's chat template, then tokenize and generate
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(input_prompt['input_ids'], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```
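Note that, depending on your `transformers` version, the snippet above may load the weights in full float32 precision. A common memory-saving variant (a sketch, assuming `accelerate` is installed, and not part of the original instructions) is to request bfloat16 explicitly and let the weights be placed automatically:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in bfloat16 and let accelerate map the weights onto the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
```

With `device_map="auto"` the explicit `model.to(device)` call is no longer needed; inputs can be moved to `model.device` instead of a hard-coded `"cuda"` string.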
|
|
|
|
|
# How to serve with an OpenAI-compatible server via vLLM
|
|
|
|
|
```bash
# --enforce-eager disables CUDA graph capture for a simpler startup;
# --dtype sets the model precision and --api-key the key clients must send
vllm serve ilsp/Llama-Krikri-8B-Instruct \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```
|
|
|
|
|
The model can then be queried from Python using the OpenAI client:
|
|
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

# Point the OpenAI client at the local vLLM server
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# System prompt (in Greek): "You are an advanced translation system that answers with Python lists."
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python."
# User prompt (in Greek): "Give me the following list with each of its strings translated into Greek: [...]"
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
)
print(response.choices[0].message.content)
```
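Because the vLLM endpoint is OpenAI-compatible, standard client features work as well; for instance, the response can be streamed token by token. The sketch below reuses the `client` and `messages` objects defined above:

```python
# Stream the completion chunk by chunk instead of waiting for the full response
stream = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```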
|
|
|
|
|
# Evaluation |
|
|
|
|
|
🚨 **Instruction following and chat capability evaluation benchmarks coming soon.** 🚨 |
|
|
|
|
|
# Acknowledgements |
|
|
|
|
|
The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community. |