---
license: llama3.1
language:
- el
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
base_model:
- ilsp/Llama-Krikri-8B-Base
---

# Llama-Krikri-8B-Instruct: An Instruction-tuned Large Language Model for the Greek language

Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present **Llama-Krikri-8B-Instruct**, along with the base model, [Llama-Krikri-8B-Base](https://huggingface.co/ilsp/Llama-Krikri-8B-Base).

![image/png](llama-krikri-image.jpg)


# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (approximately 80,000 Greek words)
- We extend the pretraining of Llama-3.1-8B to add proficiency for the Greek language, utilizing a large training corpus.
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:


| Sub-corpus | # Tokens   | Percentage |
|------------|------------|------------|
| Greek      | 56.7 B     | 62.3 %     |
| English    | 21.0 B     | 23.1 %     |
| Parallel   | 5.5 B      | 6.0 %      |
| Math/Code  | 7.8 B      | 8.6 %      |
| **Total**  | **91.0 B** | **100%**   |


Chosen subsets of the 91-billion-token corpus were upsampled, resulting in a final size of **110 billion tokens**.
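
To make the effect of the vocabulary extension concrete, the minimal sketch below compares how many tokens the original Llama-3.1 tokenizer and the extended Krikri tokenizer need for the same Greek sentence; fewer tokens per word is what makes the 128k context window correspond to roughly 80,000 Greek words. The example sentence is arbitrary, exact counts will vary with the text, and the gated meta-llama/Llama-3.1-8B repository requires approved access.

```python
from transformers import AutoTokenizer

# Illustrative sentence: "The Acropolis of Athens is a rocky hill in the center of the city."
greek_text = "Η Ακρόπολη των Αθηνών είναι ένας βραχώδης λόφος στο κέντρο της πόλης."

# Extended tokenizer with added Greek tokens (this model card).
krikri_tok = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
# Original tokenizer; the meta-llama repository is gated and requires access.
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

for name, tok in [("Llama-3.1", llama_tok), ("Krikri", krikri_tok)]:
    n_tokens = len(tok.encode(greek_text, add_special_tokens=False))
    print(f"{name}: {n_tokens} tokens for {len(greek_text.split())} words")
```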

🚨 **More information on the post-training corpus and methodology coming soon.** 🚨


# How to use


```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")

model.to(device)

# "You are Krikri, a highly advanced Artificial Intelligence model for the Greek
# language, and you were trained by ILSP of Athena Research Center."
system_prompt = "Είσαι το Κρικρί, ένα εξαιρετικά ανεπτυγμένο μοντέλο Τεχνητής Νοημοσύνης για τα ελληνικά και εκπαιδεύτηκες από το ΙΕΛ του Ε.Κ. \"Αθηνά\"."
# "How does a kri-kri differ from a llama?"
user_prompt = "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
# Render the chat template as text, then tokenize it. add_special_tokens=False
# avoids prepending a second BOS token, since the template already includes one.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
# Passing the attention mask along with input_ids silences generation warnings.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```
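
On memory-constrained GPUs, the model can be loaded in half precision, which roughly halves its footprint; since the vLLM command below serves it in bfloat16, the same dtype is a reasonable assumption here as well (a sketch, not an official recommendation):

```python
import torch
from transformers import AutoModelForCausalLM

# bfloat16 weights take ~2 bytes per parameter (~16 GB for 8B parameters).
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```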

# How to serve with an OpenAI-compatible server via vLLM

```bash
vllm serve ilsp/Llama-Krikri-8B-Instruct \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```

The server can then be queried from Python with the OpenAI client:
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# "You are an advanced translation system that answers with Python lists."
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python."
# "Give me the following list with each of its strings translated into Greek: [...]"
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
)
print(response.choices[0].message.content)
```
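
For longer generations, the same endpoint supports token streaming through the standard OpenAI client interface. A minimal sketch, reusing the `client` and `messages` defined above:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant message.
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```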

# Evaluation

🚨 **Instruction following and chat capability evaluation benchmarks coming soon.** 🚨

# Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.