Update README.md

README.md CHANGED

@@ -12,90 +12,18 @@ pipeline_tag: text-generation

# Meltemi: A large foundation Language Model for the Greek language

We introduce Meltemi, the first Greek Large Language Model (LLM) trained by the Institute for Language and Speech Processing at Athena Research & Innovation Center.
Meltemi is built on top of Mistral-7b.
Additionally, in the near future, we will also release a Mixture-of-Experts foundation model (MeltemiX-8x7b), as well as chat models based on real chats with human feedback.

The training was performed on AWS infrastructure thanks to a GRNET grant.
We release two models trained with 8k context length: Meltemi-7b-v1 (INSERT HF LINK) and Meltemi-Instruct-7b-v1 (INSERT HF LINK) under the [Apache 2.0 License](https://github.com/apache/.github/blob/main/LICENSE).
To assess the capabilities of Meltemi, we constructed a standardized LLM evaluation suite for the Greek language, integrated with [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness).

# Continual pretraining

The original version of Mistral-7b was trained on a large corpus of English text; the corpus for the publicly released versions is estimated to contain approximately 800 billion tokens.
We extend the pretraining of Mistral-7b to add proficiency for the Greek language, utilizing a large corpus of approximately **40 billion tokens**.

This corpus includes 28.5 billion monolingual Greek tokens constructed from publicly available resources. Additionally, to mitigate catastrophic forgetting and to ensure that the model retains bilingual capabilities, we use additional sub-corpora with monolingual English texts (10.5 billion tokens) and Greek-English parallel data (600 million tokens).

This corpus has been processed, filtered, and deduplicated to ensure data quality (a detailed description of our data processing pipeline will be published in our upcoming paper) and is outlined below:

<br/>
<br/>

Table 1: Pretraining Corpora

| Sub-corpus | # Tokens | Percentage |
|------------|--------------------|------------|
| Greek      | 28,555,902,360     | 72.0% |
| English    | 10,478,414,033     | 26.4% |
| Parallel   | 633,816,023        | 1.6% |
| **Total**  | **39,668,132,416** | **100%** |

<br/>
<br/>

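As a concrete illustration of the mixture in Table 1 (not the actual data pipeline), the three sub-corpora could be sampled with Hugging Face `datasets` using the table's shares as sampling probabilities; the `Dataset` objects below are placeholders standing in for the real preprocessed corpora.

```python
# Illustrative only: interleave the three sub-corpora according to the Table 1 shares.
from datasets import Dataset, interleave_datasets

# Placeholder single-example datasets; the real corpora are much larger.
greek = Dataset.from_dict({"text": ["Ελληνικό κείμενο ..."]})
english = Dataset.from_dict({"text": ["English text ..."]})
parallel = Dataset.from_dict({"text": ["Ελληνικά || English ..."]})

mixed = interleave_datasets(
    [greek, english, parallel],
    probabilities=[0.720, 0.264, 0.016],  # Greek / English / parallel shares from Table 1
    seed=42,
    stopping_strategy="all_exhausted",
)
```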

Our pretraining procedure draws on insights from works focused on continual pretraining for adapting English models to a non-Latin-script language (Chinese), such as [Fast and efficient pretraining](https://arxiv.org/pdf/2304.08177.pdf) and [ColossalAI](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base).
Our pretraining strategy consists of the following three stages:

1. Vocabulary extension of the Mistral-7b tokenizer with Greek tokens
2. Greek embedding initialization and fine-tuning on 10% of the corpus (all other model parameters are kept frozen)
3. Continual pretraining of the whole model using the full corpus
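
The sketch below illustrates stages 1 and 2 under stated assumptions (a tiny sample of Greek tokens, mean-of-subword initialization of the new embedding rows); it is a minimal approximation, not the released training code.

```python
# Minimal sketch of stages 1-2 (assumed details; not the released code):
# extend the Mistral-7b tokenizer with Greek tokens, initialize each new
# embedding row from the mean of its old sub-token embeddings, then train
# only the embeddings (and LM head) on ~10% of the corpus.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
base_tok = AutoTokenizer.from_pretrained(base)  # unmodified copy for sub-token lookup
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Placeholder sample of new Greek tokens (the real list would come from a
# tokenizer trained on the Greek corpus).
greek_tokens = ["καλημέρα", "γλώσσα", "μοντέλο"]
old_vocab_size = model.get_input_embeddings().weight.shape[0]
tokenizer.add_tokens(greek_tokens)
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    head = model.get_output_embeddings().weight
    for i, tok in enumerate(greek_tokens):
        # Mean of the sub-token embeddings the base tokenizer would have produced.
        sub_ids = base_tok(" " + tok, add_special_tokens=False)["input_ids"]
        emb[old_vocab_size + i] = emb[sub_ids].mean(dim=0)
        head[old_vocab_size + i] = head[sub_ids].mean(dim=0)

# Stage 2: freeze all parameters except the input embeddings and LM head.
for name, param in model.named_parameters():
    param.requires_grad = name in {"model.embed_tokens.weight", "lm_head.weight"}
```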

We use the following hyperparameters and training settings for the continual pretraining stage:

<br/>

Table 2: Training settings

| Training setting | Value |
|---------------------|------------------------|
| Training steps      | 25340 |
| Warmup steps        | 253 |
| Batch size          | 512 |
| Context length      | 8192 |
| Optimizer           | AdamW |
| Learning rate       | 2.5e-5 |
| Learning rate decay | Cosine down to 2.5e-6 |
| Adam β              | (0.9, 0.95) |
| Weight decay        | 0.0 |
| DeepSpeed           | ZeRO Stage-2 |
| Precision           | BF16 |
| GPUs                | 8 x NVIDIA H100 (80GB) |
| Energy footprint    | 2300 kWh |
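
For illustration, the optimizer and learning-rate schedule in Table 2 correspond roughly to the following plain-PyTorch sketch (linear warmup for 253 steps, then cosine decay from 2.5e-5 down to 2.5e-6 over 25340 steps); this is not the DeepSpeed training script, and the stand-in model only exists so the snippet runs.

```python
# Sketch of the Table 2 optimizer and LR schedule (illustrative only).
import math
import torch

def lr_lambda(step, warmup_steps=253, total_steps=25340, peak_lr=2.5e-5, min_lr=2.5e-6):
    """Return the multiplier of peak_lr to apply at the given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr  # cosine decay to min_lr

model = torch.nn.Linear(8, 8)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5, betas=(0.9, 0.95), weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```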

<br/>
<br/>

# Supervised fine-tuning

To create Meltemi-Instruct-7b, we utilize approximately 100k Greek instructions, which include machine-translated versions of existing single-turn and multi-turn conversation datasets. In particular, we used the following:

* [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) (only subsets with permissive licenses)
* [Evol-Instruct](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)
* [Capybara](https://huggingface.co/datasets/LDJnr/Capybara)
* A manually created Greek dataset with multi-turn examples steering the instruction-tuned model towards safe and harmless responses

The model is trained on the resulting instructions using full Supervised Fine-Tuning (SFT). Our SFT procedure is based on the [finetuning recipes](https://github.com/huggingface/alignment-handbook) provided by Hugging Face. We are extending and improving the instruction tuning dataset to enhance the model's chat and translation capabilities.
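
As a rough picture of this step, the following is a simplified causal-LM SFT sketch using the `transformers` Trainer; the in-memory dataset, formatting, and hyperparameters are placeholders, and the actual procedure follows the alignment-handbook recipes rather than this loop.

```python
# Simplified SFT sketch (placeholder data, formatting, and hyperparameters).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # in practice, the continually pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# Tiny stand-in for the ~100k translated Greek instructions.
examples = [
    {"prompt": "Τι είναι η συνεχής προεκπαίδευση;",
     "response": "Είναι η περαιτέρω προεκπαίδευση ενός ήδη εκπαιδευμένου μοντέλου σε νέο σώμα κειμένων."},
]

def to_features(ex):
    # Simple single-turn formatting; a real recipe would apply a chat template.
    text = f"{ex['prompt']}\n\n{ex['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=1024)

train_ds = Dataset.from_list(examples).map(to_features, remove_columns=["prompt", "response"])

model = AutoModelForCausalLM.from_pretrained(base)
args = TrainingArguments(output_dir="sft-demo", per_device_train_batch_size=1,
                         num_train_epochs=1, learning_rate=2e-5, bf16=True)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```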

# Evaluation

@@ -107,12 +35,7 @@ Our evaluation suite includes:

* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
* A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

Our evaluation for Meltemi-7b is performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training improves performance across all Greek test sets, with an average improvement of **+14.9%**. The results for the Greek test sets are shown in Table 3:

<br/>
<br/>

Table 3: Evaluation of Meltemi-7b on the Greek LLM benchmark

| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B | 41.0% | 63.6% | 61.6% | 43.2% | 52.1% | 47.0% | 51.4% |

<br/>
<br/>

![Comparison of Meltemi-7b and Mistral-7b on Greek test sets](./meltemi-mistral.png)

Figure 1: Comparison of Meltemi-7b and Mistral-7b on Greek test sets

<br/>
<br/>

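To illustrate how such a few-shot evaluation can be run, here is a sketch using the lm-evaluation-harness (v0.4+) Python API. The Greek task names are placeholders for the tasks defined in the lm-evaluation-harness fork mentioned under "Code availability" below, and the repository id is an assumption.

```python
# Sketch of the few-shot evaluation protocol; task names and repo id are placeholders.
import lm_eval

model_args = "pretrained=ilsp/Meltemi-7B-v1,dtype=bfloat16"  # assumed repo id

# (task name, few-shot setting) pairs mirroring Table 3
greek_tasks = [
    ("medical_mcqa_el", 15),   # hypothetical task name
    ("belebele_el", 5),        # hypothetical task name
    ("hellaswag_el", 10),      # hypothetical task name
    ("arc_challenge_el", 25),  # hypothetical task name
    ("truthfulqa_mc2_el", 0),  # hypothetical task name
    ("mmlu_el", 5),            # hypothetical task name
]

for task, shots in greek_tasks:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    print(task, out["results"][task])
```
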
# Try it yourself

You can try the released models yourself at the [following link]().
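
As a minimal quick-start sketch, the base model can also be loaded locally with `transformers` roughly as follows; the repository id below is an assumption, not an official link.

```python
# Quick-start sketch for loading a released checkpoint (assumed repo id).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ilsp/Meltemi-7B-v1"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Η Ακρόπολη βρίσκεται"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```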

# Code availability

All the training and fine-tuning scripts, as well as our lm-evaluation-harness fork, will be made publicly available under a permissive license.

# Acknowledgements

The ILSP team utilized Amazon’s cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.

* Data acquisition and curation: Dimitris Roussis, Leon Voukoutis, Prokopis Prokopidis, Vassilis Papavassiliou
* Model training: Leon Voukoutis, Dimitris Roussis
* Model evaluation: Prokopis Prokopidis, Dimitris Roussis, Leon Voukoutis
* Infrastructure: Sokratis Sofianopoulos, George Paraskevopoulos
* Technical supervision: Nassos Katsamanis, Stelios Piperidis, Sokratis Sofianopoulos, George Paraskevopoulos

Special thanks to Sotiris Kotitsas, Petros Stavropoulos, Dimitris Pappas, and Dimitris Galanis for their input during the design and development process, to Olga Yannoutsou for her help with the translation of one of the evaluation datasets, and to all members of ILSP who participated in the internal evaluation.

# Meltemi: A large foundation Language Model for the Greek language

We introduce Meltemi, the first Greek Large Language Model (LLM) trained by the Institute for Language and Speech Processing at Athena Research & Innovation Center.
Meltemi is built on top of [Mistral-7b](https://huggingface.co/mistralai/Mistral-7B-v0.1), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Meltemi-7B-Instruct-v1, an instruct fine-tuned version of Meltemi-7B-v1.

# Model Information

- Vocabulary extension of the Mistral-7b tokenizer with Greek tokens
- Trained with 8k context length
- Fine-tuned with 100k Greek machine-translated instructions extracted from:
    * [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) (only subsets with permissive licenses)
    * [Evol-Instruct](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)
    * [Capybara](https://huggingface.co/datasets/LDJnr/Capybara)
    * A hand-crafted Greek dataset with multi-turn examples steering the instruction-tuned model towards safe and harmless responses
- Our SFT procedure is based on the [Hugging Face finetuning recipes](https://github.com/huggingface/alignment-handbook)
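
Given the instruction-tuning setup listed above, a hypothetical usage sketch for the instruct model is shown below; the repository id is assumed, and the prompt format simply relies on whatever chat template ships with the tokenizer (`apply_chat_template` will fail if none is bundled).

```python
# Hypothetical usage sketch for the instruct model (assumed repo id).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ilsp/Meltemi-7B-Instruct-v1"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Πες μου ένα σύντομο ποίημα για το Αιγαίο."},
]
# Use the chat template bundled with the tokenizer, if any.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```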
# Evaluation

Our evaluation suite includes:

* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
* A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

Our evaluation for Meltemi-7b is performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training improves performance across all Greek test sets, with an average improvement of **+14.9%**. The results for the Greek test sets are shown in the following table:

| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B | 41.0% | 63.6% | 61.6% | 43.2% | 52.1% | 47.0% | 51.4% |

# Acknowledgements

The ILSP team utilized Amazon’s cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.

# Ethical Considerations

This model has not been aligned with human preferences, and therefore might generate misleading, harmful, and toxic content.