mfromm committed on
Commit b18e99f · verified · 1 Parent(s): b8a7fec

Update README.md

Files changed (1)
  1. README.md +67 -34
README.md CHANGED
@@ -33,7 +33,7 @@ license: apache-2.0
33
  ---
34
  # Model Card for Teuken 7B-base-v0.6
35
 
36
- Teuken 7B-base-v0.6 is a 7B parameter multilingual large language model (LLM) pre-trained with 6T tokens within the research project OpenGPT-X.
37
 
38
 
39
  ### Model Description
@@ -49,9 +49,14 @@ Teuken 7B-base-v0.6 is a 7B parameter multilingual large language model (LLM) pr
49
  ## Uses
50
 
51
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
52
- Teuken 7B-base-v0.6 is intended for commercial and research use in all official 24 European languages. Since Teuken 7B-base-v0.6
53
  focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.
54
 
55
  ### Out-of-Scope Use
56
 
57
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
@@ -62,19 +67,24 @@ The model is not intended for use in math and coding tasks.
62
 
63
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
64
 
65
- Teuken 7B-base-v0.6 as a base model is not free from biases and hallucinations. It is therefore recommended to instruction tune it to fit it to the user's purposes and minimize biases and any risks arising. Finetuned models limiting risks and biases will appear soon after the release of the base model as a community effort.
66
 
67
  ## How to Get Started with the Model
68
 
69
  ## Usage
70
- The model requires transformers, sentencepiece, and the torch library.
71
  After installation, here's an example of how to use the model:
72
 
73
  ```python
74
  import torch
75
  from transformers import AutoModelForCausalLM, AutoTokenizer
76
 
77
- model_name = "openGPT-X/Teuken 7B-base-v0.6"
78
  prompt = "Insert text here..."
79
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
80
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
@@ -93,7 +103,7 @@ This example demonstrates how to load the model and tokenizer, prepare input, ge
93
 
94
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
95
 
96
- Teuken 7B-base-v0.6 was pre-trained on 5.5 trillion tokens of data from publicly available sources.
97
 
98
  The pretraining data has a cutoff of September 2023.
99
 
@@ -111,6 +121,25 @@ Transformer-based decoder-only model that has been trained based on the causal l
111
 
112
  <!-- This section describes the evaluation protocols and provides the results. -->
113
 
114
  ### Testing Data, Factors & Metrics
115
 
116
  #### Testing Data
@@ -139,7 +168,7 @@ The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Tr
139
  | Num Query Groups | 2 |
140
  | Normalization | RMSNorm |
141
  | Learning rate | 3e-4 |
142
- | Min learning rate | 3e-5 |
143
  | Disable bias in linear | yes |
144
  | Hidden dropout | 0.0 |
145
  | Attention dropout | 0.0 |
@@ -173,38 +202,25 @@ The configuration of JUWELS Booster compute nodes is the following:
173
 
174
  #### Software
175
 
176
- https://github.com/OpenGPTX/Megatron-LM
177
-
178
- ### Toxic Content
179
-
180
- This Language Model (LLM) may generate content that is inappropriate, offensive, or harmful. While the dataset has been heavily filtered to minimize such outputs,
181
- the model may still produce text that is biased or toxic due to the large scale and diverse nature of the data.
182
 
183
  ## Citation
184
 
185
- TODO
186
-
187
-
188
  **BibTeX:**
189
 
190
- TODO
191
-
192
- **APA:**
193
-
194
- TODO
195
-
196
- ## Model Card Contact
197
 
198
- <div class="hf-card">
199
- <h2>Contact Information</h2>
200
- <p>You can reach out to the following model card contact:</p>
201
- <ul>
202
- <li>
203
- <a href="https://huggingface.co/openGPT-X" target="_blank">OpenGPT-X</a>
204
- - <a href="mailto:contact@opengpt-x.de">contact@opengpt-x.de</a>
205
- </li>
206
- </ul>
207
- </div>
208
 
209
  # Team
210
  ## Data Team
@@ -220,4 +236,21 @@ Klaudia Thellmann (TUD), Alex Jude (IAIS), Jasper Buschhoff (IAIS)
220
  ### Contributors:
221
  Shima Assadi (IIS), Fabio Barth (DFKI)
222
  ## Management
223
- Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)
33
  ---
34
  # Model Card for Teuken 7B-base-v0.6
35
 
36
+ [Teuken 7B-base-v0.6](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.6) is a 7B parameter multilingual large language model (LLM) pre-trained with 6T tokens within the research project [OpenGPT-X](https://opengpt-x.de).
37
 
38
 
39
  ### Model Description
 
49
  ## Uses
50
 
51
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
52
+ [Teuken 7B-base-v0.6](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.6) is intended for research use in all official 24 European languages. Since Teuken 7B-base-v0.6
53
  focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.
54
 
55
+ ## Disclaimer: Toxic Content
56
+
57
+ This Large Language Model (LLM) may generate content that is inappropriate, offensive, or harmful. While the dataset has been filtered to minimize such outputs, the model may still produce text that is biased or toxic due to the large scale and diverse nature of the data.
58
+
59
+
60
  ### Out-of-Scope Use
61
 
62
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
67
 
68
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
69
 
70
+ [Teuken 7B-base-v0.6](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.6) is a base model and is not free from biases and hallucinations.
71
 
72
  ## How to Get Started with the Model
73
 
74
  ## Usage
75
+ The model requires a few libraries that can be installed in your Python environment:
76
+
77
+ ```bash
78
+ python -m pip install numpy torch huggingface_hub transformers sentencepiece
79
+ ```
80
+
81
  After installation, here's an example of how to use the model:
82
 
83
  ```python
84
  import torch
85
  from transformers import AutoModelForCausalLM, AutoTokenizer
86
 
87
+ model_name = "openGPT-X/Teuken-7B-base-v0.6"
88
  prompt = "Insert text here..."
89
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
90
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 
103
 
104
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
105
 
106
+ [Teuken 7B-base-v0.6](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.6) was pre-trained on 6 trillion tokens of data from publicly available sources.
107
 
108
  The pretraining data has a cutoff of September 2023.
109
 
 
121
 
122
  <!-- This section describes the evaluation protocols and provides the results. -->
123
 
124
+ Results on multilingual benchmarks for 21 European languages with instruction-tuned models
125
+
126
+
127
+ | Model | Avg | EU21-ARC | EU21-HeSw | EU21-TQA | EU21-MMLU |
128
+ | --- | --- | --- | --- | --- | --- |
129
+ | **Meta-Llama-3.1-8B** | **0.548** | 0.554 | 0.588 | **0.495** | **0.556** |
130
+ | Salamandra-7B | 0.523 | **0.589** | **0.637** | 0.449 | 0.417 |
131
+ | Mistral-7B-v0.3 | 0.505 | 0.513 | 0.534 | 0.472 | 0.501 |
132
+ | Occiglot-7B-eu5 | 0.464 | 0.470 | 0.511 | 0.448 | 0.426 |
133
+ | Pharia-1-LLM-7B-control | 0.409 | 0.393 | 0.433 | 0.456 | 0.353 |
134
+ | Bloom-7B1 | 0.348 | 0.319 | 0.355 | 0.464 | 0.256 |
135
+ | **Teuken-7B-Base (Ours)** | 0.520 | 0.558 | 0.619 | 0.449 | 0.453 |
136
+
137
+ More information regarding the quality of our translated benchmarks is available in our Evaluation preprint ["Towards Multilingual LLM Evaluation for European Languages"](https://arxiv.org/abs/2410.08928).
138
+ More evaluation results regarding [Teuken 7B-base-v0.6](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.6) are available in our model preprint ["Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs"](https://arxiv.org/abs/2410.03730).
139
+
140
+ The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can also be seen in the [European LLM Leaderboard](https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).
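The `Avg` column in the results table appears to be the unweighted mean of the four benchmark scores (EU21-ARC, EU21-HeSw, EU21-TQA, EU21-MMLU). The README does not state this explicitly, so the following is only a minimal sanity check under that assumption. The scores are copied from the table; the names `scores`, `reported_avg`, and `mean` are illustrative and not part of any OpenGPT-X tooling.

```python
# Per-model scores copied from the results table, in the order
# (EU21-ARC, EU21-HeSw, EU21-TQA, EU21-MMLU).
scores = {
    "Meta-Llama-3.1-8B": (0.554, 0.588, 0.495, 0.556),
    "Salamandra-7B": (0.589, 0.637, 0.449, 0.417),
    "Mistral-7B-v0.3": (0.513, 0.534, 0.472, 0.501),
    "Occiglot-7B-eu5": (0.470, 0.511, 0.448, 0.426),
    "Pharia-1-LLM-7B-control": (0.393, 0.433, 0.456, 0.353),
    "Bloom-7B1": (0.319, 0.355, 0.464, 0.256),
    "Teuken-7B-Base": (0.558, 0.619, 0.449, 0.453),
}

# "Avg" column as reported in the same table.
reported_avg = {
    "Meta-Llama-3.1-8B": 0.548,
    "Salamandra-7B": 0.523,
    "Mistral-7B-v0.3": 0.505,
    "Occiglot-7B-eu5": 0.464,
    "Pharia-1-LLM-7B-control": 0.409,
    "Bloom-7B1": 0.348,
    "Teuken-7B-Base": 0.520,
}


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation, not confirmed by the README)."""
    return sum(values) / len(values)


# Print the recomputed mean next to the reported value for each model.
for model, vals in scores.items():
    print(f"{model:26s} computed={mean(vals):.4f} reported={reported_avg[model]:.3f}")
```

Under this assumption the recomputed means agree with the reported `Avg` values to within rounding at the third decimal place.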
141
+
142
+
143
  ### Testing Data, Factors & Metrics
144
 
145
  #### Testing Data
 
168
  | Num Query Groups | 2 |
169
  | Normalization | RMSNorm |
170
  | Learning rate | 3e-4 |
171
+ | Min learning rate | 1.5e-5 |
172
  | Disable bias in linear | yes |
173
  | Hidden dropout | 0.0 |
174
  | Attention dropout | 0.0 |
 
202
 
203
  #### Software
204
 
205
+ [Megatron-LM](https://github.com/OpenGPTX/Megatron-LM)
206
 
207
  ## Citation
208
209
  **BibTeX:**
210
 
211
+ If you find our model useful in your research, please consider citing our [preprint](https://arxiv.org/abs/2410.03730):
212
+ ```
213
 
214
+ @misc{ali2024teuken7bbaseteuken7binstructeuropean,
215
+ title={Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs},
216
+ author={Mehdi Ali and Michael Fromm and Klaudia Thellmann and Jan Ebert and Alexander Arno Weber and Richard Rutmann and Charvi Jain and Max Lübbering and Daniel Steinigen and Johannes Leveling and Katrin Klug and Jasper Schulze Buschhoff and Lena Jurkschat and Hammam Abdelwahab and Benny Jörg Stein and Karl-Heinz Sylla and Pavel Denisov and Nicolo' Brandizzi and Qasid Saleem and Anirban Bhowmick and Lennard Helmer and Chelsea John and Pedro Ortiz Suarez and Malte Ostendorff and Alex Jude and Lalith Manjunath and Samuel Weinbach and Carolin Penke and Oleg Filatov and Shima Asaadi and Fabio Barth and Rafet Sifa and Fabian Küch and Andreas Herten and René Jäkel and Georg Rehm and Stefan Kesselheim and Joachim Köhler and Nicolas Flores-Herr},
217
+ year={2024},
218
+ eprint={2410.03730},
219
+ archivePrefix={arXiv},
220
+ primaryClass={cs.CL},
221
+ url={https://arxiv.org/abs/2410.03730},
222
+ }
223
+ ```
224
 
225
  # Team
226
  ## Data Team
 
236
  ### Contributors:
237
  Shima Assadi (IIS), Fabio Barth (DFKI)
238
  ## Management
239
+ Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)
240
+
241
+ We believe that collaboration is key to overcoming the aforementioned limitations and thereby strengthening the European GenAI landscape. Because of this, the team invites researchers, developers, and AI enthusiasts to join and engage through various platforms. A Discord server has been created for community collaboration, offering a space for discussions on technical details, ideas, and direct interaction with developers. Additionally, resources like research publications and a European LLM Leaderboard provide insights into Teuken-7B’s performance and technical aspects. The OpenGPT-X team encourages ongoing engagement and collaboration as the project evolves.
242
+ Key links:
243
+ Discord: OpenGPT-X [Discord server](https://discord.com/invite/RvdHpGMvB3)
244
+ Research Papers: OpenGPT-X News [Research Papers](https://opengpt-x.de/en/news-en/)
245
+ LLM Leaderboard: European LLM Leaderboard [LLM Leaderboard](https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard)
246
+
247
+ <div class="hf-card">
248
+ <h2>Contact Information</h2>
249
+ <p>You can reach out to the following model card contact:</p>
250
+ <ul>
251
+ <li>
252
+ <a href="https://huggingface.co/openGPT-X" target="_blank">OpenGPT-X</a>
253
254
+ </li>
255
+ </ul>
256
+ </div>