mfromm committed · Commit ce14779 · verified · 1 Parent(s): caf0502

Create README.md

  1. README.md +223 -0
README.md ADDED
@@ -0,0 +1,223 @@
1
+ ---
2
+ language:
3
+ - de
4
+ - bg
5
+ - cs
6
+ - da
7
+ - el
8
+ - en
9
+ - es
10
+ - et
11
+ - fi
12
+ - fr
13
+ - ga
14
+ - hr
15
+ - hu
16
+ - it
17
+ - lt
18
+ - lv
19
+ - mt
20
+ - nl
21
+ - pl
22
+ - pt
23
+ - ro
24
+ - sl
25
+ - sv
26
+ - sk
27
+ metrics:
28
+ - accuracy
29
+ - bleu
30
+ pipeline_tag: text-generation
31
+ library_name: transformers
32
+ license: apache-2.0
33
+ ---
34
+ # Model Card for Teuken-7B-base-v0.6
35
+
36
+ Teuken-7B-base-v0.6 is a 7B-parameter multilingual large language model (LLM) pre-trained on 6T tokens as part of the OpenGPT-X research project.
37
+
38
+
39
+ ### Model Description
40
+
41
+ <!-- Provide a longer summary of what this model is. -->
42
+
43
+ - **Developed by:** Fraunhofer, Forschungszentrum Jülich, TU Dresden, DFKI
44
+ - **Funded by:** German Federal Ministry of Economics and Climate Protection (BMWK) in the context of the OpenGPT-X project
45
+ - **Model type:** Transformer-based decoder-only model
46
+ - **Language(s) (NLP):** bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
47
+ - **Shared by:** OpenGPT-X
48
+
49
+ ## Uses
50
+
51
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
52
+ Teuken-7B-base-v0.6 is intended for commercial and research use in all 24 official languages of the European Union. Because it covers all 24 EU languages, it produces more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for multilingual tasks.
54
+
55
+ ### Out-of-Scope Use
56
+
57
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
58
+
59
+ The model is not intended for use in math and coding tasks.
60
+
61
+ ## Bias, Risks, and Limitations
62
+
63
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
64
+
65
+ As a base model, Teuken-7B-base-v0.6 is not free from biases and hallucinations. We therefore recommend instruction-tuning it for the intended use case to reduce biases and the associated risks. Fine-tuned models that limit these risks and biases are expected to appear soon after the release of the base model as a community effort.
66
+
67
+ ## How to Get Started with the Model
68
+
69
+ ## Usage
70
+ The model requires the transformers, sentencepiece, and torch libraries.
71
+ After installation, here's an example of how to use the model:
72
+
73
+ ```python
74
+ import torch
75
+ from transformers import AutoModelForCausalLM, AutoTokenizer
76
+
77
+ model_name = "EuropeanLLM-Beta/Teuken-7B-base-v0.6"
78
+ prompt = "Insert text here..."
79
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
80
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
81
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)
82
+ inputs = tokenizer(prompt, return_tensors="pt")
83
+ inputs = {k: v.to(device) for k, v in inputs.items()} # Move inputs to the same device as the model
84
+ output = model.generate(input_ids=inputs['input_ids'], max_new_tokens=1000, do_sample=True)
85
+ result = tokenizer.decode(output[0], skip_special_tokens=True)  # generate() returns a batch; decode its first sequence
+ print(result)
86
+ ```
87
+
88
+ This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
89
+
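The example above loads the weights in the default precision. Since the model was trained in bf16, a lower-memory variant might look like the following sketch. It assumes that the checkpoint loads cleanly with `torch_dtype=torch.bfloat16` and that the hardware supports bf16; the function name `generate_bf16` and the `max_new_tokens=100` default are illustrative choices, not part of the model's documented API.

```python
def generate_bf16(prompt, model_name="EuropeanLLM-Beta/Teuken-7B-base-v0.6", max_new_tokens=100):
    """Hypothetical helper: load the model in bf16 and return only the newly generated text."""
    # Imports are local so that defining this function does not require torch to be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    # Assumption: bf16 weights are acceptable here; this roughly halves memory versus fp32.
    model = AutoModelForCausalLM.from_pretrained(
        model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
    ).to(device)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, no gradients needed
        output = model.generate(
            input_ids=inputs["input_ids"], max_new_tokens=max_new_tokens, do_sample=True
        )

    # output has shape (batch, prompt_len + new_len); drop the prompt tokens of the first sequence.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_bf16("Insert text here...")` downloads the checkpoint on first use, so it is best run on a machine with enough GPU memory for a 7B-parameter model.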
90
+ ## Training Details
91
+
92
+ ### Training Data
93
+
94
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
95
+
96
+ Teuken-7B-base-v0.6 was pre-trained on 5.5 trillion tokens of data from publicly available sources.
97
+
98
+ The pretraining data has a cutoff of September 2023.
99
+
100
+ ### Training Procedure
101
+
102
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
103
+ Teuken-7B-base-v0.6 is a Transformer-based decoder-only model trained with the causal language modeling objective.
104
+
105
+
106
+ #### Training Hyperparameters
107
+
108
+ - **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, , bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
109
+
110
+ ## Evaluation
111
+
112
+ <!-- This section describes the evaluation protocols and provides the results. -->
113
+
114
+ ### Testing Data, Factors & Metrics
115
+
116
+ #### Testing Data
117
+
118
+ <!-- This should link to a Dataset Card if possible. -->
119
+
120
+ The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can be seen in the European LLM Leaderboard (https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).
121
+
122
+
123
+ ## Technical Specifications
124
+
125
+ ### Model Architecture and Objective
126
+
127
+ | Hyper-Parameter | Value |
128
+ |----------------------------|----------|
129
+ | Training Objective | CLM |
130
+ | Activation Function | SwiGLU |
131
+ | Seq Length | 4096 |
132
+ | Position Embeddings | Rotary |
133
+ | Num Layers | 32 |
134
+ | Hidden Size | 4096 |
135
+ | FFN Hidden Size | 13440 |
136
+ | Num Attention Heads | 32 |
137
+ | Head Dim | 128 |
138
+ | Group Query Attention | yes |
139
+ | Num Query Groups | 2 |
140
+ | Normalization | RMSNorm |
141
+ | Learning rate | 3e-4 |
142
+ | Min learning rate | 3e-5 |
143
+ | Disable bias in linear | yes |
144
+ | Hidden dropout | 0.0 |
145
+ | Attention dropout | 0.0 |
146
+ | Optimizer | AdamW |
147
+ | Beta1 | 0.9 |
148
+ | Beta2 | 0.95 |
149
+ | Sequence-parallelism       |          |
150
+ | Data-type | bf16 |
151
+ | Recompute-activations | yes |
152
+ | Distributed-optimizers | yes |
153
+ | Model Initialization | |
154
+
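As a rough sanity check of the table above, the following sketch estimates the non-embedding parameter count and the key/value cache size implied by the listed hyperparameters. It rests on several assumptions that the table does not spell out: "Num Query Groups" is read as the number of key/value heads, SwiGLU is taken to use three projection matrices (gate, up, down), each layer is assumed to have two RMSNorm weight vectors plus one final norm, and the cache is computed for a batch size of 1. Embedding parameters are left out because the vocabulary size is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeukenConfig:
    # Values copied from the hyperparameter table.
    num_layers: int = 32
    hidden_size: int = 4096
    ffn_hidden_size: int = 13440
    num_attention_heads: int = 32
    head_dim: int = 128
    num_query_groups: int = 2   # assumed to be the number of key/value heads
    seq_length: int = 4096
    bytes_per_value: int = 2    # bf16


def non_embedding_parameters(cfg: TeukenConfig) -> int:
    """Estimate parameters outside the embedding tables (no biases, per the table)."""
    q_and_out = 2 * cfg.hidden_size * cfg.num_attention_heads * cfg.head_dim
    k_and_v = 2 * cfg.hidden_size * cfg.num_query_groups * cfg.head_dim
    mlp = 3 * cfg.hidden_size * cfg.ffn_hidden_size   # assumed gate, up, down projections
    norms = 2 * cfg.hidden_size                       # assumed two RMSNorms per layer
    per_layer = q_and_out + k_and_v + mlp + norms
    return cfg.num_layers * per_layer + cfg.hidden_size  # plus an assumed final RMSNorm


def kv_cache_bytes(cfg: TeukenConfig, num_tokens: int, kv_heads=None) -> int:
    """Bytes needed to cache keys and values for num_tokens at batch size 1."""
    kv_heads = cfg.num_query_groups if kv_heads is None else kv_heads
    return 2 * cfg.num_layers * kv_heads * cfg.head_dim * cfg.bytes_per_value * num_tokens


cfg = TeukenConfig()
print(f"non-embedding parameters: {non_embedding_parameters(cfg):,}")
print(f"KV cache at full context (2 groups): {kv_cache_bytes(cfg, cfg.seq_length) / 2**20:.0f} MiB")
print(f"KV cache at full context (32 heads): {kv_cache_bytes(cfg, cfg.seq_length, kv_heads=32) / 2**30:.0f} GiB")
```

Under these assumptions the layers alone account for about 6.4B parameters, and grouped-query attention with 2 groups shrinks the full-context cache by a factor of 16 compared with caching all 32 heads.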
155
+
156
+ ### Compute Infrastructure
157
+
158
+ We trained our models on JUWELS Booster, which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs. The compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.
159
+
160
+ #### Hardware
161
+
162
+ The configuration of JUWELS Booster compute nodes is the following:
163
+
164
+ CPU: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 configuration
165
+
166
+ Memory: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
167
+
168
+ GPU: 4 × NVIDIA A100 Tensor Core GPU with 40 GB; connected via NVLink3 to each other
169
+
170
+ Network: 4 × Mellanox HDR200 InfiniBand ConnectX 6 (200 Gbit/s each), HCA
171
+
172
+ Periphery: CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). PCIe switches are configured in synthetic mode.
173
+
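The per-node figures above can be combined into system-wide totals. The short sketch below only multiplies numbers quoted in this section; the variable names are illustrative, and it describes the full machine rather than the (unstated) share used for training.

```python
# Figures quoted in the Compute Infrastructure and Hardware sections.
NODES = 936
GPUS_PER_NODE = 4
GPU_MEMORY_GB = 40
SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 24, 2
HCAS_PER_NODE, HCA_GBIT_S = 4, 200

total_gpus = NODES * GPUS_PER_NODE
total_gpu_memory_gb = total_gpus * GPU_MEMORY_GB
threads_per_node = SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE
node_network_gbit_s = HCAS_PER_NODE * HCA_GBIT_S

print(f"GPUs in the system:        {total_gpus}")
print(f"Aggregate GPU memory:      {total_gpu_memory_gb} GB")
print(f"Hardware threads per node: {threads_per_node}")
print(f"Network per node:          {node_network_gbit_s} Gbit/s")
```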
174
+ #### Software
175
+
176
+ https://github.com/OpenGPTX/Megatron-LM
177
+
178
+ ### Toxic Content
179
+
180
+ This large language model (LLM) may generate content that is inappropriate, offensive, or harmful. Although the training dataset has been heavily filtered to minimize such outputs, the model may still produce biased or toxic text because of the scale and diversity of the data.
182
+
183
+ ## Citation
184
+
185
+ TODO
186
+
187
+
188
+ **BibTeX:**
189
+
190
+ TODO
191
+
192
+ **APA:**
193
+
194
+ TODO
195
+
196
+ ## Model Card Contact
197
+
198
+ <div class="hf-card">
199
+ <h2>Contact Information</h2>
200
+ <p>You can reach out to the following model card contact:</p>
201
+ <ul>
202
+ <li>
203
+ <a href="https://huggingface.co/openGPT-X" target="_blank">OpenGPT-X</a>
204
+ - <a href="mailto:[email protected]">[email protected]</a>
205
+ </li>
206
+ </ul>
207
+ </div>
208
+
209
+ # Team
210
+ ## Data Team
211
+ Anirban Bhowmick (IAIS), Nicolo Brandizzi (IAIS), Lennard Helmer (IAIS), Benny Jörg Stein (IAIS), Karl-Heinz Sylla (IAIS), Pavel Denisov (IAIS), Qasid Saleem (IAIS), Johannes Leveling (IAIS), Hammam Abdelwahab (IAIS), Luzian Hahn (IIS), Farzad Naderi (IIS), Md Saiful Islam (IIS), Alexander Schwirjow (IIS), Pedro Ortiz Suarez (ex. DFKI), Malte Ostendorff (ex. DFKI)
212
+ ## Model-Training Team
213
+ ### Core contributors
214
+ Mehdi Ali (IAIS), Michael Fromm (IAIS), Jan Ebert (FZJ), Chelsea John (FZJ), Lena Jurkschat (TUD), Alexander Weber (IAIS)
215
+ ### Contributors:
216
+ Richard Rutmann (IAIS), Daniel Steinigen (IAIS), Lalith Manjunath (TUD), Carolin Penke (FZJ)
217
+ ## Evaluation Team
218
+ ### Core contributors
219
+ Klaudia Thellmann (TUD), Alex Jude (IAIS), Jasper Buschhoff (IAIS)
220
+ ### Contributors:
221
+ Shima Assadi (IIS), Fabio Barth (DFKI)
222
+ ## Management
223
+ Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)