---
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
- bigcode/starcoderdata
language:
- en
- bg
- cs
- da
- de
- el
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- es
- sv
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
---
# Model Details
This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters:
- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048
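
The model can be loaded as a causal language model with the Hugging Face Transformers library. The snippet below is a minimal sketch, not an official usage example; the repository ID is a placeholder and should be replaced with this model's actual Hub ID.

```python
# Minimal loading sketch (assumes the Hugging Face Transformers library).
# "your-org/this-model" is a placeholder, not the actual repository ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/this-model"  # placeholder repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Generate from a short prompt; the 2048-token sequence length listed above
# bounds prompt plus generated tokens.
inputs = tokenizer("The capital of Finland is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```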

# Training Data
The training data is a diverse, multilingual dataset that combines high-quality English, code, math, and European-language corpora. The total token budget for training is 4 trillion tokens. The training mixture comprises the following datasets:
- English: A mixture of the Nemotron-CC high-actual and medium-high-actual datasets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.

The final data split is based on a predefined proportion of English, code, and math, with the remaining token budget allocated to the other languages. The minimum proportion for each language in the mix is 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation).

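As an illustration of how such a split can be computed, the sketch below turns assumed English/code/math shares and assumed per-language corpus sizes into token counts. Only the 4-trillion-token budget and the 0.05% floor come from the description above; every other number is a made-up placeholder, and the floor-then-rescale step is one possible reading of the allocation rule, not the exact pipeline used.

```python
# Illustrative sketch of the allocation logic described above.
# The fixed shares and per-language corpus sizes are made-up placeholders;
# only the 4T total budget and the 0.05% per-language floor come from the text.
TOTAL_TOKENS = 4_000_000_000_000                              # 4 trillion token budget
FIXED_SHARES = {"english": 0.50, "code": 0.15, "math": 0.05}  # placeholder shares
MIN_LANG_SHARE = 0.0005                                       # 0.05% minimum per language

# Placeholder corpus sizes (in tokens) for a few HPLT 2.0 languages.
lang_sizes = {"de": 900e9, "fr": 800e9, "mt": 2e9, "is": 3e9}

remaining = 1.0 - sum(FIXED_SHARES.values())  # share left for the other languages
raw = {l: remaining * s / sum(lang_sizes.values()) for l, s in lang_sizes.items()}

# Lift small languages to the floor, then rescale the rest to keep the total fixed
# (a single-pass simplification; a real pipeline might iterate).
small = {l for l, s in raw.items() if s < MIN_LANG_SHARE}
scale = (remaining - MIN_LANG_SHARE * len(small)) / sum(
    s for l, s in raw.items() if l not in small
)
shares = {l: (MIN_LANG_SHARE if l in small else raw[l] * scale) for l in raw}

for lang, share in sorted(shares.items()):
    print(f"{lang}: {share:.4%} -> {share * TOTAL_TOKENS / 1e9:,.1f}B tokens")
```
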
# Tokenizer
The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.

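A quick way to inspect the tokenizer is sketched below; it assumes the tokenizer is loaded from this model's repository, and the repository ID is again a placeholder.

```python
# Sketch: inspect the tokenizer shipped with the model.
# "your-org/this-model" is a placeholder repository ID.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/this-model")  # placeholder

print(len(tokenizer))  # vocabulary size, roughly 262K entries

# The same sentence segments into different token counts across languages.
for text in ["Hello, how are you?", "Hei, mitä kuuluu?", "Hallo, wie geht es dir?"]:
    tokens = tokenizer.tokenize(text)
    print(f"{len(tokens):2d} tokens: {tokens}")
```
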
# Training Information
The model was trained using the Megatron-LM framework on the LUMI HPC supercomputer. The training used 64 AMD MI250X nodes, totaling approximately 165,000 GPU hours.

## Intermediate Checkpoints
We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.

Checkpoints are named `checkpoint_` followed by the seven-digit, zero-padded training step number; for example, the checkpoint after 50,000 iterations is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the main branch.
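
Because each intermediate checkpoint lives on its own branch, it can be selected with the `revision` argument of `from_pretrained`. A minimal sketch follows; the repository ID is a placeholder.

```python
# Sketch: load an intermediate checkpoint from its branch via `revision`.
# "your-org/this-model" is a placeholder repository ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/this-model"   # placeholder
revision = "checkpoint_0050000"    # branch holding the 50,000-step checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
```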