---
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
language:
- en
- bg
- cs
- da
- de
- el
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- es
- sv
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
---
# Model Details
This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (a configuration sketch follows the list):
- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048

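For orientation, these hyperparameters map onto a Llama-style configuration roughly as in the sketch below. The `intermediate_size`, `vocab_size`, and embedding tying are not stated in this card; they are assumptions chosen so the count lands near the quoted ~2.15B parameters.

```python
# Minimal sketch, not the official configuration: a Llama-style config
# matching the hyperparameters listed above.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=2048,               # Hidden Size (from the card)
    num_attention_heads=32,         # Attention Heads (from the card)
    num_hidden_layers=24,           # Layers (from the card)
    max_position_embeddings=2048,   # Sequence Length (from the card)
    intermediate_size=8192,         # assumption, not stated in the card
    vocab_size=262144,              # assumption, based on the ~262K tokenizer vocabulary
    tie_word_embeddings=True,       # assumption
)

# Instantiate on the meta device so no real memory is allocated.
with torch.device("meta"):
    model = LlamaForCausalLM(config)

print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```
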
# Training Data
The training data is a diverse, multilingual dataset combining high-quality English, code, math, and European-language corpora. The total token budget for training is 4 trillion tokens. The training mixture comprises the following datasets:
- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.

The final data split is based on predefined proportions of English, code, and math, with the remaining token budget allocated to the other languages. The minimum proportion for each language in the mix is 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation).

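As a rough illustration of the allocation rule described above, the sketch below applies a 0.05% per-language floor to an otherwise proportional split. The fixed English/code/math shares and the per-language weights are invented for illustration, and the floor is interpreted as a share of the total budget; only the 4T-token budget, the 0.05% floor, and the nno_Latn/nob_Latn merge come from this card.

```python
# Illustrative only: the fixed shares and language weights below are invented.
TOTAL_TOKENS = 4_000_000_000_000                             # 4T-token budget (from the card)
FIXED_SHARES = {"english": 0.5, "code": 0.1, "math": 0.05}   # assumed, not from the card

multilingual_budget = TOTAL_TOKENS * (1 - sum(FIXED_SHARES.values()))

# Hypothetical relative weights; nno_Latn is folded into nob_Latn before
# proportions (and the floor) are computed, as noted above.
lang_weights = {"deu_Latn": 0.25, "fra_Latn": 0.20, "nob_Latn+nno_Latn": 0.001}

floor_tokens = 0.0005 * TOTAL_TOKENS                         # 0.05% minimum per language
for lang, weight in lang_weights.items():
    tokens = max(weight * multilingual_budget, floor_tokens)
    print(f"{lang}: {tokens / 1e9:.1f}B tokens")
```
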
# Tokenizer
The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.

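A quick way to inspect the tokenizer is to load it from this repository; the repository ID below is a placeholder for wherever this model is hosted.

```python
# Minimal sketch; "<org>/<this-model>" is a placeholder, not a real repository ID.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<org>/<this-model>")
print(len(tokenizer))                                        # on the order of 262K tokens
print(tokenizer.tokenize("Hyvää huomenta ja tervetuloa!"))   # e.g. Finnish text
```
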
# Training Information
The model was trained using the Megatron-LM framework on the LUMI HPC supercomputer. Training used 64 AMD MI250X nodes, totaling approximately 165,000 GPU hours.

## Intermediate Checkpoints
We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.

The naming convention is `checkpoint_<iteration>`, with the iteration count zero-padded to seven digits. For example, the checkpoint for 50,000 iterations is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the `main` branch.
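
A specific intermediate checkpoint can be loaded by passing its branch name as the `revision` argument in `transformers`; the repository ID below is a placeholder for this model's repository.

```python
# Minimal sketch; "<org>/<this-model>" is a placeholder, not a real repository ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<org>/<this-model>"
revision = "checkpoint_0050000"   # branch holding the 50,000-iteration checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```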