---
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
language:
- en
- bg
- cs
- da
- de
- el
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- es
- sv
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
---
# Model Details
This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (a configuration sketch follows the list):
- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048

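For orientation, these hyperparameters map onto a Llama-style configuration roughly as in the sketch below. The `intermediate_size`, `vocab_size`, and embedding tying are not stated in this card; they are assumptions chosen so the count lands near the quoted ~2.15B parameters.

```python
# Minimal sketch, not the official configuration: a Llama-style config
# matching the hyperparameters listed above.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=2048,               # Hidden Size (from the card)
    num_attention_heads=32,         # Attention Heads (from the card)
    num_hidden_layers=24,           # Layers (from the card)
    max_position_embeddings=2048,   # Sequence Length (from the card)
    intermediate_size=8192,         # assumption, not stated in the card
    vocab_size=262144,              # assumption, based on the ~262K tokenizer vocabulary
    tie_word_embeddings=True,       # assumption
)

# Instantiate on the meta device so no real memory is allocated.
with torch.device("meta"):
    model = LlamaForCausalLM(config)

print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```
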
# Training Data
The training data is a diverse, multilingual dataset combining high-quality English, code, math, and European-language corpora. The total token budget for training is 4 trillion tokens. The training mixture comprises the following datasets:
- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.

The final data split is based on predefined proportions of English, code, and math, with the remaining token budget allocated to the other languages. The minimum proportion for each language in the mix is 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation).

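As a rough illustration of the allocation rule described above, the sketch below applies a 0.05% per-language floor to an otherwise proportional split. The fixed English/code/math shares and the per-language weights are invented for illustration, and the floor is interpreted as a share of the total budget; only the 4T-token budget, the 0.05% floor, and the nno_Latn/nob_Latn merge come from this card.

```python
# Illustrative only: the fixed shares and language weights below are invented.
TOTAL_TOKENS = 4_000_000_000_000                             # 4T-token budget (from the card)
FIXED_SHARES = {"english": 0.5, "code": 0.1, "math": 0.05}   # assumed, not from the card

multilingual_budget = TOTAL_TOKENS * (1 - sum(FIXED_SHARES.values()))

# Hypothetical relative weights; nno_Latn is folded into nob_Latn before
# proportions (and the floor) are computed, as noted above.
lang_weights = {"deu_Latn": 0.25, "fra_Latn": 0.20, "nob_Latn+nno_Latn": 0.001}

floor_tokens = 0.0005 * TOTAL_TOKENS                         # 0.05% minimum per language
for lang, weight in lang_weights.items():
    tokens = max(weight * multilingual_budget, floor_tokens)
    print(f"{lang}: {tokens / 1e9:.1f}B tokens")
```
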
# Tokenizer
The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.

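A quick way to inspect the tokenizer is to load it from this repository; the repository ID below is a placeholder for wherever this model is hosted.

```python
# Minimal sketch; "<org>/<this-model>" is a placeholder, not a real repository ID.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<org>/<this-model>")
print(len(tokenizer))                                        # on the order of 262K tokens
print(tokenizer.tokenize("Hyvää huomenta ja tervetuloa!"))   # e.g. Finnish text
```
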
# Training Information
The model was trained using the Megatron-LM framework on the LUMI HPC supercomputer. Training used 64 AMD MI250X nodes, totaling approximately 165,000 GPU hours.

## Intermediate Checkpoints
We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.

The naming convention is `checkpoint_<iteration>`, with the iteration count zero-padded to seven digits. For example, the checkpoint for 50,000 iterations is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the `main` branch.
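
A specific intermediate checkpoint can be loaded by passing its branch name as the `revision` argument in `transformers`; the repository ID below is a placeholder for this model's repository.

```python
# Minimal sketch; "<org>/<this-model>" is a placeholder, not a real repository ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<org>/<this-model>"
revision = "checkpoint_0050000"   # branch holding the 50,000-iteration checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```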