Model Details
This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (a rough parameter-count check follows the list):
- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048
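As a sanity check, these hyperparameters are consistent with the quoted ~2.15B parameter count. The sketch below assumes a Llama-style block (SwiGLU MLP, RMSNorm), an intermediate size of 8192, full multi-head attention, and tied input/output embeddings; none of these details are stated in this card, so treat them as assumptions.

```python
# Rough parameter-count estimate for a Llama-style decoder with the
# hyperparameters listed above. The intermediate (MLP) size and tied
# input/output embeddings are assumptions, not stated in this card.

hidden_size = 2048
num_layers = 24
vocab_size = 262_144          # Gemma-3 tokenizer vocabulary (see Tokenizer section)
intermediate_size = 8192      # assumed SwiGLU MLP width
tie_embeddings = True         # assumed weight tying

embed = vocab_size * hidden_size
attn_per_layer = 4 * hidden_size * hidden_size        # Q, K, V, O projections (full MHA assumed)
mlp_per_layer = 3 * hidden_size * intermediate_size   # gate, up, down projections
norms_per_layer = 2 * hidden_size                     # two RMSNorm weights per layer

total = embed * (1 if tie_embeddings else 2)
total += num_layers * (attn_per_layer + mlp_per_layer + norms_per_layer)
total += hidden_size                                  # final RMSNorm

print(f"~{total / 1e9:.2f}B parameters")              # ~2.15B under these assumptions
```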
Training Data
The training data is a diverse, multilingual dataset combining high-quality English, code, math, and European language corpora. The total training budget is 4 trillion tokens. The training mixture comprises the following datasets:
- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.
The final data split allocates predefined proportions to English, code, and math, with the remaining token budget distributed across the other languages. Each language in the mix receives a minimum proportion of 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation). A hypothetical sketch of this allocation is shown below.
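The sketch below illustrates one way such a split can be computed: fixed shares for English, code, and math, the remainder distributed across other languages by corpus size, and the 0.05% per-language floor enforced. The headline proportions, language list, and sizes are made-up examples, not the exact recipe used for this model.

```python
# Hypothetical illustration of the budget split described above.
# Fixed shares and language sizes are example values, not from this card.

TOTAL_TOKENS = 4_000_000_000_000          # 4T token budget
FIXED_SHARES = {"english": 0.50, "code": 0.15, "math": 0.05}   # assumed shares
MIN_LANG_SHARE = 0.0005                   # 0.05% floor per language

# Relative corpus sizes (tokens) for the remaining languages -- example values.
lang_sizes = {"deu_Latn": 300e9, "fra_Latn": 250e9, "fin_Latn": 40e9, "mlt_Latn": 0.5e9}

remaining = 1.0 - sum(FIXED_SHARES.values())
raw = {lang: size / sum(lang_sizes.values()) * remaining for lang, size in lang_sizes.items()}

# Apply the floor, then rescale the unfloored languages so shares sum to `remaining`.
floored = [lang for lang, share in raw.items() if share < MIN_LANG_SHARE]
shares = {lang: MIN_LANG_SHARE for lang in floored}
leftover = remaining - MIN_LANG_SHARE * len(floored)
free_total = sum(raw[l] for l in raw if l not in floored)
shares.update({l: raw[l] / free_total * leftover for l in raw if l not in floored})

mixture = {**FIXED_SHARES, **shares}
tokens = {name: share * TOTAL_TOKENS for name, share in mixture.items()}
print({name: f"{t / 1e9:.0f}B" for name, t in tokens.items()})
```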
Tokenizer
The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.
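The tokenizer can be inspected independently of the model, for instance via Hugging Face transformers. The checkpoint ID below points at one public Gemma-3 model that ships this tokenizer and is only an assumption; this model's own repository likely bundles its own copy.

```python
from transformers import AutoTokenizer

# Load the Gemma-3 SentencePiece tokenizer from a public Gemma-3 checkpoint
# (access to Gemma checkpoints on the Hub requires accepting the license).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-pt")

print(tokenizer.vocab_size)  # ~262K entries

# The large multilingual vocabulary keeps token counts reasonable across languages.
for text in ["The quick brown fox.", "Der schnelle braune Fuchs.", "Nopea ruskea kettu."]:
    print(text, "->", len(tokenizer(text)["input_ids"]), "tokens")
```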
Training Information
The model was trained with the Megatron-LM framework on the LUMI HPC supercomputer, using 64 AMD MI250X nodes for a total of approximately 165,000 GPU hours.
Intermediate Checkpoints
We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.
The naming convention is checkpoint_0xxxxx00. For example, the checkpoint for 50,000 iterations is named checkpoint_0050000. The available checkpoints range from checkpoint_0005000 up to checkpoint_0953675. The final checkpoint, checkpoint_0953675, is located in the main branch.
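A specific intermediate checkpoint can be loaded by passing its branch name as `revision` to `from_pretrained`. The repository ID below is a placeholder, and the branch name is assumed to match the checkpoint naming convention above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "org/model-name"   # placeholder -- substitute this repository's actual ID

# `revision` selects the branch holding the desired intermediate checkpoint;
# omitting it (or passing "main") loads the final checkpoint_0953675 weights.
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision="checkpoint_0050000")
```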