Model Details
This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (a rough parameter-count check follows the list):
- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048
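As a sanity check, these hyperparameters are consistent with the quoted ~2.15B parameter count. The sketch below assumes a Llama-style block (SwiGLU MLP, RMSNorm), an intermediate size of 8192, full multi-head attention, and tied input/output embeddings; none of these details are stated in this card, so treat them as assumptions.

```python
# Rough parameter-count estimate for a Llama-style decoder with the
# hyperparameters listed above. The intermediate (MLP) size and tied
# input/output embeddings are assumptions, not stated in this card.

hidden_size = 2048
num_layers = 24
vocab_size = 262_144          # Gemma-3 tokenizer vocabulary (see Tokenizer section)
intermediate_size = 8192      # assumed SwiGLU MLP width
tie_embeddings = True         # assumed weight tying

embed = vocab_size * hidden_size
attn_per_layer = 4 * hidden_size * hidden_size        # Q, K, V, O projections (full MHA assumed)
mlp_per_layer = 3 * hidden_size * intermediate_size   # gate, up, down projections
norms_per_layer = 2 * hidden_size                     # two RMSNorm weights per layer

total = embed * (1 if tie_embeddings else 2)
total += num_layers * (attn_per_layer + mlp_per_layer + norms_per_layer)
total += hidden_size                                  # final RMSNorm

print(f"~{total / 1e9:.2f}B parameters")              # ~2.15B under these assumptions
```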
Training Data
The training data is a diverse, multilingual dataset combining high-quality English, code, math, and European language corpora. The total training budget is 4 trillion tokens. The training mixture comprises the following datasets:
- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.
The final data split allocates predefined proportions to English, code, and math, with the remaining token budget distributed across the other languages. Each language in the mix receives a minimum proportion of 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation). A hypothetical sketch of this allocation is shown below.
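The sketch below illustrates one way such a split can be computed: fixed shares for English, code, and math, the remainder distributed across other languages by corpus size, and the 0.05% per-language floor enforced. The headline proportions, language list, and sizes are made-up examples, not the exact recipe used for this model.

```python
# Hypothetical illustration of the budget split described above.
# Fixed shares and language sizes are example values, not from this card.

TOTAL_TOKENS = 4_000_000_000_000          # 4T token budget
FIXED_SHARES = {"english": 0.50, "code": 0.15, "math": 0.05}   # assumed shares
MIN_LANG_SHARE = 0.0005                   # 0.05% floor per language

# Relative corpus sizes (tokens) for the remaining languages -- example values.
lang_sizes = {"deu_Latn": 300e9, "fra_Latn": 250e9, "fin_Latn": 40e9, "mlt_Latn": 0.5e9}

remaining = 1.0 - sum(FIXED_SHARES.values())
raw = {lang: size / sum(lang_sizes.values()) * remaining for lang, size in lang_sizes.items()}

# Apply the floor, then rescale the unfloored languages so shares sum to `remaining`.
floored = [lang for lang, share in raw.items() if share < MIN_LANG_SHARE]
shares = {lang: MIN_LANG_SHARE for lang in floored}
leftover = remaining - MIN_LANG_SHARE * len(floored)
free_total = sum(raw[l] for l in raw if l not in floored)
shares.update({l: raw[l] / free_total * leftover for l in raw if l not in floored})

mixture = {**FIXED_SHARES, **shares}
tokens = {name: share * TOTAL_TOKENS for name, share in mixture.items()}
print({name: f"{t / 1e9:.0f}B" for name, t in tokens.items()})
```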
Tokenizer
The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.
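The tokenizer can be inspected independently of the model, for instance via Hugging Face transformers. The checkpoint ID below points at one public Gemma-3 model that ships this tokenizer and is only an assumption; this model's own repository likely bundles its own copy.

```python
from transformers import AutoTokenizer

# Load the Gemma-3 SentencePiece tokenizer from a public Gemma-3 checkpoint
# (access to Gemma checkpoints on the Hub requires accepting the license).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-pt")

print(tokenizer.vocab_size)  # ~262K entries

# The large multilingual vocabulary keeps token counts reasonable across languages.
for text in ["The quick brown fox.", "Der schnelle braune Fuchs.", "Nopea ruskea kettu."]:
    print(text, "->", len(tokenizer(text)["input_ids"]), "tokens")
```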
Training Information
The model was trained with the Megatron-LM framework on the LUMI HPC supercomputer, using 64 AMD MI250X nodes for a total of approximately 165,000 GPU hours.
Intermediate Checkpoints
We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.
The naming convention is checkpoint_0xxxxx00. For example, the checkpoint for 50,000 iterations is named checkpoint_0050000. The available checkpoints range from checkpoint_0005000 up to checkpoint_0953675. The final checkpoint, checkpoint_0953675, is located in the main branch.
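A specific intermediate checkpoint can be loaded by passing its branch name as `revision` to `from_pretrained`. The repository ID below is a placeholder, and the branch name is assumed to match the checkpoint naming convention above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "org/model-name"   # placeholder -- substitute this repository's actual ID

# `revision` selects the branch holding the desired intermediate checkpoint;
# omitting it (or passing "main") loads the final checkpoint_0953675 weights.
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision="checkpoint_0050000")
```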