Model Details

This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (a rough parameter-count check follows the list):

  • Hidden Size: 2048
  • Attention Heads: 32
  • Layers: 24
  • Sequence Length: 2048
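
To sanity-check the stated size, here is a minimal back-of-the-envelope parameter count under these hyperparameters. The FFN width, the vocabulary size (taken from the 262K Gemma-3 tokenizer described below), and tied input/output embeddings are assumptions, not values stated on this card.

```python
# Rough parameter count for the Llama-style architecture above.
# intermediate_size, vocab size, and embedding tying are assumptions.
hidden = 2048
layers = 24
vocab = 262_144          # assumed: 262K Gemma-3 tokenizer vocabulary
intermediate = 8_192     # assumed FFN width (4x hidden)

embed = vocab * hidden                  # token embeddings (assumed tied with lm_head)
attn = 4 * hidden * hidden              # q, k, v, o projections per layer
mlp = 3 * hidden * intermediate         # gate, up, down projections per layer
norms = 2 * hidden                      # two RMSNorm weights per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
print(f"~{total / 1e9:.2f}B parameters")                # ~2.15B under these assumptions
```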

Training Data

The training data is a diverse mixture combining high-quality English, code, and math. The total token budget for training is 4 trillion tokens. The training mixture comprises the following datasets:

  • English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
  • Code: The StarCoder dataset.
  • Math: The FineMath 4+ dataset.

The final data split is based on a predefined proportion of English, code, and math, with the remaining token budget allocated to other languages.
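
As a purely illustrative example of how such a split translates into absolute token counts, the sketch below divides the 4-trillion-token budget using placeholder proportions; the actual mixture weights are not stated on this card.

```python
# Hypothetical split of the 4T-token budget; the weights below are
# placeholders, not the card's actual mixture proportions.
TOTAL_TOKENS = 4_000_000_000_000

weights = {"english": 0.70, "code": 0.15, "math": 0.05}   # assumed
allocation = {name: int(TOTAL_TOKENS * w) for name, w in weights.items()}
allocation["other_languages"] = TOTAL_TOKENS - sum(allocation.values())

for name, tokens in allocation.items():
    print(f"{name:>16}: {tokens / 1e9:,.0f}B tokens")
```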

[Figure: detailed English data mixture]

Tokenizer

The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.
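
A minimal usage sketch for the tokenizer is shown below. It assumes the tokenizer files are bundled in this model repository; if they are not, the same tokenizer could be loaded from a Gemma-3 checkpoint instead.

```python
# Load the 262K-vocabulary SentencePiece tokenizer (assumed to be shipped
# with this repository) and tokenize a short multilingual example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openeurollm/datamix-2b-en")
print(tokenizer.vocab_size)                         # ~262K entries
print(tokenizer.tokenize("Ein mehrsprachiges Beispiel."))
```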

Training Information

The model was trained using the Megatron-LM framework on the LUMI HPC supercomputer. Training utilized 64 AMD MI250X nodes, totaling approximately 165,000 GPU hours.

Intermediate Checkpoints

We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.

The naming convention is checkpoint_XXXXXXX, where XXXXXXX is the training iteration zero-padded to seven digits. For example, the checkpoint for 50,000 iterations is named checkpoint_0050000. The available checkpoints range from checkpoint_0005000 up to checkpoint_0953675. The final checkpoint, checkpoint_0953675, is available on the main branch.
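
Because each intermediate checkpoint lives in its own branch, it can be loaded by passing the branch name as the revision argument. The sketch below assumes the branch is named after the checkpoint itself (e.g. checkpoint_0050000); adjust the revision string if the branch naming differs.

```python
# Load the 50,000-step intermediate checkpoint from its branch.
# The branch name is assumed to match the checkpoint name.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openeurollm/datamix-2b-en",
    revision="checkpoint_0050000",   # assumed branch for the 50,000-step checkpoint
)
```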
