---
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
- bigcode/starcoderdata
language:
- en
- bg
- cs
- da
- de
- el
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- es
- sv
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
---
# Model Details
This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters (a parameter-count sketch follows the list):
- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048

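The card does not state the feed-forward (intermediate) size or whether the embeddings are tied, but the values assumed below (an FFN size of 8192 and tied input/output embeddings, with the 262K Gemma-3 vocabulary) reproduce the stated ~2.15B parameter count. Treat them as an illustration, not the exact training configuration.

```python
# Rough parameter count for the hyperparameters listed above.
# The FFN size, tied embeddings, and vocabulary size are assumptions
# (not stated in this card), chosen to show how ~2.15B can arise.
hidden, layers, vocab, ffn = 2048, 24, 262_144, 8192

embedding = vocab * hidden                       # input embeddings, assumed tied with the LM head
attention = 4 * hidden * hidden                  # Q, K, V and output projections per layer
mlp = 3 * hidden * ffn                           # gate, up and down projections per layer
per_layer = attention + mlp + 2 * hidden         # plus two RMSNorm weight vectors

total = embedding + layers * per_layer + hidden  # plus the final norm
print(f"~{total / 1e9:.2f}B parameters")         # prints ~2.15B
```
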
# Training Data
The training data is a diverse, multilingual dataset combining high-quality English, code, math, and European-language corpora. The total token budget for training is 4 trillion tokens. The training mixture comprises the following datasets:
- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.
- Multilingual: The cleaned version of HPLT 2.0, covering 36 official EU and partner languages.

The final data split is based on predefined proportions for English, code, and math, with the remaining token budget allocated to the other languages. The minimum proportion for each language in the mix is 0.05% (with the exception of nno_Latn, which is combined with nob_Latn for the proportion calculation).

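The exact mixing weights are not given in this card; the sketch below only illustrates the allocation scheme described above, with fixed shares for English, code, and math, the remainder split across languages, and the 0.05% floor applied per language. All shares and language weights in the example are made up.

```python
# Illustrative sketch of the token-budget allocation described above.
# The fixed shares and per-language weights are made-up examples, not the
# proportions actually used for training.
TOTAL_TOKENS = 4_000_000_000_000                       # 4T-token budget
FIXED = {"english": 0.45, "code": 0.10, "math": 0.05}  # assumed shares
MIN_LANG_SHARE = 0.0005                                # 0.05% floor per language

def allocate(lang_weights: dict[str, float]) -> dict[str, int]:
    """Split the remaining budget over languages, enforcing the 0.05% floor."""
    remaining = 1.0 - sum(FIXED.values())
    norm = sum(lang_weights.values())
    shares = {lang: remaining * w / norm for lang, w in lang_weights.items()}
    # Raise any language below the floor to 0.05% of the total budget
    # (renormalisation of the other shares is omitted for brevity).
    shares = {lang: max(s, MIN_LANG_SHARE) for lang, s in shares.items()}
    return {lang: int(s * TOTAL_TOKENS) for lang, s in shares.items()}

# Hypothetical relative weights, e.g. proportional to available HPLT 2.0 tokens.
print(allocate({"deu_Latn": 5.0, "fra_Latn": 4.5, "mlt_Latn": 0.01}))
```
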
![detailed_data_9010](https://cdn-uploads.huggingface.co/production/uploads/618bf745f723a0c1e7f2ce6d/JpJ2MUuSST4RnwOUIiOLB.png)

# Tokenizer
The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K-token vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.

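Assuming the tokenizer files are included in this repository (as is typical for Hugging Face model repos), it can be loaded and inspected as sketched below; the repository id is a placeholder, since it is not stated in this card.

```python
# Minimal sketch: load the tokenizer bundled with this model and check its size.
# "org/model-name" is a placeholder; substitute the actual repository id.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("org/model-name")
print(len(tokenizer))  # expected to be in the 262K range (Gemma-3 vocabulary)

# Multilingual round trip.
ids = tokenizer("Hyvää huomenta! Bonjour à tous.")["input_ids"]
print(tokenizer.decode(ids))
```
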
# Training Information
The model was trained with the Megatron-LM framework on the LUMI HPC supercomputer, using 64 AMD MI250X nodes for a total of approximately 165,000 GPU hours.

# Intermediate Checkpoints
We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5,000 training steps.

The naming convention is `checkpoint_` followed by the training step zero-padded to seven digits; for example, the checkpoint for 50,000 iterations is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the main branch.
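
To use a specific intermediate checkpoint, pass its branch name as the `revision` argument to `from_pretrained`; the repository id below is a placeholder.

```python
# Sketch: load an intermediate checkpoint from its branch.
# "org/model-name" is a placeholder for this model's repository id.
from transformers import AutoModelForCausalLM, AutoTokenizer

step = 50_000
revision = f"checkpoint_{step:07d}"   # -> "checkpoint_0050000"

tokenizer = AutoTokenizer.from_pretrained("org/model-name", revision=revision)
model = AutoModelForCausalLM.from_pretrained("org/model-name", revision=revision)
```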