---
license: llama3.3
datasets:
- tokyotech-llm/swallow-code
language:
- en
- ja
base_model:
- meta-llama/Llama-3.1-8B
---

# Model Card

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowMath Icon" width="600">

## Model Summary

This model is a continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on a mix of the Python subset of [The-Stack-v2-train-smol-ids](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids) (from [SwallowCode, Experiment 1](https://huggingface.co/datasets/tokyotech-llm/swallow-code)) and multilingual text datasets.
It was trained to measure the baseline performance of unfiltered Python code from The-Stack-v2 within the SwallowCode ablation experiments.

The model establishes baseline scores on code generation benchmarks (HumanEval and HumanEval+) while maintaining general capabilities on knowledge, reasoning, and common-sense benchmarks.
It serves as the reference point against which the subsequent ablation experiments in the SwallowCode pipeline are compared.

It was trained on **50 billion tokens**, using a mix of 16% SwallowCode (Experiment 1, Python subset) and 84% multilingual text, following the setup described in the [SwallowCode paper](https://arxiv.org/abs/XXXX.XXXXX).
Training was performed with [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.9.0).

## Use

### Generation

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/<model-name>"  # <model-name> is a placeholder for the released checkpoint
device = "cuda"  # use "cpu" for CPU-only inference

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

inputs = tokenizer.encode("def fibonacci(n):", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```
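
The same prompt can be run through the higher-level `pipeline` API. This is a minimal alternative sketch: it assumes the same `<model-name>` placeholder and that `accelerate` is installed for `device_map="auto"`.

```python
# Minimal sketch using the transformers pipeline API.
# "<model-name>" is a placeholder for the actual repository name.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tokyotech-llm/<model-name>",
    device_map="auto",  # requires accelerate; places the model on GPU if one is available
)
print(generator("def fibonacci(n):", max_new_tokens=64)[0]["generated_text"])
```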

## Training

### Model
- **Architecture**: Llama-3.1
- **Pretraining tokens**: 50B
- **Precision**: bfloat16
- **Sequence length**: 8,192
- **Tokenizer**: Llama-3 tokenizer
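
The listed settings can be checked against the uploaded checkpoint itself. This is a minimal sketch, not part of the release: `<model-name>` is the same placeholder as above, and the config's `max_position_embeddings` reflects the Llama-3.1 architecture rather than the 8,192-token sequences used for this training run.

```python
# Minimal sketch: inspect the checkpoint's config and tokenizer.
# "<model-name>" is a placeholder for the actual repository name.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("tokyotech-llm/<model-name>")
print(config.model_type)               # Llama architecture
print(config.torch_dtype)              # precision stored in the config
print(config.max_position_embeddings)  # architectural limit; training used 8,192-token sequences

tokenizer = AutoTokenizer.from_pretrained("tokyotech-llm/<model-name>")
print(len(tokenizer))                  # Llama-3 tokenizer vocabulary size
```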

### Data
The training mix consists of:

- 16% Code: Python subset of The-Stack-v2-train-smol-ids (8B tokens), from SwallowCode, Experiment 1.
- 84% Multilingual Text:
  - Japanese Wikipedia (0.84B tokens)
  - Japanese Swallow Corpus v2 (26.1B tokens)
  - Laboro-ParaCorpus (0.22B tokens)
  - English Wikipedia (1.1B tokens)
  - English Cosmopedia (3.7B tokens)
  - English DCLM (10.0B tokens)

Details are in the paper's Appendix.
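
As a sanity check, the token counts above add up to roughly the 50B-token budget, with the code share matching the stated 16%. The sketch below only reproduces that arithmetic from the figures listed in this section.

```python
# Sanity check of the training mix (token counts in billions, taken from the list above).
code_tokens = 8.0  # Python subset of The-Stack-v2-train-smol-ids (Experiment 1)
text_tokens = {
    "Japanese Wikipedia": 0.84,
    "Japanese Swallow Corpus v2": 26.1,
    "Laboro-ParaCorpus": 0.22,
    "English Wikipedia": 1.1,
    "English Cosmopedia": 3.7,
    "English DCLM": 10.0,
}

total = code_tokens + sum(text_tokens.values())
print(f"total: {total:.2f}B tokens")             # ~49.96B, i.e. the ~50B budget
print(f"code share: {code_tokens / total:.1%}")  # ~16.0%
print(f"text share: {sum(text_tokens.values()) / total:.1%}")  # ~84.0%
```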

### Hardware
- GPUs: 64 NVIDIA H100 (94GB)
- Interconnect: InfiniBand NDR200
- Supercomputer: TSUBAME, Institute of Science Tokyo

### Software
- Megatron-LM (version core_r0.9.0) for training
- lm-evaluation-harness for evaluation
- BigCodeBench for code evaluation

## Evaluation
The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.

### Evaluation Results (Experiment 1)

| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | GSM8K | BBH | HumanEval | HumanEval+ |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.3640 | 0.6659 | 0.5995 | 0.3354 | 0.9032 | 0.6294 | 0.4602 | 0.6019 | 0.3366 | 0.3366 |
| 20 | 0.3540 | 0.6567 | 0.6019 | 0.3360 | 0.9024 | 0.6238 | 0.4852 | 0.5898 | 0.3433 | 0.3433 |
| 30 | 0.3700 | 0.6588 | 0.6034 | 0.3377 | 0.9045 | 0.6263 | 0.5072 | 0.5939 | 0.3402 | 0.3421 |
| 40 | 0.3800 | 0.6618 | 0.6053 | 0.3380 | 0.9097 | 0.6341 | 0.5011 | 0.6016 | 0.3659 | 0.3701 |
| 50 | 0.3700 | 0.6679 | 0.6054 | 0.3350 | 0.9045 | 0.6340 | 0.5027 | 0.6091 | 0.3689 | 0.3720 |
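
For reference, a subset of the general benchmarks above can be scored with the public lm-evaluation-harness Python API. The sketch below is illustrative only: the task names, zero-shot setting, and v0.4-style API are assumptions, not the paper's exact evaluation configuration.

```python
# Hedged sketch: run a few general benchmarks with lm-evaluation-harness (v0.4-style API assumed).
# "<model-name>" is a placeholder; tasks and settings are illustrative, not the paper's setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tokyotech-llm/<model-name>,dtype=bfloat16",
    tasks=["openbookqa", "hellaswag", "triviaqa"],
    batch_size=8,
)
print(results["results"])
```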

## Citation

```bibtex
@misc{fujii2025rewritingpretrainingdata,
      title={Rewriting Pre-Training Data: Boosting LLM Performance in Math and Code},
      author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
      year={2025},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/XXXX.XXXXX},
}
```