---
license: llama3.3
datasets:
- tokyotech-llm/swallow-code
language:
- en
- ja
base_model:
- meta-llama/Llama-3.1-8B
---

# Model Card

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowCodeMath Icon" width="600">

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-code/resolve/main/assets/experiments.png" width="800">

## Model Summary

This model is a continual pre-training of Llama-3.1-8B on the SwallowCode ablation and multilingual text datasets.
The model was trained to evaluate the performance of pylint-filtered Python code from The-Stack-v2 in the SwallowCode ablation experiments.

It was trained on 50 billion tokens using a mix of 16% SwallowCode (Experiment 3) and 84% multilingual text, following the setup described in the SwallowCode paper.

Training was performed using Megatron-LM.

## Use

### Intended Use

This model is intended for text completion in English and Japanese, with a focus on code generation tasks due to its training on pylint-filtered Python code from The-Stack-v2.
It is part of the [SwallowCode ablation models](https://huggingface.co/collections/tokyotech-llm/swallowcode-6811c84ff647568547d4e443) (Experiment 3, exp3-linter-filtered) and evaluates the effect of pylint filtering in the SwallowCode pipeline.
It is not instruction-tuned and is best suited for research purposes.

### Generation

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tokyotech-llm/<model-name>"
device = "cuda"  # use "cpu" for CPU-only inference

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Complete a Python function signature
inputs = tokenizer.encode("def fibonacci(n):", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```
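
Since the checkpoint was trained in bfloat16 (see the Model section below), you may want to load it in that precision to reduce memory use. A minimal sketch, assuming a CUDA device and that `torch` is installed; the model name is the same placeholder as above:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights in bfloat16 instead of the default float32.
model = AutoModelForCausalLM.from_pretrained(
    "tokyotech-llm/<model-name>",  # placeholder, replace with the actual repository name
    torch_dtype=torch.bfloat16,
).to("cuda")
```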

## Training

### Model
- **Architecture**: Llama-3.1
- **Pretraining tokens**: 50B
- **Precision**: bfloat16
- **Sequence length**: 8,192
- **Tokenizer**: Llama-3 tokenizer

### Data
The training mix consists of:

- 16% Code: pylint-filtered Python subset of The-Stack-v2-train-smol-ids (8B tokens), from SwallowCode, Experiment 3.
- 84% Multilingual Text:
  - Japanese Wikipedia (0.84B tokens)
  - Japanese Swallow Corpus v2 (26.1B tokens)
  - Laboro-ParaCorpus (0.22B tokens)
  - English Wikipedia (1.1B tokens)
  - English Cosmopedia (3.7B tokens)
  - English DCLM (10.0B tokens)

Details are in the paper’s Appendix.
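
As a rough sanity check, the per-source token counts above are consistent with the stated 16%/84% split of the 50B-token budget. A minimal sketch (counts taken from the list above; rounding explains the small residual):

```python
# Token counts in billions, as listed above.
code_tokens = 8.0  # pylint-filtered SwallowCode subset
multilingual_tokens = {
    "Japanese Wikipedia": 0.84,
    "Japanese Swallow Corpus v2": 26.1,
    "Laboro-ParaCorpus": 0.22,
    "English Wikipedia": 1.1,
    "English Cosmopedia": 3.7,
    "English DCLM": 10.0,
}

total = code_tokens + sum(multilingual_tokens.values())  # ~49.96B, i.e. the 50B budget
print(f"code share: {code_tokens / total:.1%}")                         # ~16.0%
print(f"text share: {sum(multilingual_tokens.values()) / total:.1%}")   # ~84.0%
```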

### Hardware
- GPUs: 64 NVIDIA H100 (94GB)
- Interconnect: InfiniBand NDR200
- Supercomputer: TSUBAME, Institute of Science Tokyo

### Software
- Megatron-LM (version core_r0.9.0) for training
- lm-evaluation-harness for evaluation
- BigCodeBench for code evaluation

## Evaluation

The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
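
The general-task scores below can in principle be reproduced with lm-evaluation-harness. A minimal sketch, assuming lm-evaluation-harness v0.4+ and a placeholder model name; the exact task configurations and few-shot settings used in the paper may differ:

```python
# pip install -q lm-eval
import lm_eval

# Evaluate one benchmark (HellaSwag here) for a Hugging Face checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tokyotech-llm/<model-name>,dtype=bfloat16",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])
```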

### Evaluation Results (Experiment 3)

| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | GSM8K | BBH | HumanEval | HumanEval+ |
|------------|------------|----------|-----------|----------|--------|--------|--------|--------|-----------|------------|
| 10 | 0.3560 | 0.6628 | 0.6010 | 0.3340 | 0.9071 | 0.6235 | 0.4564 | 0.6007 | 0.3500 | 0.3488 |
| 20 | 0.3500 | 0.6613 | 0.6015 | 0.3361 | 0.9054 | 0.6237 | 0.4860 | 0.5838 | 0.3744 | 0.3787 |
| 30 | 0.3620 | 0.6596 | 0.6008 | 0.3359 | 0.9080 | 0.6307 | 0.4867 | 0.5921 | 0.3957 | 0.3878 |
| 40 | 0.3720 | 0.6650 | 0.6030 | 0.3352 | 0.9058 | 0.6326 | 0.4822 | 0.5990 | 0.3890 | 0.3915 |
| 50 | 0.3740 | 0.6677 | 0.6054 | 0.3291 | 0.9019 | 0.6327 | 0.4996 | 0.6145 | 0.3945 | 0.3902 |

*Source: Table 4 from the SwallowCode paper, showing performance of the syntax-error- and pylint-filtered (score ≥ 7) Python subset.*

## Citation

```bibtex
@misc{fujii2025rewritingpretrainingdata,
  title={Rewriting Pre-Training Data: Boosting LLM Performance in Math and Code},
  author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
  year={2025},
  eprint={XXXX.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/XXXX.XXXXX},
}
```