Model Card

Model Summary

This model is a continual pre-training of Llama-3.1-8B on a SwallowCode ablation dataset mixed with multilingual text. It was trained to evaluate the performance of syntax-error-filtered Python code from The-Stack-v2 in the SwallowCode ablation experiments.

It was trained on 50 billion tokens using a mix of 16% SwallowCode (Experiment 2) and 84% multilingual text, following the setup described in the SwallowCode paper.

Training was performed using Megatron-LM.

Use

Intended Use

This model is intended for text completion in English and Japanese, with a focus on code generation tasks due to its training on syntax-error-free Python code from The-Stack-v2. It is part of the SwallowCode ablation models (Experiment 2, exp2-syntax-error-filtered) and evaluates the effect of syntax error filtering in the SwallowCode pipeline. It is not instruction-tuned and is best suited for research purposes.

Generation

# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Llama-3.1-8B-code-ablation-exp2-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0005000"
device = "cuda"  # use "cpu" for CPU-only inference

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Complete a code prompt with greedy decoding.
inputs = tokenizer.encode("def fibonacci(n):", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
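
Optionally, since the checkpoint was trained in bfloat16 (see Training below), the weights can also be loaded in bfloat16 to roughly halve memory use relative to float32. This is a minimal, optional variant of the snippet above (reusing model_id and device), not a requirement:

import torch
from transformers import AutoModelForCausalLM

# Optional: load the weights in bfloat16, matching the training precision,
# to reduce GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)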

Training

Model

  • Architecture: Llama-3.1
  • Pretraining tokens: 50B
  • Precision: bfloat16
  • Sequence length: 8,192
  • Tokenizer: Llama-3 tokenizer

Data

The training mix consists of:

  • 16% Code: Syntax-error-free Python subset of The-Stack-v2-train-smol-ids (8B tokens), from SwallowCode, Experiment 2.
  • 84% Multilingual Text:
    • Japanese Wikipedia (0.84B tokens)
    • Japanese Swallow Corpus v2 (26.1B tokens)
    • Laboro-ParaCorpus (0.22B tokens)
    • English Wikipedia (1.1B tokens)
    • English Cosmopedia (3.7B tokens)
    • English DCLM (10.0B tokens)

Details are in the paper’s Appendix.
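
As a quick sanity check, the per-source token counts listed above reproduce the stated 16% / 84% split and the ~50B-token total. The snippet below is purely illustrative and not part of the training pipeline:

# Token counts in billions, as listed above (illustrative only).
code_tokens = 8.0  # syntax-error-free Python subset (Experiment 2)
text_tokens = {
    "ja_wikipedia": 0.84,
    "ja_swallow_corpus_v2": 26.1,
    "laboro_paracorpus": 0.22,
    "en_wikipedia": 1.1,
    "en_cosmopedia": 3.7,
    "en_dclm": 10.0,
}
total = code_tokens + sum(text_tokens.values())
print(f"total: {total:.2f}B tokens")                           # ~49.96B, i.e. ~50B
print(f"code share: {code_tokens / total:.1%}")                # ~16.0%
print(f"text share: {sum(text_tokens.values()) / total:.1%}")  # ~84.0%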

Hardware

  • GPUs: 64 NVIDIA H100 (94GB)
  • Interconnect: InfiniBand NDR200
  • Supercomputer: TSUBAME, Institute of Science Tokyo

Software

  • Megatron-LM (version core_r0.9.0) for training
  • lm-evaluation-harness for evaluation
  • BigCodeBench for code evaluation

Evaluation

The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
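
For reference, the general-task portion of such a setup can be sketched with the lm-evaluation-harness Python API. This is a minimal sketch under assumptions (harness version >= 0.4, default task configurations), not the paper's exact evaluation configuration; HumanEval and HumanEval+ are evaluated separately with BigCodeBench:

# Minimal sketch, assuming lm-evaluation-harness >= 0.4; the task names and
# default settings here are illustrative, not the paper's exact config.
import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained="tokyotech-llm/Llama-3.1-8B-code-ablation-exp2-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0005000")
results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["openbookqa", "triviaqa", "hellaswag", "gsm8k"],
)
print(results["results"])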

Evaluation Results (Experiment 2)

| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO  | MMLU   | GSM8K  | BBH    | HumanEval | HumanEval+ |
|-----------:|-----------:|---------:|----------:|---------:|-------:|-------:|-------:|-------:|----------:|-----------:|
| 10         | 0.3560     | 0.6675   | 0.6015    | 0.3385   | 0.9062 | 0.6321 | 0.4784 | 0.5881 | 0.3604    | 0.3713     |
| 20         | 0.3520     | 0.6635   | 0.6026    | 0.3364   | 0.9049 | 0.6252 | 0.4784 | 0.5781 | 0.3591    | 0.3585     |
| 30         | 0.3560     | 0.6637   | 0.6012    | 0.3375   | 0.9080 | 0.6313 | 0.5019 | 0.5950 | 0.3701    | 0.3762     |
| 40         | 0.3580     | 0.6679   | 0.6046    | 0.3346   | 0.9062 | 0.6330 | 0.5019 | 0.5998 | 0.3720    | 0.3689     |
| 50         | 0.3660     | 0.6694   | 0.6055    | 0.3340   | 0.9084 | 0.6325 | 0.5155 | 0.6044 | 0.3787    | 0.3787     |

Source: Table 3 from the SwallowCode paper, showing performance of the syntax-error-free Python subset.

Citation

@misc{fujii2025rewritingpretrainingdataboosts,
      title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code}, 
      author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
      year={2025},
      eprint={2505.02881},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.02881}, 
}