Papers
arxiv:2501.16975

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Published on Jan 28
Submitted by akhaliq on Jan 29
#3 Paper of the day

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
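To make the core idea concrete, here is a minimal, illustrative sketch (not the authors' code or the appendix pseudocode) of how an input vocabulary can be decoupled from the output vocabulary by adding large, hashed multi-gram embeddings on top of the standard token embeddings. The class name, the 2-gram-only choice, the hashing scheme, and all sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class OverTokenizedEmbedding(nn.Module):
    """Input-side embedding combining 1-gram and hashed 2-gram embeddings.

    The output (softmax) vocabulary stays at `vocab_size`; only the input
    embedding tables grow, adding parameters but almost no extra compute.
    """

    def __init__(self, vocab_size: int, ngram_table_size: int, d_model: int):
        super().__init__()
        self.unigram = nn.Embedding(vocab_size, d_model)
        # Large table for multi-gram tokens; collisions handled by hashing.
        self.bigram = nn.Embedding(ngram_table_size, d_model)
        self.vocab_size = vocab_size
        self.table_size = ngram_table_size

    def forward(self, token_ids: torch.LongTensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        uni = self.unigram(token_ids)

        # Build 2-gram ids from consecutive tokens; the first position is
        # padded with an assumed BOS/padding id of 0.
        prev = torch.roll(token_ids, shifts=1, dims=1)
        prev[:, 0] = 0
        bigram_ids = (prev * self.vocab_size + token_ids) % self.table_size
        bi = self.bigram(bigram_ids)

        # Sum the 1-gram and hashed 2-gram embeddings per position.
        return uni + bi


if __name__ == "__main__":
    emb = OverTokenizedEmbedding(vocab_size=32_000,
                                 ngram_table_size=1_000_000,
                                 d_model=512)
    x = torch.randint(0, 32_000, (2, 16))
    print(emb(x).shape)  # torch.Size([2, 16, 512])
```

The sketch only enlarges the input side; the language-modeling head and loss are unchanged, which is consistent with the abstract's claim that larger input vocabularies improve performance at no additional cost.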

Community

Paper submitter

[Screenshot attached: Screenshot 2025-01-28 at 11.49.17 PM.png]

Any code available to reproduce the paper's results?

Paper author

Thanks for your interest. The pseudocode in the appendix might help; we believe the gains should be easy to reproduce.


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2501.16975 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2501.16975 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2501.16975 in a Space README.md to link it from this page.

Collections including this paper 4