Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
Abstract
Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges, and standard methods to overcome it often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and may still fail to fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning of multi-word supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes each new unique token embedding via a hybrid heuristic that combines two estimates: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios than both the ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
Community
Pretrained language models (LLMs) are tied to a fixed tokenizer. This "tokenizer lock‑in" hurts efficiency and accuracy, especially in multilingual or domain‑specific settings. Replacing the tokenizer is attractive, but existing methods need costly end‑to‑end fine‑tuning and often lose meaning. We present a two‑part framework that keeps cost low and quality high.
TokenAdapt🛠️:
A model‑agnostic procedure that transplants a new tokenizer into a frozen LLM. Unique tokens are initialized with a hybrid heuristic that combines (a) a local approximation from subword decomposition in the old vocabulary and (b) a global approximation from the top‑k semantically closest tokens.
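To make the hybrid heuristic concrete, here is a minimal sketch of how a token unique to the new vocabulary could be initialized. The helper names (`old_tok`, `old_emb`, `sem_model`, `sem_index`) and the blending weight `w_global` are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of hybrid initialization for tokens unique to a new tokenizer.
# All handles below are placeholders, not the paper's exact API.
import torch
import torch.nn.functional as F

def init_unique_token(token_str, old_tok, old_emb, sem_model, sem_index,
                      w_global=0.3, top_k=8):
    # (a) Local estimate: decompose the new token with the OLD tokenizer and
    #     average the embeddings of the resulting subwords.
    sub_ids = old_tok.encode(token_str, add_special_tokens=False)
    local = old_emb[sub_ids].mean(dim=0)

    # (b) Global estimate: query an auxiliary semantic embedding space for the
    #     top-k most similar OLD-vocabulary tokens and take their
    #     similarity-weighted average in the model's embedding space.
    query = sem_model.encode(token_str)                  # vector in auxiliary space
    sims, neighbor_ids = sem_index.search(query, top_k)  # ids index into old vocab
    weights = F.softmax(torch.tensor(sims), dim=0)
    global_est = (weights.unsqueeze(1) * old_emb[neighbor_ids]).sum(dim=0)

    # Blend the two estimates; tokens shared with the old vocabulary are
    # simply copied over elsewhere.
    return (1 - w_global) * local + w_global * global_est
```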
Supertokens⚡:
A light pre‑tokenization stage that learns frequent multi‑word units, increasing compression and lowering sequence length.
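As a rough illustration of the idea (not the paper's exact recipe), one could mine frequent word n-grams from a corpus and add them as candidate supertokens before training the tokenizer; the frequency thresholds and space-marker convention below are assumptions.

```python
# Illustrative sketch: mine frequent multi-word phrases as supertoken candidates.
from collections import Counter

def mine_supertokens(corpus_lines, max_n=3, min_count=500, max_supertokens=2000):
    counts = Counter()
    for line in corpus_lines:
        words = line.split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    frequent = [p for p, c in counts.most_common() if c >= min_count]
    # Glue the words with a space marker so each phrase survives pre-tokenization
    # as a single unit when added to the tokenizer's vocabulary.
    return ["\u2581" + p.replace(" ", "\u2581") for p in frequent[:max_supertokens]]
```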
In zero‑shot evaluation across multiple base models and target tokenizers, TokenAdapt cuts perplexity ratios by up to 2x compared with ReTok and outperforms TransTokenizer without any extra training. When combined with supertokens, sequence length drops, further reducing compute.
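For reference, the perplexity-ratio comparison can be sketched as follows; the model and tokenizer handles are placeholders, and this is only an illustrative evaluation harness rather than the paper's exact protocol.

```python
# Rough sketch: perplexity of the transplanted model (new tokenizer) divided by
# that of the original model (old tokenizer) on the same held-out text.
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def perplexity_ratio(new_model, new_tok, base_model, base_tok, text):
    return perplexity(new_model, new_tok, text) / perplexity(base_model, base_tok, text)
```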
Our results show that tokenizer transplantation and learned supertokens can unlock the benefits of custom tokenizers while avoiding the heavy cost of full model retraining.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization (2025)
- AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation (2025)
- Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation (2025)
- Overcoming Vocabulary Constraints with Pixel-level Fallback (2025)
- Parameter-Efficient Transformer Embeddings (2025)
- Bielik v3 Small: Technical Report (2025)
- Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling (2025)