Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit
Abstract
A training-free method using Orthogonal Matching Pursuit (OMP) effectively transplants tokenizers in pretrained large language models, preserving performance across different tokenizers without gradient updates.
We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks, Llama → Mistral NeMo (12B) and Qwen → Llama (1B), we show that OMP achieves the best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptations. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
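The two-phase procedure in the abstract maps onto a few lines of linear algebra: fit a k-sparse code for each new token against shared anchor tokens in the donor embedding space, then apply the same coefficients to the base model's rows for those anchors. The snippet below is a minimal NumPy sketch under those assumptions, with illustrative function names and random matrices standing in for real embedding tables; it is not the authors' implementation or the mergekit-tokensurgeon API.

```python
# Illustrative sketch of OMP-based tokenizer transplantation (toy greedy OMP,
# hypothetical function names, random data in place of real embedding tables).
import numpy as np


def omp(anchors, target, k):
    """Greedy Orthogonal Matching Pursuit.

    anchors: (n_anchors, dim) donor-space embeddings of shared anchor tokens
    target:  (dim,) donor-space embedding of the out-of-vocabulary token
    k:       sparsity level (number of anchors to select)
    Returns (selected indices, coefficients).
    """
    norms = np.linalg.norm(anchors, axis=1) + 1e-12
    residual = target.copy()
    selected, coeffs = [], None
    for _ in range(k):
        # Pick the anchor most correlated with the current residual.
        scores = (anchors @ residual) / norms
        scores[selected] = 0.0  # never reselect an anchor
        selected.append(int(np.argmax(np.abs(scores))))
        # Jointly refit coefficients over all selected anchors (least squares),
        # keeping the residual orthogonal to the chosen subspace.
        A = anchors[selected].T
        coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coeffs
    return selected, coeffs


def transplant_token(donor_new, donor_shared, base_shared, k=8):
    """Phase 1: sparse-code the new token in the donor space.
    Phase 2: reuse the same coefficients over the base model's anchor rows."""
    idx, w = omp(donor_shared, donor_new, k)
    return w @ base_shared[idx]


# Toy usage: 1,000 shared anchors, donor dim 64, base dim 96.
rng = np.random.default_rng(0)
donor_shared = rng.normal(size=(1000, 64))
base_shared = rng.normal(size=(1000, 96))
new_token_donor = rng.normal(size=64)
print(transplant_token(new_token_donor, donor_shared, base_shared).shape)  # (96,)
```

In the real setting, `donor_shared` and `base_shared` would be the embedding rows of tokens whose string forms appear in both vocabularies; the released mergekit-tokensurgeon integration is the place to look for the full alignment pipeline.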
Community
A training-free method to transplant tokenizers between pretrained language models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning (2025)
- TokAlign: Efficient Vocabulary Adaptation via Token Alignment (2025)
- AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings (2025)
- Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer (2025)
- HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization (2025)
- zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression (2025)
- Parameter-Efficient Transformer Embeddings (2025)