bvv241-abs: Unified Unicode Tokenizer (SOTA Intersection) with Frozen Embeddings and Extended Vector Dim (4096)
Tokenizer Description
This tokenizer is based on a hybrid vocabulary and uses a strictly structured Unicode mapping scheme:
- Plane 0 (0–65535): all single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching the standard Unicode BMP.
- Private and unused code ranges (the Plane 0 Private Use Area plus the supplementary range, e.g., 0xE000–0xF8FF and 65536–131071): all multi-character tokens (bigrams, trigrams, SOTA model token strings) are placed exclusively in these ranges.
- This design achieves total, lossless Unicode text coverage, with all multi-symbol tokens isolated above the core Unicode range (illustrated in the sketch after this list).
- The multi-character tokens are taken from the intersection of token text across leading SOTA models, covering the o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, and Qwen3-32B vocabularies.
- Vocabulary size: 131,072 tokens.
- Embedding dimension: 4096.
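The mapping scheme can be checked empirically. The following sketch is illustrative rather than taken from the repo: it assumes the standard transformers encode() API, and the expected ID ranges come from the description above, not from a guarantee of this card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Bochkov/bvv241-abs")

# Monograms: single BMP characters are described as mapping 1:1 to their code points.
for ch in ["A", "я", "中"]:
    ids = tokenizer.encode(ch, add_special_tokens=False)
    print(ch, ids, ord(ch))  # under the 1:1 scheme, ids should equal [ord(ch)]

# Multi-character tokens (bigrams, trigrams, SOTA token strings) are described as
# living only in the private-use / supplementary ranges (0xE000-0xF8FF, 65536-131071).
ids = tokenizer.encode("the quick brown fox", add_special_tokens=False)
print(ids)  # merged multi-character tokens, if any, should fall in those upper ranges
```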
The associated normalized_embeddings_weights.pt file contains a [vocab_size x embed_dim] (131,072 x 4096) matrix of precomputed, L2-normalized, frozen embeddings. No semantic information is encoded in these vectors, and they remain fixed throughout LM pretraining. They require no training or adaptation, which makes them suitable for plug-and-play use in research on embedding-free semantic emergence and modular LMs.
How to Get Started with the Tokenizer
```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-abs')

# Download and load the precomputed, L2-normalized, frozen embedding matrix
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-abs",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)  # tensor of shape [131072, 4096]
```
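Beyond loading, a natural next step is to plug the matrix into a model. The following is a minimal sketch, not an official API of this repo, assuming standard PyTorch: it wraps the downloaded matrix in a frozen nn.Embedding so it stays fixed during pretraining, then sanity-checks the shape and L2 normalization described above.

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

# Sketch only: wrap the precomputed matrix in a frozen embedding layer so it
# receives no gradient updates during LM pretraining.
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-abs",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)
emb_layer = nn.Embedding.from_pretrained(embeddings, freeze=True)

# Sanity checks against the description on this card.
print(emb_layer.weight.shape)                           # expected: torch.Size([131072, 4096])
print(torch.linalg.norm(emb_layer.weight[:5], dim=-1))  # expected: ~1.0 per row (L2-normalized)
```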
🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@misc{bochkov2025emergentsemanticstokenembeddings,
  title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
  author={A. Bochkov},
  year={2025},
  eprint={2507.04886},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04886},
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
}
```
This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs, a step toward modular, fusable, multilingual LMs.