---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# bvv241-abs: Unified Unicode Tokenizer (SOTA Intersection) with Frozen Embeddings and Extended Vector Dim (4096)

## Tokenizer Description

This tokenizer is based on a hybrid vocabulary that follows a strictly structured Unicode mapping scheme:

- Plane 0 (0–65535): all single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching the standard Unicode BMP.
- Private and unused code ranges (high Plane 0 plus the supplementary range, e.g., 0xE000–0xF8FF and 65536–131071):
  - All multi-character tokens (bigrams, trigrams, SOTA-model token strings) are placed exclusively in these ranges.
- This design achieves total, lossless Unicode text coverage, with all multi-symbol tokens isolated above the core Unicode range.
- The multi-character portion of the vocabulary is created from the intersection of token text across leading SOTA models, covering the o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, and Qwen3-32B vocabularies.
- Vocabulary size: 131,072 tokens.
- Embedding dimension: 4096.

The associated `normalized_embeddings_weights.pt` file contains a `[vocab_size x embed_dim]` matrix of precomputed, L2-normalized, frozen embeddings. No semantic information is encoded in these vectors, and they remain fixed throughout LM pretraining. Because no training or adaptation is required, the matrix is suitable for plug-and-play use in research on embedding-free semantic emergence and modular LMs.

## How to Get Started with the Tokenizer

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-abs')

# Download the precomputed, L2-normalized frozen embedding matrix
emb_path = hf_hub_download(
    repo_id="Bochkov/bvv241-abs",
    filename="normalized_embeddings_weights.pt"
)
embeddings = torch.load(emb_path)  # tensor of shape [vocab_size, embed_dim] = [131072, 4096]
```

Two sketches at the end of this card show how to sanity-check the mapping scheme and how to plug the frozen matrix into a model.

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```
@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
      author={A. Bochkov},
      year={2025},
      eprint={2507.04886},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886},
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129},
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs, a step toward modular, fusable, multilingual LMs.
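
## Example: Sanity-Checking the Mapping Scheme

A minimal sketch, assuming the `tokenizer` and `embeddings` objects from the getting-started snippet above. It tokenizes a short mixed-script string, prints each token id next to the decoded text (the mapping scheme above says single BMP characters should get ids equal to their code points, with multi-character tokens placed in the higher ranges), and checks that the corresponding embedding rows are L2-normalized. The exact printed ids depend on the tokenizer's internals, so treat the output as illustrative.

```python
import torch

text = "Hello, мир!"
ids = tokenizer.encode(text, add_special_tokens=False)

# Per the mapping scheme above, a token covering a single BMP character is
# expected to carry an id equal to that character's Unicode code point.
for tok_id in ids:
    tok_str = tokenizer.decode([tok_id])
    if len(tok_str) == 1:
        print(f"id={tok_id:6d}  char={tok_str!r}  codepoint={ord(tok_str)}")
    else:
        print(f"id={tok_id:6d}  multi-char token {tok_str!r} (expected in 0xE000-0xF8FF or 65536-131071)")

# The embedding rows are stated to be L2-normalized and frozen.
vecs = embeddings[torch.tensor(ids)]
print(vecs.shape)         # (num_tokens, 4096)
print(vecs.norm(dim=-1))  # each value should be ~1.0
```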
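
## Example: Plugging the Frozen Embeddings into a Model

A minimal sketch of the plug-and-play use mentioned above, again assuming `tokenizer` and `embeddings` from the getting-started snippet. It wraps the precomputed matrix in a frozen `nn.Embedding` layer so a downstream transformer receives fixed, non-trainable input vectors; the downstream model itself is hypothetical and not part of this repository.

```python
import torch
import torch.nn as nn

# Wrap the precomputed matrix as a non-trainable input embedding layer.
# freeze=True sets requires_grad=False, so the vectors stay fixed during pretraining.
frozen_emb = nn.Embedding.from_pretrained(embeddings.float(), freeze=True)

ids = torch.tensor([tokenizer.encode("Frozen embeddings, trainable blocks.",
                                     add_special_tokens=False)])
x = frozen_emb(ids)  # shape: (1, seq_len, 4096), ready to feed into transformer blocks
print(x.shape)

# Hypothetical downstream model: only the transformer blocks and head are trained.
# model = MyTransformerLM(d_model=4096, ...)   # not provided here
# logits = model(x)
```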