Best demo models [pretrain]
Frozen embedding LMs (en/ru/zh) & their MoE fusion. Baselines: frozen vs unfrozen embedding ablation.
Bochkov/best_bvv_moe
best_bvv_moe is a demonstration-scale Mixture-of-Experts (MoE) decoder-only causal language model that combines two independently trained models (Russian and Chinese) with a strictly frozen, shared visual/Unicode-based token embedding matrix. Each "expert" was pre-trained on a small bilingual corpus (English-Russian and English-Chinese, respectively) of ~9B total tokens with ~10% SFT-like samples mixed in, using the same, fully frozen embedding matrix for all languages.
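A minimal PyTorch sketch of the fusion idea above: two pre-trained expert stacks sit behind one shared, frozen embedding matrix, and a lightweight router mixes their outputs. The router, the stand-in TransformerEncoder experts, and all sizes are illustrative assumptions, not the released best_bvv_moe architecture.

```python
# Illustrative-only sketch: fuse two pre-trained decoder stacks behind one
# shared, frozen embedding. Router, sizes, and the stand-in TransformerEncoder
# experts (no causal mask, for brevity) are assumptions, not the real model.
import torch
import torch.nn as nn

class TwoExpertMoE(nn.Module):
    def __init__(self, shared_embedding, expert_ru, expert_zh, lm_head):
        super().__init__()
        self.embed = shared_embedding
        self.embed.weight.requires_grad = False                      # strictly frozen
        self.experts = nn.ModuleList([expert_ru, expert_zh])
        self.router = nn.Linear(shared_embedding.embedding_dim, 2)   # toy per-sequence gate
        self.lm_head = lm_head

    def forward(self, input_ids):
        h = self.embed(input_ids)                                    # (B, T, D)
        gate = self.router(h.mean(dim=1)).softmax(dim=-1)            # (B, 2)
        outs = torch.stack([e(h) for e in self.experts])             # (2, B, T, D)
        mixed = (gate.t()[:, :, None, None] * outs).sum(0)           # weighted expert mix
        return self.lm_head(mixed)                                   # (B, T, vocab)

vocab, dim = 1024, 64
make_expert = lambda: nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
emb = nn.Embedding(vocab, dim)   # stands in for the shared matrix both experts were trained with
moe = TwoExpertMoE(emb, make_expert(), make_expert(), nn.Linear(dim, vocab))
logits = moe(torch.randint(0, vocab, (2, 16)))                       # -> (2, 16, 1024)
```

Because both experts were trained against the identical frozen embedding, they consume the same input representation, which is what makes this kind of post-hoc fusion possible without retraining.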
Bochkov/best_bvv_ru
Proof-of-concept Transformer LM with frozen, non-semantic token embeddings, trained on a small English-Russian corpus. This model is part of a series designed to demonstrate (1) the viability of transformer language models whose embedding layer is precomputed from non-semantic (Unicode/visual) features and kept entirely frozen during training, and (2) the possibility of modular/federated model fusion (MoE) by combining models that share a token embedding matrix, without any additional retraining.
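A minimal sketch of point (1), under an assumed toy feature recipe: the embedding table is derived deterministically from each token's Unicode codepoints (no learned semantics) and loaded frozen. The actual visual/Unicode feature construction used by the best_bvv_* models is not reproduced here.

```python
# Sketch only: build a deterministic, non-semantic embedding table from
# Unicode codepoints and freeze it. The real visual-feature recipe differs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def unicode_feature_embedding(vocab, dim):
    table = torch.zeros(len(vocab), dim)
    for i, token in enumerate(vocab):
        for pos, ch in enumerate(token):
            gen = torch.Generator().manual_seed(ord(ch))   # seeded by codepoint only
            table[i] += torch.randn(dim, generator=gen) / (pos + 1)
    return F.normalize(table, dim=-1)

vocab = ["<pad>", "hello", "мир", "世界"]
embed = nn.Embedding.from_pretrained(unicode_feature_embedding(vocab, 64), freeze=True)
assert not embed.weight.requires_grad                      # never updated during training
```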
Bochkov/best_bvv_unfrozen_ru
best_bvv_unfrozen_ru is a 500M-parameter causal language model for Russian (and some English), trained as an open baseline for the "frozen embeddings" proof-of-concept. This version uses fully trainable token embeddings (the standard setup) and serves as a direct point of comparison with the corresponding frozen-embedding model Bochkov/best_bvv_ru.
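A hedged sketch of the ablation this baseline enables: the two variants differ only in whether the embedding weights receive gradients. get_input_embeddings() is the standard transformers accessor; whether these specific checkpoints expose it through their custom code is an assumption.

```python
# Sketch of the frozen-vs-unfrozen ablation switch. Assumes a Hugging Face
# transformers-style model exposing get_input_embeddings(); the checkpoints
# above may package this differently.
def set_embedding_trainable(model, trainable: bool) -> None:
    model.get_input_embeddings().weight.requires_grad = trainable

def count_trainable(model) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Note that freezing the embeddings of best_bvv_unfrozen_ru this way would only approximate the frozen variant, since the frozen models also start from the visual/Unicode initialization rather than learned weights.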
Bochkov/best_bvv_zh
best_bvv_zh is a conceptual bilingual (English + Chinese) transformer language model trained from scratch on a limited 9B-token corpus, as a demonstration of the frozen-embedding hypothesis for robust, language-agnostic, and easily combinable language models. The embedding matrix is frozen after visual (Unicode-morpheme) initialization; all transformer layers and the output head are trainable.
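A small sketch of that training split, using a toy stand-in model: the embedding is excluded from optimization while the transformer blocks and output head are trained. Module names, sizes, and hyperparameters are placeholders, not the released configuration.

```python
# Toy stand-in illustrating "embedding frozen, everything else trainable".
# Architecture, names, and hyperparameters are placeholders.
import torch
import torch.nn as nn

class TinyFrozenEmbedLM(nn.Module):
    def __init__(self, vocab=512, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.embed.weight.requires_grad = False   # visual/Unicode init assumed done elsewhere
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)      # trainable output head

    def forward(self, ids):
        return self.lm_head(self.blocks(self.embed(ids)))

model = TinyFrozenEmbedLM()
# only parameters with requires_grad=True (blocks + head) reach the optimizer
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4)
```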
Bochkov/best_bvv_unfrozen_zh
best_bvv_unfrozen_zh is a 0.5B-parameter causal Transformer language model trained on a minimal combined English-Chinese corpus (9B tokens total, ~10% SFT/instruction mix) with an open-vocabulary Unicode-based tokenizer. The embedding layer is trainable (not frozen), for direct comparison with the frozen-embedding variant best_bvv_zh.
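A minimal sketch of what "open-vocabulary Unicode-based tokenizer" can mean: every character maps to its codepoint plus a small special-token offset, so no input is ever out-of-vocabulary. The released tokenizer's special tokens and ID layout are assumptions and may differ.

```python
# Sketch of a codepoint-level, open-vocabulary tokenizer. The released
# tokenizer's special tokens and ID offsets may differ.
SPECIALS = {"<pad>": 0, "<bos>": 1, "<eos>": 2}
OFFSET = len(SPECIALS)

def encode(text: str) -> list[int]:
    return [SPECIALS["<bos>"]] + [ord(c) + OFFSET for c in text] + [SPECIALS["<eos>"]]

def decode(ids: list[int]) -> str:
    return "".join(chr(i - OFFSET) for i in ids if i >= OFFSET)

assert decode(encode("Привет, 世界!")) == "Привет, 世界!"
```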
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
Paper • 2507.04886
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Paper • 2507.07129
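A hedged usage sketch for trying any of the checkpoints above with the transformers library. Whether these repos require trust_remote_code=True (custom modeling/tokenizer code) is an assumption; drop the flag if the standard classes load cleanly.

```python
# Usage sketch; trust_remote_code=True is an assumption about how the repos
# package their custom modeling/tokenizer code.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Bochkov/best_bvv_ru"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

inputs = tok("Столица России - ", return_tensors="pt")   # Russian prompt: "The capital of Russia is"
out = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```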