Tereshchenko Blue — Gemma‑3 tokenizer faceted to let Ukrainian shine.

Tereshchenko Blue is the second-largest blue diamond in the world.

By adding more than 80K Ukrainian tokens without removing any English or EU-language tokens, Tereshchenko Blue makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.

How is this possible?

More than 16 of the world's most widely used writing systems were analyzed. Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned.
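As an illustration of how such an analysis might look, the sketch below buckets Gemma-3 vocabulary entries by the Unicode name prefix of their first letter-like character. It is a rough, hypothetical reconstruction of the idea, not the actual pruning script used for this release.

```python
# Hypothetical sketch: group Gemma-3 vocabulary entries by writing system
# using the Unicode name of the first letter-like character in each token.
# The real pruning procedure is not published; this only illustrates the idea.
from collections import Counter
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

def script_of(token: str) -> str:
    for ch in token:
        if ch.isalpha():
            # e.g. "CJK UNIFIED IDEOGRAPH-...", "BENGALI LETTER KA", "THAI ..."
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"  # digits, punctuation, byte-fallback pieces, etc.

counts = Counter(script_of(t) for t in tokenizer.get_vocab())
print(counts.most_common(25))
```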

Replaced tokens

| Writing system | Tokens removed | Tokens retained |
| --- | ---: | ---: |
| Han (Chinese) | 16,488 | 4,122 |
| Devanagari (Hindi) | 10,976 | 2,743 |
| Bengali | 7,983 | 1,995 |
| Arabic | 6,730 | 1,682 |
| Hiragana / Katakana (Japanese) | 3,944 | 985 |
| Hangul (Korean) | 3,744 | 935 |
| Tamil | 3,080 | 770 |
| Thai | 1,740 | 435 |
| Malayalam | 1,566 | 391 |
| Telugu | 1,428 | 356 |
| Gujarati | 1,080 | 270 |
| Kannada | 1,016 | 253 |
| Ethiopic | 691 | 172 |
| Hebrew | 670 | 167 |
| Khmer | 481 | 119 |
| Sinhala | 435 | 108 |
| Myanmar | 410 | 102 |
| Lao | 243 | 60 |
| Gurmukhi | 215 | 53 |
| Tibetan | 107 | 26 |
| Oriya | 100 | 25 |
| Cyrillic | 13,398 | 0 |
| Gemma-3 `<unused-*>` | 6,139 | 102 |

Feature Overview:

  1. +81,492 new Cyrillic BPE tokens from malyuk_qirim_tokenizer.json, trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic slice of the Crimean Tatar corpus.
  2. Only the tokens listed in the Replaced tokens table were swapped out; no tokens from any other writing system were affected.
  3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings (see the sketch after this list).
  4. Vocabulary size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
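Point 3 can be checked directly. The sketch below (an illustrative check, not part of the release) compares the token-to-ID maps of the base and re-faceted tokenizers:

```python
# Sketch: confirm that tokens shared by both vocabularies keep the same IDs,
# which is what allows Gemma-3 embedding rows to be reused directly.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
ours = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)

base_vocab, ours_vocab = base.get_vocab(), ours.get_vocab()
shared = set(base_vocab) & set(ours_vocab)
mismatched = [t for t in shared if base_vocab[t] != ours_vocab[t]]

print(f"vocab sizes: {len(base_vocab)} vs {len(ours_vocab)}")   # should match
print(f"shared tokens: {len(shared):,}; ID mismatches: {len(mismatched)}")  # expect 0
```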

Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [55939, 124769, 117298, 199258] only 4 tokens 💪🏻
```
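For comparison, the same sentence can be run through the original Gemma-3 tokenizer (any Gemma-3 checkpoint ships the same tokenizer; the resulting count is not quoted here, so run it to see the difference):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
base_toks = base("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(base_toks.input_ids), "tokens with the original Gemma-3 tokenizer")
```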

Metrics

Acknowledgement: evaluation results provided by @Sofetory.

Each cell shows the total token count, with tokens per word (toks/word) in parentheses.

| Tokenizer | lang-uk/malyuk, 100k texts | allenai/c4 (en), 100k texts | allenai/c4 (es, fr, it, de), 400k texts | QIRIM/crh_monocorpus (Cyrillic), 94 texts | allenai/c4 (ru), 100k texts | allenai/c4 (bg), 100k texts | allenai/c4 (be), 100k texts |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word count | 22,898,164 | 36,170,971 | 198,173,216 | 1,868,259 | 42,557,519 | 44,627,199 | 43,153,645 |
| Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
| meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.48) |
| microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
| google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
| tereshchenkoblue-tokenizer (ours) | 37,277,244 (1.628 🤩) | 47,315,375 (1.308) | 354,316,113 (1.788) | 4,400,824 (2.356) | 108,791,712 (2.556) | 112,179,836 (2.514) | 131,907,954 (3.057) |
Comments:

- lang-uk/malyuk: a significant improvement over the original Gemma 3.
- allenai/c4 (en): English tokenisation is unchanged (AllenAI C4 contains a small amount of mixed-language text).
- allenai/c4 (es, fr, it, de): Tereshchenko Blue retains all EU-language tokens, so the statistics stay the same apart from language-overlap effects.
- QIRIM/crh_monocorpus (Cyrillic): a significant improvement on QIRIM Cyrillic.
- allenai/c4 (ru): Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.
- allenai/c4 (bg, be): other Cyrillic languages, such as Bulgarian and Belarusian, decline only insignificantly.
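The toks/word figures above are fertility scores: total tokens emitted divided by total words in the corpus. A minimal sketch of that computation, assuming whitespace word splitting (the evaluation's exact word-counting rule is not stated here):

```python
# Sketch: tokens-per-word (fertility) for a tokenizer over a list of texts.
# Assumes whitespace word splitting; the benchmark's exact rule may differ.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    total_tokens = sum(
        len(tokenizer(t, add_special_tokens=False).input_ids) for t in texts
    )
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

tok = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
sample = ["Всі красиві зберігають оптимізм", "Мова має значення."]
print(f"toks/word: {fertility(tok, sample):.3f}")
```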


Initialisation of embeddings for new tokens in Gemma 3 models

Many tokens are identical to those in the original Gemma 3 tokenizer, so their embedding rows can be reused as-is. For the newly added tokens, you can initialise embeddings with tools such as FOCUS and ZeTT. The simplest, and often effective, alternative is to initialise the new embeddings randomly and train them with a warm-up schedule, as sketched below.
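A minimal PyTorch sketch of that random-initialisation route, assuming a text-only Gemma-3 checkpoint; the mean-plus-noise initialisation and the std value are illustrative choices, not the recipe prescribed here. Because the vocabulary size is unchanged, no embedding resize is needed: new tokens simply occupy the IDs freed by the pruned ones.

```python
# Sketch (assumption-laden): reuse embedding rows for preserved token IDs,
# randomly initialise the rows whose IDs now hold new Ukrainian tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")  # text-only, for illustration
new_tok = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")

# IDs whose token changed: these rows no longer mean what they used to.
base_vocab = base_tok.get_vocab()
new_ids = [i for t, i in new_tok.get_vocab().items() if t not in base_vocab]

emb = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)
with torch.no_grad():
    mean = emb.mean(dim=0)
    for i in new_ids:
        # std=0.02 is a common default, not the authors' prescription
        emb[i] = mean + 0.02 * torch.randn_like(mean)
```

During continued pre-training, these rows are then trained with the warm-up schedule mentioned above, while rows for unchanged IDs start from their pretrained values.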

P.S.

In my opinion, Ukraine’s language-tech orientation toward the EU and the English-speaking world makes the tokens cut from the original Gemma-3 tokenizer a lower priority for any future national LLM.

Citation

BibTeX:

@misc{zaduha2025post9194,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9194 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9194}",
  month        = jun,
  year         = {2025},
  note         = "[Online; accessed 26 June 2025]"
}