Tereshchenko Blue — Gemma‑3 tokenizer faceted to let Ukrainian shine.

By adding more than 80K Ukrainian tokens without removing any English or EU-language tokens, Tereshchenko Blue makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
How is this possible?
Tokens from more than 16 of the world's most widely used writing systems were analyzed. Roughly four-fifths of tokens in scripts geographically and culturally distant from Ukraine—for example Bengali, Thai, Chinese, Japanese, and Korean—were pruned.
Replaced tokens
Writing system | Tokens removed | Tokens retained |
---|---|---|
Han (Chinese) | 16,488 | 4,122 |
Devanagari (Hindi) | 10,976 | 2,743 |
Bengali | 7,983 | 1,995 |
Arabic | 6,730 | 1,682 |
Hiragana / Katakana (Japanese) | 3,944 | 985 |
Hangul (Korean) | 3,744 | 935 |
Tamil | 3,080 | 770 |
Thai | 1,740 | 435 |
Malayalam | 1,566 | 391 |
Telugu | 1,428 | 356 |
Gujarati | 1,080 | 270 |
Kannada | 1,016 | 253 |
Ethiopic | 691 | 172 |
Hebrew | 670 | 167 |
Khmer | 481 | 119 |
Sinhala | 435 | 108 |
Myanmar | 410 | 102 |
Lao | 243 | 60 |
Gurmukhi | 215 | 53 |
Tibetan | 107 | 26 |
Oriya | 100 | 25 |
Cyrillic | 13,398 | 0 |
Gemma-3 <unused-*> | 6,139 | 102 |
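The per-script counts above rely on assigning each vocabulary entry to a writing system. A minimal sketch of such a classifier using only the standard library (the heuristic below, based on Unicode character names, is an illustration, not the exact procedure used for the table):

```python
import unicodedata

SCRIPTS = ("CJK", "DEVANAGARI", "BENGALI", "ARABIC", "HIRAGANA",
           "KATAKANA", "HANGUL", "THAI", "CYRILLIC", "LATIN")

def dominant_script(token: str) -> str:
    """Classify a token by the Unicode name of its first letter-like character."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in SCRIPTS:
                if script in name:
                    return script
    return "OTHER"

print(dominant_script("人工"))     # CJK
print(dominant_script("привіт"))   # CYRILLIC
print(dominant_script("hello"))    # LATIN
```

Running this over the full vocabulary and counting per script would yield removal candidates analogous to the table above.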
Feature Overview:
- +81,492 new Cyrillic BPE tokens from malyuk_qirim_tokenizer.json, trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic slice of the Crimean Tatar corpus.
- Only tokens from the "Replaced tokens" table above were replaced; tokens from all other writing systems were left untouched.
- Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
- Vocabulary size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [55939, 124769, 117298, 199258], only 4 tokens 💪🏻
```
Metrics
Acknowledgement: evaluation results provided by @Sofetory.
Evaluation corpora:

Corpus | Texts | Words |
---|---|---|
lang-uk/malyuk | 100k | 22,898,164 |
allenai/c4 (en) | 100k | 36,170,971 |
allenai/c4 (es, fr, it, de) | 400k | 198,173,216 |
QIRIM/crh_monocorpus (Cyrillic) | 94 | 1,868,259 |
allenai/c4 (ru) | 100k | 42,557,519 |
allenai/c4 (bg) | 100k | 44,627,199 |
allenai/c4 (be) | 100k | 43,153,645 |

Total tokens (toks/word) per tokenizer and corpus:

Tokenizer | malyuk | c4 (en) | c4 (es, fr, it, de) | crh (Cyrillic) | c4 (ru) | c4 (bg) | c4 (be) |
---|---|---|---|---|---|---|---|
Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.480) |
microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
tereshchenkoblue-tokenizer (Ours) | 37,277,244 (1.628 🤩) | 47,315,375 (1.308) | 354,316,113 (1.788) | 4,400,824 (2.356) | 108,791,712 (2.556) | 112,179,836 (2.514) | 131,907,954 (3.057) |

Notes per corpus:
- Ukrainian (malyuk): significant improvement over the original Gemma-3.
- English (c4 en): tokenisation is effectively unchanged (AllenAI C4 contains a small amount of mixed-language text).
- EU languages (c4 es, fr, it, de): Tereshchenko Blue retains all EU-language tokens, so the statistics stay the same apart from language-overlap effects.
- QIRIM Cyrillic: significant improvement.
- Russian (c4 ru): efficiency drops owing to the Ukrainian-centric changes, but still beats Qwen.
- Other Cyrillic languages (Bulgarian, Belarusian): drop only slightly.
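The toks/word figure (often called fertility) is simply total token count divided by total word count; lower is better. A minimal sketch that reproduces two cells from the table above (corpus loading and tokenisation omitted, whitespace word splitting assumed):

```python
def fertility(total_tokens: int, total_words: int) -> float:
    """Tokens emitted per whitespace-separated word; lower is better."""
    return round(total_tokens / total_words, 3)

# Reproducing two Malyuk cells from the table above:
print(fertility(37_277_244, 22_898_164))  # 1.628 (Ours)
print(fertility(57_388_402, 22_898_164))  # 2.506 (gemma-3-12b-it)
```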
Contents
tokenizer.json: Byte‐level tokenizer spec (vocab, merges, model settings).
malyuk_qirim_tokenizer.json: Gemma-3-style tokenizer trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic QIRIM corpus (3× oversampled).
merge_info.json: Lists the replaced Gemma-3 token IDs and the IDs of the added Malyuk tokens in malyuk_qirim_tokenizer.
tokenizer_config.json: Configuration metadata.
special_tokens_map.json: Mapping of special tokens (identical to Gemma-3).
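A sketch of how merge_info.json might be consumed to pair freed Gemma-3 slots with the Malyuk tokens that now occupy them. The field names below are hypothetical; check the actual file for its real structure:

```python
import json  # in practice: merge_info = json.load(open("merge_info.json"))

# Hypothetical structure, with illustrative IDs:
merge_info = {
    "replaced_gemma_ids": [5000, 5001, 5002],  # slots freed by pruning
    "added_malyuk_ids":   [120, 121, 122],     # IDs in malyuk_qirim_tokenizer
}

# Map each reused Gemma-3 slot to the Malyuk token that now occupies it.
slot_map = dict(zip(merge_info["replaced_gemma_ids"],
                    merge_info["added_malyuk_ids"]))
print(slot_map)  # {5000: 120, 5001: 121, 5002: 122}
```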
Initialisation of embeddings for new tokens in Gemma 3 models
Some tokens are identical to those in the original Gemma-3 tokenizer, so their embeddings can be reused directly. For the newly added tokens, you can initialise embeddings with tools such as FOCUS and ZeTT. The simplest, and often effective, alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
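As a concrete illustration of the random-initialisation option, new rows can be drawn from a normal distribution matched to the statistics of the existing embedding matrix, a common heuristic. NumPy stands in for the actual model tensors, and the sizes and token IDs below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a Gemma-3 embedding matrix (the real one is ~262144 x d_model).
old_emb = rng.normal(0.0, 0.02, size=(1000, 64)).astype(np.float32)

# IDs of the replaced slots that need fresh embeddings (illustrative).
new_token_ids = [10, 42, 77]

# Draw new rows from N(mean, std) of the existing embeddings;
# unchanged rows keep their original Gemma-3 vectors.
mu, sigma = old_emb.mean(), old_emb.std()
new_emb = old_emb.copy()
new_emb[new_token_ids] = rng.normal(
    mu, sigma, size=(len(new_token_ids), old_emb.shape[1])
)
```

After this, the new rows are typically trained with a warm-up schedule while the copied rows stay close to their pretrained values.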
P.S.
In my opinion, Ukraine’s language-tech orientation toward the EU and the English-speaking world makes the tokens cut from the original Gemma-3 tokenizer a lower priority for any future national LLM.
Citation
BibTeX:
@misc{zaduha2025post9194,
  author       = {Bohdan Didenko},
  title        = {Post \#9194 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9194}},
  month        = jun,
  year         = {2025},
  note         = {[Online; accessed 26 June 2025]}
}