Tereshchenko Blue — Gemma‑3 tokenizer faceted to let Ukrainian shine.

By adding more than 80K Ukrainian tokens without removing any English or EU-language tokens, Tereshchenko Blue makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
How is this possible?
Tokens from more than 16 of the world's most widely used writing systems were analyzed. Roughly four-fifths of tokens in scripts geographically and culturally distant from Ukraine—for example Bengali, Thai, Chinese, Japanese, and Korean—were pruned.
Replaced tokens
Writing system | Tokens removed | Tokens retained |
---|---|---|
Han (Chinese) | 16,488 | 4,122 |
Devanagari (Hindi) | 10,976 | 2,743 |
Bengali | 7,983 | 1,995 |
Arabic | 6,730 | 1,682 |
Hiragana / Katakana (Japanese) | 3,944 | 985 |
Hangul (Korean) | 3,744 | 935 |
Tamil | 3,080 | 770 |
Thai | 1,740 | 435 |
Malayalam | 1,566 | 391 |
Telugu | 1,428 | 356 |
Gujarati | 1,080 | 270 |
Kannada | 1,016 | 253 |
Ethiopic | 691 | 172 |
Hebrew | 670 | 167 |
Khmer | 481 | 119 |
Sinhala | 435 | 108 |
Myanmar | 410 | 102 |
Lao | 243 | 60 |
Gurmukhi | 215 | 53 |
Tibetan | 107 | 26 |
Oriya | 100 | 25 |
Cyrillic | 13,398 | 0 |
Gemma-3 <unused-*> | 6,139 | 102 |
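The per-script counts above rely on assigning each vocabulary entry to a writing system. A minimal sketch of such a classifier using only the standard library (the heuristic below, based on Unicode character names, is an illustration, not the exact procedure used for the table):

```python
import unicodedata

SCRIPTS = ("CJK", "DEVANAGARI", "BENGALI", "ARABIC", "HIRAGANA",
           "KATAKANA", "HANGUL", "THAI", "CYRILLIC", "LATIN")

def dominant_script(token: str) -> str:
    """Classify a token by the Unicode name of its first letter-like character."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in SCRIPTS:
                if script in name:
                    return script
    return "OTHER"

print(dominant_script("人工"))     # CJK
print(dominant_script("привіт"))   # CYRILLIC
print(dominant_script("hello"))    # LATIN
```

Running this over the full vocabulary and counting per script would yield removal candidates analogous to the table above.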
Feature Overview:
- +81,492 new Cyrillic BPE tokens from malyuk_qirim_tokenizer.json, trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic slice of the Crimean Tatar corpus.
- Only tokens from the "Replaced tokens" table above were replaced; tokens from all other writing systems were left untouched.
- Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
- Vocabulary size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [55939, 124769, 117298, 199258], only 4 tokens 💪🏻
```
Metrics
Acknowledgement: evaluation results provided by @Sofetory.
Evaluation corpora:

Corpus | Texts | Words |
---|---|---|
lang-uk/malyuk | 100k | 22,898,164 |
allenai/c4 (en) | 100k | 36,170,971 |
allenai/c4 (es, fr, it, de) | 400k | 198,173,216 |
QIRIM/crh_monocorpus (Cyrillic) | 94 | 1,868,259 |
allenai/c4 (ru) | 100k | 42,557,519 |
allenai/c4 (bg) | 100k | 44,627,199 |
allenai/c4 (be) | 100k | 43,153,645 |

Total tokens (toks/word) per tokenizer and corpus:

Tokenizer | malyuk | c4 (en) | c4 (es, fr, it, de) | crh (Cyrillic) | c4 (ru) | c4 (bg) | c4 (be) |
---|---|---|---|---|---|---|---|
Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.480) |
microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
tereshchenkoblue-tokenizer (Ours) | 37,277,244 (1.628 🤩) | 47,315,375 (1.308) | 354,316,113 (1.788) | 4,400,824 (2.356) | 108,791,712 (2.556) | 112,179,836 (2.514) | 131,907,954 (3.057) |

Notes per corpus:
- Ukrainian (malyuk): significant improvement over the original Gemma-3.
- English (c4 en): tokenisation is effectively unchanged (AllenAI C4 contains a small amount of mixed-language text).
- EU languages (c4 es, fr, it, de): Tereshchenko Blue retains all EU-language tokens, so the statistics stay the same apart from language-overlap effects.
- QIRIM Cyrillic: significant improvement.
- Russian (c4 ru): efficiency drops owing to the Ukrainian-centric changes, but still beats Qwen.
- Other Cyrillic languages (Bulgarian, Belarusian): drop only slightly.
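The toks/word figure (often called fertility) is simply total token count divided by total word count; lower is better. A minimal sketch that reproduces two cells from the table above (corpus loading and tokenisation omitted, whitespace word splitting assumed):

```python
def fertility(total_tokens: int, total_words: int) -> float:
    """Tokens emitted per whitespace-separated word; lower is better."""
    return round(total_tokens / total_words, 3)

# Reproducing two Malyuk cells from the table above:
print(fertility(37_277_244, 22_898_164))  # 1.628 (Ours)
print(fertility(57_388_402, 22_898_164))  # 2.506 (gemma-3-12b-it)
```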
Contents
tokenizer.json: Byte‐level tokenizer spec (vocab, merges, model settings).
malyuk_qirim_tokenizer.json: Gemma-3-style tokenizer trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic QIRIM corpus (3× oversampled).
merge_info.json: Lists the replaced Gemma-3 token IDs and the IDs of the added Malyuk tokens in malyuk_qirim_tokenizer.
tokenizer_config.json: Configuration metadata.
special_tokens_map.json: Mapping of special tokens (identical to Gemma-3).
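A sketch of how merge_info.json might be consumed to pair freed Gemma-3 slots with the Malyuk tokens that now occupy them. The field names below are hypothetical; check the actual file for its real structure:

```python
import json  # in practice: merge_info = json.load(open("merge_info.json"))

# Hypothetical structure, with illustrative IDs:
merge_info = {
    "replaced_gemma_ids": [5000, 5001, 5002],  # slots freed by pruning
    "added_malyuk_ids":   [120, 121, 122],     # IDs in malyuk_qirim_tokenizer
}

# Map each reused Gemma-3 slot to the Malyuk token that now occupies it.
slot_map = dict(zip(merge_info["replaced_gemma_ids"],
                    merge_info["added_malyuk_ids"]))
print(slot_map)  # {5000: 120, 5001: 121, 5002: 122}
```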
Initialisation of embeddings for new tokens in Gemma 3 models
Some tokens are identical to those in the original Gemma-3 tokenizer, so their embeddings can be reused directly. For the newly added tokens, you can initialise embeddings with tools such as FOCUS and ZeTT. The simplest, and often effective, alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
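As a concrete illustration of the random-initialisation option, new rows can be drawn from a normal distribution matched to the statistics of the existing embedding matrix, a common heuristic. NumPy stands in for the actual model tensors, and the sizes and token IDs below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a Gemma-3 embedding matrix (the real one is ~262144 x d_model).
old_emb = rng.normal(0.0, 0.02, size=(1000, 64)).astype(np.float32)

# IDs of the replaced slots that need fresh embeddings (illustrative).
new_token_ids = [10, 42, 77]

# Draw new rows from N(mean, std) of the existing embeddings;
# unchanged rows keep their original Gemma-3 vectors.
mu, sigma = old_emb.mean(), old_emb.std()
new_emb = old_emb.copy()
new_emb[new_token_ids] = rng.normal(
    mu, sigma, size=(len(new_token_ids), old_emb.shape[1])
)
```

After this, the new rows are typically trained with a warm-up schedule while the copied rows stay close to their pretrained values.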
P.S.
In my opinion, Ukraine’s language-tech orientation toward the EU and the English-speaking world makes the tokens cut from the original Gemma-3 tokenizer a lower priority for any future national LLM.
Citation
BibTeX:
@misc{zaduha2025post9194,
  author       = {Bohdan Didenko},
  title        = {Post \#9194 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9194}},
  month        = jun,
  year         = {2025},
  note         = {[Online; accessed 26 June 2025]}
}