Lapa v0.1.2 Release
Release of SOTA Ukrainian LLM and Datasets
By adding more than 80K Ukrainian tokens without removing any English or EU-language tokens, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
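A quick way to verify the fixed vocabulary size is to load the tokenizer and inspect its length. This is a minimal sketch assuming the tokenizer is published at `lapa-llm/tokenizer` (the repository used in the usage example further below); the exact count printed depends on how special tokens are registered.

```python
from transformers import AutoTokenizer

# Load the Lapa tokenizer (repository id taken from the usage example below)
tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

# The vocabulary should remain at the original Gemma-3 size (~256K entries),
# per the release notes above.
print(len(tokenizer))
```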
More than 16 of the world's most widely used writing systems were analyzed. Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned, as shown in the table below; a sketch of the script-classification step follows the table.
| Writing system | Tokens removed | Tokens retained |
|---|---|---|
| Han (Chinese) | 16,488 | 4,122 |
| Devanagari (Hindi) | 10,976 | 2,743 |
| Bengali | 7,983 | 1,995 |
| Arabic | 6,730 | 1,682 |
| Hiragana / Katakana (Japanese) | 3,944 | 985 |
| Hangul (Korean) | 3,744 | 935 |
| Tamil | 3,080 | 770 |
| Thai | 1,740 | 435 |
| Malayalam | 1,566 | 391 |
| Telugu | 1,428 | 356 |
| Gujarati | 1,080 | 270 |
| Kannada | 1,016 | 253 |
| Ethiopic | 691 | 172 |
| Hebrew | 670 | 167 |
| Khmer | 481 | 119 |
| Sinhala | 435 | 108 |
| Myanmar | 410 | 102 |
| Lao | 243 | 60 |
| Gurmukhi | 215 | 53 |
| Tibetan | 107 | 26 |
| Oriya | 100 | 25 |
| Cyrillic | 13,398 | 0 |
| Gemma-3 `<unused-*>` | 6,139 | 102 |
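The release does not publish the pruning code, but the script-classification step can be sketched with the third-party `regex` module, which understands Unicode script properties. The 20% retention ratio matches the table above, while the "keep the shortest tokens" tie-breaker is an assumption for illustration, not the release's actual selection rule.

```python
import regex  # pip install regex -- supports \p{Script=...} properties

# A subset of the scripts from the table above, distant from Ukraine
DISTANT_SCRIPTS = ["Han", "Devanagari", "Bengali", "Arabic", "Hangul", "Thai"]
PATTERNS = {s: regex.compile(rf"\p{{Script={s}}}") for s in DISTANT_SCRIPTS}

def bucket_by_script(vocab):
    """Group vocabulary entries by the first distant script they contain."""
    buckets = {s: [] for s in DISTANT_SCRIPTS}
    for token in vocab:
        for script, pattern in PATTERNS.items():
            if pattern.search(token):
                buckets[script].append(token)
                break
    return buckets

def prune_candidates(tokens, keep_ratio=0.2):
    """Return the ~80% of tokens to remove, keeping roughly one-fifth.

    Illustrative rule only: shorter tokens are assumed more reusable
    and are kept first.
    """
    kept = set(sorted(tokens, key=len)[: max(1, int(len(tokens) * keep_ratio))])
    return [t for t in tokens if t not in kept]

# Example usage against a loaded tokenizer:
# buckets = bucket_by_script(tokenizer.get_vocab().keys())
# removed = {s: prune_candidates(toks) for s, toks in buckets.items()}
```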
Only the tokens listed in the table were replaced; no tokens from any other writing system were affected.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

# "Всі красиві зберігають оптимізм" ≈ "All the beautiful keep their optimism"
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
`<think>` and `</think>` tokens are included for the hybrid reasoning approach. The expanded Ukrainian vocabulary also significantly speeds up tokenization of Ukrainian text.
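If `<think>` and `</think>` are registered as dedicated vocabulary entries, each should encode to a single id. This sketch, reusing the same `lapa-llm/tokenizer` repository as above, checks that assumption.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

for tag in ("<think>", "</think>"):
    ids = tokenizer(tag, add_special_tokens=False).input_ids
    # A single id means the tag is one dedicated token rather than
    # being split into pieces like "<", "think", ">".
    print(tag, ids)
```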