Repo Description
This repository hosts a frequency‐filtered inventory of byte-level sub-tokens extracted from the Malyuk Ukrainian corpus (38.9 M lines).
The tokenizer inherits the Aya Expanse tokenizer; all of Aya's special tokens are included.
Only sub-tokens with a total corpus count ≥ 500 are kept, resulting in 654 023 unique entries.
Note: This is not a plug-and-play LLM tokenizer, but rather a raw statistical resource.
Simple example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/malyuk-uk-bpe-654k"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [11961, 41218, 33300, 63514]
Contents
- tokenizer.json: byte-level tokenizer spec (vocab, merges, model settings).
- tokenizer_config.json: configuration metadata.
- special_tokens_map.json: mapping of special tokens (same as Aya).
- readable_tokenizer_utf8.json: human-readable dump of UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
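For a quick look inside the spec, a minimal inspection snippet; it assumes only the standard tokenizers-library JSON layout of tokenizer.json:

import json

# Load the serialized tokenizer spec (standard tokenizers-library layout assumed)
with open("tokenizer.json", encoding="utf-8") as f:
    spec = json.load(f)

vocab = spec["model"]["vocab"]    # sub-token -> id
merges = spec["model"]["merges"]  # learned merge rules
print(len(vocab), "sub-tokens,", len(merges), "merges")
print(list(vocab.items())[:5])    # first few entries (byte-level encoded strings)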
Why publish a frequency list?
Bootstrapping smaller/custom tokenizers
- Start from this core if you only need, say, the top 256 000 or top 50 256 sub-tokens: simply truncate the tail of the vocabulary (see the sketch after this list). Aya's special tokens remain intact at the head.
- Merge or interleave these Ukrainian sub-tokens with other language vocabularies to build UK-centric multilingual tokenizers.
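As a rough illustration of the truncation idea, here is a minimal sketch. It assumes the standard tokenizers-library tokenizer.json layout; TOP_N and the output file name are placeholder choices, not part of this repo:

import json

TOP_N = 256_000  # hypothetical target size; keep whatever head of the vocab you need

with open("tokenizer.json", encoding="utf-8") as f:
    spec = json.load(f)

vocab = spec["model"]["vocab"]    # sub-token -> id, ids assigned in rank order
merges = spec["model"]["merges"]  # learned merge rules

# Keep only the first TOP_N ids; special tokens are assumed to sit at the head, as noted above
kept = {tok: i for tok, i in vocab.items() if i < TOP_N}

def parts(merge):
    # merges are "left right" strings in older tokenizers versions, ["left", "right"] pairs in newer ones
    return merge.split(" ", 1) if isinstance(merge, str) else merge

# A merge rule is only valid if both sides and the merged result survive the cut
spec["model"]["vocab"] = kept
spec["model"]["merges"] = [
    m for m in merges
    if all(p in kept for p in parts(m)) and "".join(parts(m)) in kept
]

with open("tokenizer_truncated.json", "w", encoding="utf-8") as f:
    json.dump(spec, f, ensure_ascii=False)

The truncated spec can then be wrapped with transformers' PreTrainedTokenizerFast(tokenizer_file="tokenizer_truncated.json") to get a smaller working tokenizer.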
Computational-linguistic analyses (see readable_tokenizer_utf8.json)
- Zipf curve plotting, type–token ratio studies, morphological productivity analysis (a rough sketch follows this list).
- Stop-word and keyword lists.
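For example, a rank-frequency (Zipf) curve can be approximated by re-tokenizing a slice of the corpus and counting sub-token occurrences. The sketch below makes arbitrary choices for the sample size and uses matplotlib for plotting:

from collections import Counter

import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/malyuk-uk-bpe-654k")
# The full corpus is large; a 50 000-text slice is an arbitrary sample for the curve
sample = load_dataset("lang-uk/malyuk", split="train").select(range(50_000))

counts = Counter()
for batch in sample.iter(batch_size=10_000):
    for ids in tokenizer(batch["text"], add_special_tokens=False)["input_ids"]:
        counts.update(ids)

freqs = sorted(counts.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs)  # Zipf's law shows up as a roughly straight line
plt.xlabel("rank")
plt.ylabel("frequency")
plt.savefig("zipf_malyuk.png")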
Training the Aya-based Ukrainian tokenizer
Below is the Python script we used to shuffle the corpus, filter by frequency (≥ 500), and train the byte-level BPE tokenizer:
import os
from datasets import load_dataset
from tokenizers.pre_tokenizers import ByteLevel
from transformers import AutoTokenizer
os.environ["TOKENIZERS_PARALLELISM"] = "true"
# Hyper-parameters
MAX_VOCAB_SIZE = 1_000_000
CORPUS_NAME = "lang-uk/malyuk"
SEED = 42
TEST_SET_SIZE = 100_000
MIN_FREQUENCY = 500
TOKENIZER_PATH = "./malyuk_uk_tokenizer"
# 1) Load base Aya tokenizer and corpus
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
full_ds = load_dataset(CORPUS_NAME, split="train", cache_dir="./ds")
ds = full_ds.remove_columns([c for c in full_ds.column_names if c != "text"])
ds = ds.shuffle(seed=SEED)
# 2) Skip the first TEST_SET_SIZE examples
ds = ds.select(range(TEST_SET_SIZE, len(ds)))
# 3) Define streaming iterator
def batch_iterator(dataset, batch_size=500_000):
for batch in dataset.iter(batch_size=batch_size):
yield batch["text"]
# 4) Train new tokenizer from iterator
new_tok = tokenizer.train_new_from_iterator(
batch_iterator(ds),
vocab_size=MAX_VOCAB_SIZE,
length=len(ds),
new_special_tokens=list(tokenizer.added_tokens_encoder.keys()),
min_frequency=MIN_FREQUENCY,
initial_alphabet=ByteLevel.alphabet()
)
# 5) Save locally
new_tok.save_pretrained(TOKENIZER_PATH)
# 6) Small test
malyuk_uk_tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
test_dataset = full_ds.select(range(0, TEST_SET_SIZE))
def tokenize_wrapper(tokenizer):
def batch_fn(examples):
outputs = tokenizer(
examples["text"],
padding=False,
truncation=False,
)
# list of token-counts, one per example
return {"tokens_count": [len(ids) for ids in outputs["input_ids"]]}
return batch_fn
test_ds = test_dataset.map(tokenize_wrapper(malyuk_uk_tokenizer), batched=True, batch_size=20_000)
print(f"malyuk_uk_tokenizer tokens count for 100_000 malyuk texts: {sum(test_ds['tokens_count'])}")
Test results:
| Tokenizer | Tokens for 100 000 texts |
|---|---|
| Malyuk (custom) | 33 959 222 |
| Aya Expanse-32B | 49 609 840 |
Please note: these are total token counts for the sample; measuring per-word averages would be a more accurate comparison in future work.
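A per-word figure could be computed along the following lines; whitespace word splitting and the 100 000-text sample are simplifying assumptions, and the same loop would be repeated with the Aya tokenizer to get a comparable number:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/malyuk-uk-bpe-654k")
# Same first 100 000 texts as the test above
sample = load_dataset("lang-uk/malyuk", split="train").select(range(100_000))

total_tokens, total_words = 0, 0
for batch in sample.iter(batch_size=20_000):
    encodings = tokenizer(batch["text"], add_special_tokens=False)
    total_tokens += sum(len(ids) for ids in encodings["input_ids"])
    total_words += sum(len(text.split()) for text in batch["text"])  # naive whitespace words

print(f"average tokens per word: {total_tokens / total_words:.3f}")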
Citation
BibTeX:
@misc{zaduha2025post9138,
  author       = {Bohdan Didenko},
  title        = {{Post \#9138 on Telegram Channel Zaduha}},
  howpublished = {\url{https://t.me/zaduha/9138}},
  month        = may,
  year         = {2025},
  note         = {[Online; accessed 22 May 2025]}
}