Repo Description

This repository hosts a frequency‐filtered inventory of byte-level sub-tokens extracted from the Malyuk Ukrainian corpus (38.9 M lines).
The tokenizer is derived from the Aya Expanse tokenizer and keeps all of Aya's special tokens.

Every sub-token with a total corpus count ≥ 500 is kept, resulting in 654 023 unique entries.

Note: This is not a plug-and-play LLM tokenizer, but rather a raw statistical resource.

Simple example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/malyuk-uk-bpe-654k"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [11961, 41218, 33300, 63514]
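
The ids decode back to the original string; convert_ids_to_tokens shows the underlying sub-tokens (rendered in the byte-level alphabet, so Cyrillic text appears as byte characters):

print(tokenizer.decode(toks.input_ids))                 # "Всі красиві зберігають оптимізм"
print(tokenizer.convert_ids_to_tokens(toks.input_ids))  # the four byte-level sub-token strings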

Contents

  • tokenizer.json Byte‐level tokenizer spec (vocab, merges, model settings).

  • tokenizer_config.json Configuration metadata.

  • special_tokens_map.json Mapping of special tokens (the same as Aya's).

  • readable_tokenizer_utf8.json Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
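
A minimal inspection sketch for readable_tokenizer_utf8.json; the JSON keys below are an assumption based on the standard tokenizer layout, so adjust them to the actual structure of the file:

import json

with open("readable_tokenizer_utf8.json", encoding="utf-8") as f:
    readable = json.load(f)

vocab = readable["vocab"]            # assumed layout: {UTF-8 sub-token: id}
print(len(vocab))                    # expected: 654 023 entries plus special tokens
print(sorted(vocab, key=len)[-10:])  # the ten longest decoded sub-tokens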

Why publish a frequency list?

  1. Bootstrapping smaller/custom tokenizers

    • Start from this core if you only need, say, the top 256_000 or top 50_256 sub-tokens: simply truncate the tail of the vocabulary in tokenizer.json. Aya’s special tokens remain intact at the head (see the sketch after this list).
    • Merge or interleave these Ukrainian sub-tokens with other language vocabularies to build UK-centric multi-language tokenizers.
  2. Computational-linguistic analyses (see readable_tokenizer_utf8.json)

    • Zipf curve plotting, type–token ratio studies, morphological productivity analysis.
    • Stop-word and keyword list extraction.
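
A rough sketch of the truncation idea from point 1, assuming the standard Hugging Face tokenizers layout (the BPE vocabulary and merges live under model.vocab and model.merges inside tokenizer.json, ids are contiguous, and Aya's special tokens sit at the low ids); merges whose parts or result fall outside the kept vocabulary must be dropped as well:

import json

TOP_N = 256_000  # hypothetical target vocabulary size

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

# Keep only the first TOP_N ids; ids are assumed contiguous, so no remapping is needed.
vocab = tok["model"]["vocab"]
kept = {t: i for t, i in vocab.items() if i < TOP_N}

def merge_parts(m):
    # Merges are serialized either as "a b" strings or as [a, b] pairs,
    # depending on the tokenizers version.
    return m.split(" ", 1) if isinstance(m, str) else list(m)

kept_merges = [
    m for m in tok["model"]["merges"]
    if all(p in kept for p in merge_parts(m)) and "".join(merge_parts(m)) in kept
]

tok["model"]["vocab"] = kept
tok["model"]["merges"] = kept_merges

with open("tokenizer_truncated.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)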

Training the Aya-based Ukrainian tokenizer

Below is the Python script we used to shuffle the corpus, filter by frequency (≥ 500), and train the byte-level BPE tokenizer:

import os
from datasets import load_dataset
from tokenizers.pre_tokenizers import ByteLevel
from transformers import AutoTokenizer

os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Hyper-parameters
MAX_VOCAB_SIZE  = 1_000_000
CORPUS_NAME     = "lang-uk/malyuk"
SEED            = 42
TEST_SET_SIZE   = 100_000
MIN_FREQUENCY   = 500
TOKENIZER_PATH  = "./malyuk_uk_tokenizer"

# 1) Load base Aya tokenizer and corpus
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
full_ds = load_dataset(CORPUS_NAME, split="train", cache_dir="./ds")
ds = full_ds.remove_columns([c for c in full_ds.column_names if c != "text"])
ds = ds.shuffle(seed=SEED)

# 2) Skip the first TEST_SET_SIZE examples
ds = ds.select(range(TEST_SET_SIZE, len(ds)))

# 3) Define streaming iterator
def batch_iterator(dataset, batch_size=500_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

# 4) Train new tokenizer from iterator
new_tok = tokenizer.train_new_from_iterator(
    batch_iterator(ds),
    vocab_size=MAX_VOCAB_SIZE,
    length=len(ds),
    new_special_tokens=list(tokenizer.added_tokens_encoder.keys()),
    min_frequency=MIN_FREQUENCY,
    initial_alphabet=ByteLevel.alphabet()
)

# 5) Save locally
new_tok.save_pretrained(TOKENIZER_PATH)

# 6) Small test
malyuk_uk_tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
test_dataset = full_ds.select(range(0, TEST_SET_SIZE))

def tokenize_wrapper(tokenizer):
    def batch_fn(examples):
        outputs = tokenizer(
            examples["text"],
            padding=False,
            truncation=False,
        )
        # list of token-counts, one per example
        return {"tokens_count": [len(ids) for ids in outputs["input_ids"]]}
    return batch_fn

ds = test_dataset.map(tokenize_wrapper(malyuk_uk_tokenizer), batched=True, batch_size=20_000)
print(f"malyuk_uk_tokenizer tokens count for 100_000 malyuk texts: {sum(ds['tokens_count'])}")

Test results:

Tokenizer       | Tokens for 100 000 texts
----------------|-------------------------
Malyuk (custom) | 33 959 222
Aya Expanse-32B | 49 609 840

Please note: these are total token counts for the sample (the custom tokenizer produces about 31.5 % fewer tokens than Aya Expanse on these texts); per-word averages would be a more robust metric for a future comparison.
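
A sketch of such a per-word measurement (token fertility), assuming whitespace splitting is an acceptable word approximation for Ukrainian:

total_words  = sum(len(text.split()) for text in test_dataset["text"])
total_tokens = sum(ds["tokens_count"])
print(f"avg tokens per whitespace-separated word: {total_tokens / total_words:.3f}")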

Citation

BibTeX:

@misc{zaduha2025post9138,
  author       = {Bohdan Didenko},
  title        = {Post \#9138 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9138}},
  month        = may,
  year         = {2025},
  note         = {[Online; accessed 22 May 2025]}
}