Repo Description
This repository hosts a frequency-filtered inventory of byte-level sub-tokens extracted from the QIRIM/crh_monocorpus Crimean Tatar corpus.
The tokenizer is derived from the Aya Expanse tokenizer and keeps all of Aya's special tokens.
Only sub-tokens with a total corpus count of ≥ 6 were kept, yielding 50_256 unique entries.
Note: This is not a plug-and-play LLM tokenizer, but rather a raw statistical resource.
Simple example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/crh_monocorpus-bpe-50_256"
)
toks = tokenizer("Qırımtatarlar – halq olaraq Qırımda şekillendi.", add_special_tokens=False)
print(toks.input_ids)       # [15125, 3633, 654, 505, 3570, 6499, 16162, 8525, 22927, 50]
print(len(toks.input_ids))  # 10

# Compare with Aya Expanse
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
toks = tokenizer("Qırımtatarlar – halq olaraq Qırımda şekillendi.", add_special_tokens=False)
print(toks.input_ids)       # [56, 78927, 91, 9426, 2684, 2129, 12579, 88, 1691, 24713, 88, 2672, 67673, 107589, 23366, 1873, 15031, 21]
print(len(toks.input_ids))  # 18!
Contents
- tokenizer.json: byte-level tokenizer spec (vocab, merges, model settings).
- tokenizer_config.json: configuration metadata.
- special_tokens_map.json: mapping of special tokens (identical to Aya's).
- readable_tokenizer_utf8.json: human-readable dump of the UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection; a loading sketch follows below.
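To inspect the dump, it can be downloaded with huggingface_hub and opened as plain JSON. The snippet below is a minimal sketch: the exact schema of readable_tokenizer_utf8.json is not documented here, so it only prints the top-level structure.
import json
from huggingface_hub import hf_hub_download
# Fetch the human-readable dump from this repository
path = hf_hub_download(
    repo_id="transhumanist-already-exists/crh_monocorpus-bpe-50_256",
    filename="readable_tokenizer_utf8.json",
)
with open(path, encoding="utf-8") as f:
    dump = json.load(f)
# Peek at the top-level structure (keys if it is a dict, first items otherwise)
print(list(dump)[:10] if isinstance(dump, dict) else dump[:10])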
Why publish a frequency list?
Bootstrapping smaller/custom tokenizers
- Merge or interleave these QIRIM sub-tokens with other language vocabularies.
Computational-linguistic analyses (see readable_tokenizer_utf8.json)
- Zipf curve plotting, type–token ratio studies, morphological productivity analysis (a sketch follows below).
- Stop-word and keyword lists.
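As a starting point for such analyses, the sketch below re-tokenizes the corpus with this tokenizer to obtain sub-token frequencies (the repository files themselves do not ship counts) and draws a rank-frequency (Zipf) curve. It assumes the QIRIM/crh_monocorpus dataset exposes a text column, as in the training script further down, and that matplotlib is installed.
from collections import Counter
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/crh_monocorpus-bpe-50_256")
ds = load_dataset("QIRIM/crh_monocorpus", split="train")
# Count sub-token frequencies over the whole corpus
counts = Counter()
for batch in ds.iter(batch_size=1000):
    for ids in tokenizer(batch["text"], add_special_tokens=False).input_ids:
        counts.update(ids)
# The most frequent sub-tokens are stop-word / keyword candidates
print([tokenizer.decode([i]) for i, _ in counts.most_common(20)])
# Rank-frequency (Zipf) curve on log-log axes
freqs = sorted(counts.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.title("Sub-token Zipf curve, QIRIM/crh_monocorpus")
plt.show()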
Training the Aya-based Crimean Tatar tokenizer
Below is the Python script we used to shuffle the corpus, filter by frequency (≥ 6), and train the byte-level BPE tokenizer:
import os
from datasets import load_dataset
from tokenizers.pre_tokenizers import ByteLevel
from transformers import AutoTokenizer
os.environ["TOKENIZERS_PARALLELISM"] = "true"
# Hyper-parameters
MAX_VOCAB_SIZE = 50_256
CORPUS_NAME = "QIRIM/crh_monocorpus"
SEED = 42
MIN_FREQUENCY = 6
TOKENIZER_PATH = "./crh_monocorpus-bpe-50_256"
# 1) Load base Aya tokenizer and corpus
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
full_ds = load_dataset(CORPUS_NAME, split="train", cache_dir="./ds")
# 2) Keep only the text column and shuffle
ds = full_ds.remove_columns([c for c in full_ds.column_names if c != "text"])
ds = ds.shuffle(seed=SEED)
# 3) Define streaming iterator
def batch_iterator(dataset, batch_size=len(ds)):
for batch in dataset.iter(batch_size=batch_size):
yield batch["text"]
# 4) Train new tokenizer from iterator
new_tok = tokenizer.train_new_from_iterator(
batch_iterator(ds),
vocab_size=MAX_VOCAB_SIZE,
length=len(ds),
new_special_tokens=list(tokenizer.added_tokens_encoder.keys()),
min_frequency=MIN_FREQUENCY,
initial_alphabet=ByteLevel.alphabet()
)
# 5) Save locally
new_tok.save_pretrained(TOKENIZER_PATH)
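As a quick sanity check (not part of the original training script), the saved tokenizer can be reloaded and its vocabulary size and special tokens verified:
from transformers import AutoTokenizer

# Reload the freshly trained tokenizer from the local save path
check = AutoTokenizer.from_pretrained("./crh_monocorpus-bpe-50_256")
print(len(check))                # roughly MAX_VOCAB_SIZE; exact count depends on added special tokens
print(check.all_special_tokens)  # should match Aya's special tokens
print(check("Qırımtatarlar", add_special_tokens=False).input_ids)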
Citation
BibTeX:
@misc{zaduha2025post9143,
  author       = {Bohdan Didenko},
  title        = {Post \#9143 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9143}},
  month        = may,
  year         = {2025},
  note         = {[Online; accessed 24 May 2025]}
}