Repo Description
This repository hosts a frequency-filtered inventory of byte-level sub-tokens extracted from the QIRIM/crh_monocorpus Crimean Tatar corpus.
The tokenizer is derived from the Aya Expanse tokenizer and keeps all of Aya's special tokens.
Only sub-tokens with a total corpus count of ≥ 6 were kept, yielding 50_256 unique entries.
Note: This is not a plug-and-play LLM tokenizer, but rather a raw statistical resource.
Simple example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/crh_monocorpus-bpe-50_256"
)
toks = tokenizer("Qırımtatarlar – halq olaraq Qırımda şekillendi.", add_special_tokens=False)
print(toks.input_ids)       # [15125, 3633, 654, 505, 3570, 6499, 16162, 8525, 22927, 50]
print(len(toks.input_ids))  # 10

# Compare with Aya Expanse
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
toks = tokenizer("Qırımtatarlar – halq olaraq Qırımda şekillendi.", add_special_tokens=False)
print(toks.input_ids)       # [56, 78927, 91, 9426, 2684, 2129, 12579, 88, 1691, 24713, 88, 2672, 67673, 107589, 23366, 1873, 15031, 21]
print(len(toks.input_ids))  # 18!
Contents
- tokenizer.json: byte-level tokenizer spec (vocab, merges, model settings).
- tokenizer_config.json: configuration metadata.
- special_tokens_map.json: mapping of special tokens (identical to Aya's).
- readable_tokenizer_utf8.json: human-readable dump of the UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection; a loading sketch follows below.
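To inspect the dump, it can be downloaded with huggingface_hub and opened as plain JSON. The snippet below is a minimal sketch: the exact schema of readable_tokenizer_utf8.json is not documented here, so it only prints the top-level structure.
import json
from huggingface_hub import hf_hub_download
# Fetch the human-readable dump from this repository
path = hf_hub_download(
    repo_id="transhumanist-already-exists/crh_monocorpus-bpe-50_256",
    filename="readable_tokenizer_utf8.json",
)
with open(path, encoding="utf-8") as f:
    dump = json.load(f)
# Peek at the top-level structure (keys if it is a dict, first items otherwise)
print(list(dump)[:10] if isinstance(dump, dict) else dump[:10])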
Why publish a frequency list?
Bootstrapping smaller/custom tokenizers
- Merge or interleave these QIRIM sub-tokens with other language vocabularies.
Computational-linguistic analyses (see readable_tokenizer_utf8.json)
- Zipf curve plotting, type–token ratio studies, morphological productivity analysis (a sketch follows below).
- Stop-word and keyword lists.
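As a starting point for such analyses, the sketch below re-tokenizes the corpus with this tokenizer to obtain sub-token frequencies (the repository files themselves do not ship counts) and draws a rank-frequency (Zipf) curve. It assumes the QIRIM/crh_monocorpus dataset exposes a text column, as in the training script further down, and that matplotlib is installed.
from collections import Counter
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/crh_monocorpus-bpe-50_256")
ds = load_dataset("QIRIM/crh_monocorpus", split="train")
# Count sub-token frequencies over the whole corpus
counts = Counter()
for batch in ds.iter(batch_size=1000):
    for ids in tokenizer(batch["text"], add_special_tokens=False).input_ids:
        counts.update(ids)
# The most frequent sub-tokens are stop-word / keyword candidates
print([tokenizer.decode([i]) for i, _ in counts.most_common(20)])
# Rank-frequency (Zipf) curve on log-log axes
freqs = sorted(counts.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.title("Sub-token Zipf curve, QIRIM/crh_monocorpus")
plt.show()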
Training the Aya-based Crimean Tatar tokenizer
Below is the Python script we used to shuffle the corpus, filter by frequency (≥ 6), and train the byte-level BPE tokenizer:
import os
from datasets import load_dataset
from tokenizers.pre_tokenizers import ByteLevel
from transformers import AutoTokenizer
os.environ["TOKENIZERS_PARALLELISM"] = "true"
# Hyper-parameters
MAX_VOCAB_SIZE = 50_256
CORPUS_NAME = "QIRIM/crh_monocorpus"
SEED = 42
MIN_FREQUENCY = 6
TOKENIZER_PATH = "./crh_monocorpus-bpe-50_256"
# 1) Load base Aya tokenizer and corpus
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
full_ds = load_dataset(CORPUS_NAME, split="train", cache_dir="./ds")
# 2) Keep only the text column and shuffle
ds = full_ds.remove_columns([c for c in full_ds.column_names if c != "text"])
ds = ds.shuffle(seed=SEED)
# 3) Define streaming iterator
def batch_iterator(dataset, batch_size=len(ds)):
for batch in dataset.iter(batch_size=batch_size):
yield batch["text"]
# 4) Train new tokenizer from iterator
new_tok = tokenizer.train_new_from_iterator(
batch_iterator(ds),
vocab_size=MAX_VOCAB_SIZE,
length=len(ds),
new_special_tokens=list(tokenizer.added_tokens_encoder.keys()),
min_frequency=MIN_FREQUENCY,
initial_alphabet=ByteLevel.alphabet()
)
# 5) Save locally
new_tok.save_pretrained(TOKENIZER_PATH)
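As a quick sanity check (not part of the original training script), the saved tokenizer can be reloaded and its vocabulary size and special tokens verified:
from transformers import AutoTokenizer

# Reload the freshly trained tokenizer from the local save path
check = AutoTokenizer.from_pretrained("./crh_monocorpus-bpe-50_256")
print(len(check))                # roughly MAX_VOCAB_SIZE; exact count depends on added special tokens
print(check.all_special_tokens)  # should match Aya's special tokens
print(check("Qırımtatarlar", add_special_tokens=False).input_ids)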
Citation
BibTeX:
@misc{zaduha2025post9143,
  author       = {Bohdan Didenko},
  title        = {Post \#9143 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9143}},
  month        = may,
  year         = {2025},
  note         = {[Online; accessed 24 May 2025]}
}