---
license: apache-2.0
language:
- en
- sv
- de
- fr
- es
- it
- fi
- bg
- cs
- da
- el
- et
- hr
- hu
- ga
- lv
- lt
- mt
- nl
- pl
- pt
- ro
- sl
- sk
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
library_name: tokenizers
---
|
|
|
# Model Card for oellm-tokenizer-262k-v1 |
|
|
|
## Model Details |
|
|
|
This is a Byte-Pair Encoding (BPE) tokenizer. |
|
|
|
- **Model Type**: BPE Tokenizer |
|
- **Vocabulary Size**: 262,144 |
|
- **Special Tokens**: `<pad>`, `<eos>`, `<bos>` |
|
- **Compatibility**: Designed for Gemma3-style models; a loading example is shown below.
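
The tokenizer ships as a standard `tokenizers` artifact, so it can be loaded and inspected directly. A minimal sketch, assuming the tokenizer is published on the Hugging Face Hub (the repository id below is a placeholder):

```python
from tokenizers import Tokenizer

# Placeholder repository id; replace with the actual Hub path of this tokenizer.
tok = Tokenizer.from_pretrained("your-org/oellm-tokenizer-262k-v1")

encoding = tok.encode("Hej världen! Hello, world!")
print(encoding.tokens)  # subword tokens
print(encoding.ids)     # vocabulary ids (vocabulary size 262,144)

# The documented special tokens should resolve to ids in the vocabulary.
print(tok.token_to_id("<bos>"), tok.token_to_id("<eos>"), tok.token_to_id("<pad>"))
```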
|
|
|
## Intended Uses & Limitations |
|
|
|
This tokenizer is intended for researchers and developers working on pre-training or fine-tuning language models for European languages and code. It is not a model and cannot be used for inference on its own. |
|
|
|
## Training Data |
|
|
|
The tokenizer was trained on a **~800 GB** subset randomly sampled from a 1.2 TB corpus drawn from the two datasets described below. The data mixture was designed to provide broad coverage of European languages and high-quality English text.
|
|
|
The primary data sources were: |
|
- **Nemotron-CC**: High-quality English data from Common Crawl. |
|
- **HPLT v2.0**: Multilingual data from the High Performance Language Technologies project, focusing on languages prioritized by the OpenEuroLLM initiative. |
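
The exact subsampling procedure is not documented here; the sketch below only illustrates document-level random sampling at roughly the stated ratio (~800 GB out of 1.2 TB), with all names chosen for illustration.

```python
import random
from typing import Iterable, Iterator

# Illustrative only: keep roughly two thirds of documents (~800 GB of 1.2 TB).
TARGET_FRACTION = 800 / 1200

def sample_documents(documents: Iterable[str],
                     fraction: float = TARGET_FRACTION,
                     seed: int = 0) -> Iterator[str]:
    """Keep each document independently with the given probability."""
    rng = random.Random(seed)
    for doc in documents:
        if rng.random() < fraction:
            yield doc
```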
|
|
|
## Training Procedure |
|
|
|
The tokenizer was trained with the Hugging Face `tokenizers` library on a single LUMI-C node with 128 CPU cores and 1 TB of RAM.
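
The full training configuration (normalization, pre-tokenization, corpus iteration) is not specified in this card. The following is a minimal sketch of a BPE training run with the `tokenizers` library that only mirrors the documented vocabulary size and special tokens; everything else is an assumption.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Minimal sketch; the byte-level pre-tokenizer is an assumption, not a documented setting.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=262_144,
    special_tokens=["<pad>", "<eos>", "<bos>"],
)

# `corpus_files` stands in for the ~800 GB training subset (paths are illustrative).
corpus_files = ["data/shard-000.txt"]
tokenizer.train(corpus_files, trainer)
tokenizer.save("tokenizer.json")
```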
|
|
|
## Overall Average Fertility Across All Languages Tested

(Lower is better.)

| Tokenizer | Average fertility |
|---|---|
| oellm-262k-v1 | 1.99 |
| gpt-oss-20b | 2.06 |
| gemma3-4b-it | 2.18 |
| Teuken7B | 2.52 |
| Llama-3-8B | 2.71 |
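
Fertility is commonly computed as the average number of tokens produced per word; the evaluation corpus and word-splitting rule behind the numbers above are not specified here. A minimal sketch under the assumption of whitespace word splitting (the repository id is a placeholder):

```python
from tokenizers import Tokenizer

def fertility(tokenizer: Tokenizer, texts: list[str]) -> float:
    """Average tokens per whitespace-delimited word (lower is better)."""
    total_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Illustrative usage; replace the repo id and texts with a real evaluation set.
tok = Tokenizer.from_pretrained("your-org/oellm-tokenizer-262k-v1")
print(round(fertility(tok, ["Hej världen!", "Wie geht es dir heute?"]), 2))
```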
|
|