---
license: apache-2.0
language:
- en
- sv
- de
- fr
- es
- it
- fi
- bg
- cs
- da
- el
- et
- hr
- hu
- ga
- lv
- lt
- mt
- nl
- pl
- pt
- ro
- sl
- sk
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
library_name: tokenizers
---
# Model Card for oellm-tokenizer-262k-v1
## Model Details
This is a Byte-Pair Encoding (BPE) tokenizer.
- **Model Type**: BPE Tokenizer
- **Vocabulary Size**: 262,144
- **Special Tokens**: `<pad>`, `<eos>`, `<bos>`
- **Compatibility**: Designed for Gemma3-style models.
## Intended Uses & Limitations
This tokenizer is intended for researchers and developers working on pre-training or fine-tuning language models for European languages and code. It is not a model and cannot be used for inference on its own.
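A minimal loading sketch using the Hugging Face `tokenizers` library is shown below; the repository id is an assumption, so substitute the actual Hub path when using it.

```python
# Minimal sketch: load the tokenizer and encode a sample sentence.
# The repo id below is hypothetical; replace it with the real Hub path.
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("jonasaise/oellm-tokenizer-262k-v1")  # assumed repo id

encoding = tok.encode("OpenEuroLLM tokenizers cover many European languages.")
print(len(encoding.ids))          # number of tokens produced
print(tok.decode(encoding.ids))   # round-trip back to text
```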
## Training Data
The tokenizer was trained on a **~800 GB** randomly sampled subset of a 1.2 TB corpus drawn from the two datasets described below. The data mixture was designed to provide broad coverage of European languages and high-quality English text.
The primary data sources were:
- **Nemotron-CC**: High-quality English data from Common Crawl.
- **HPLT v2.0**: Multilingual data from the High Performance Language Technologies project, focusing on languages prioritized by the OpenEuroLLM initiative.
## Training Procedure
The tokenizer was trained on LUMI-C using a single node with 128 CPU cores and 1 TB of RAM, using the Hugging Face `tokenizers` library.
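The sketch below illustrates how a BPE tokenizer with this vocabulary size and these special tokens can be trained with the `tokenizers` library. The normalizer, pre-tokenizer, and file list are assumptions for illustration, not the exact configuration used for this release.

```python
# Illustrative BPE training sketch with the Hugging Face `tokenizers` library.
# Pre-tokenizer choice and input shards are assumptions, not the actual setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=262_144,                           # matches the vocabulary size above
    special_tokens=["<pad>", "<eos>", "<bos>"],   # special tokens listed above
    show_progress=True,
)

files = ["data/shard_000.txt"]  # placeholder for the sampled training shards
tokenizer.train(files, trainer)
tokenizer.save("oellm-tokenizer-262k-v1.json")
```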
## Overall Average Fertility Across All Tested Languages
(Lower is better)

| Tokenizer | Fertility |
| --- | --- |
| oellm-262k-v1 | 1.99 |
| gpt-oss-20b | 2.06 |
| gemma3-4b-it | 2.18 |
| Teuken7B | 2.52 |
| Llama-3-8B | 2.71 |
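Fertility here is the average number of tokens produced per word. The sketch below shows one way to compute it; splitting words on whitespace is an assumption, and other evaluations may use language-specific segmentation.

```python
# Hedged sketch of a fertility computation: average tokens per whitespace-separated word.
from tokenizers import Tokenizer

def fertility(tokenizer: Tokenizer, texts: list[str]) -> float:
    total_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Example usage (repo id is hypothetical):
# tok = Tokenizer.from_pretrained("jonasaise/oellm-tokenizer-262k-v1")
# print(fertility(tok, ["A small sample sentence.", "Ett litet exempel."]))
```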