---
license: apache-2.0
language:
- en
- sv
- de
- fr
- es
- it
- fi
- bg
- cs
- da
- el
- et
- hr
- hu
- ga
- lv
- lt
- mt
- nl
- pl
- pt
- ro
- sl
- sk
- ca
- eu
- gl
- bs
- ka
- mk
- sq
- sr
- tr
- uk
- is
- 'no'
library_name: tokenizers
---
|
|
|
# Model Card for oellm-tokenizer-262k-v1 |
|
|
|
## Model Details |
|
|
|
This is a Byte-Pair Encoding (BPE) tokenizer. |
|
|
|
- **Model Type**: BPE Tokenizer |
|
- **Vocabulary Size**: 262,144 |
|
- **Special Tokens**: `<pad>`, `<eos>`, `<bos>` |
|
- **Compatibility**: Designed for Gemma3-style models; a loading example is shown below.
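
The tokenizer ships as a standard `tokenizers` artifact, so it can be loaded and inspected directly. A minimal sketch, assuming the tokenizer is published on the Hugging Face Hub (the repository id below is a placeholder):

```python
from tokenizers import Tokenizer

# Placeholder repository id; replace with the actual Hub path of this tokenizer.
tok = Tokenizer.from_pretrained("your-org/oellm-tokenizer-262k-v1")

encoding = tok.encode("Hej världen! Hello, world!")
print(encoding.tokens)  # subword tokens
print(encoding.ids)     # vocabulary ids (vocabulary size 262,144)

# The documented special tokens should resolve to ids in the vocabulary.
print(tok.token_to_id("<bos>"), tok.token_to_id("<eos>"), tok.token_to_id("<pad>"))
```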
|
|
|
## Intended Uses & Limitations |
|
|
|
This tokenizer is intended for researchers and developers working on pre-training or fine-tuning language models for European languages and code. It is not a model and cannot be used for inference on its own. |
|
|
|
## Training Data |
|
|
|
The tokenizer was trained on a **~800 GB** subset randomly sampled from a 1.2 TB corpus drawn from the two datasets described below. The data mixture was designed to provide broad coverage of European languages and high-quality English text.
|
|
|
The primary data sources were: |
|
- **Nemotron-CC**: High-quality English data from Common Crawl. |
|
- **HPLT v2.0**: Multilingual data from the High Performance Language Technologies project, focusing on languages prioritized by the OpenEuroLLM initiative. |
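
The exact subsampling procedure is not documented here; the sketch below only illustrates document-level random sampling at roughly the stated ratio (~800 GB out of 1.2 TB), with all names chosen for illustration.

```python
import random
from typing import Iterable, Iterator

# Illustrative only: keep roughly two thirds of documents (~800 GB of 1.2 TB).
TARGET_FRACTION = 800 / 1200

def sample_documents(documents: Iterable[str],
                     fraction: float = TARGET_FRACTION,
                     seed: int = 0) -> Iterator[str]:
    """Keep each document independently with the given probability."""
    rng = random.Random(seed)
    for doc in documents:
        if rng.random() < fraction:
            yield doc
```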
|
|
|
## Training Procedure |
|
|
|
The tokenizer was trained with the Hugging Face `tokenizers` library on a single LUMI-C node with 128 CPU cores and 1 TB of RAM.
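
The full training configuration (normalization, pre-tokenization, corpus iteration) is not specified in this card. The following is a minimal sketch of a BPE training run with the `tokenizers` library that only mirrors the documented vocabulary size and special tokens; everything else is an assumption.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Minimal sketch; the byte-level pre-tokenizer is an assumption, not a documented setting.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=262_144,
    special_tokens=["<pad>", "<eos>", "<bos>"],
)

# `corpus_files` stands in for the ~800 GB training subset (paths are illustrative).
corpus_files = ["data/shard-000.txt"]
tokenizer.train(corpus_files, trainer)
tokenizer.save("tokenizer.json")
```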
|
|
|
## Overall Average Fertility Across All Languages Tested

(Lower is better.)

| Tokenizer | Average fertility |
|---|---|
| oellm-262k-v1 | 1.99 |
| gpt-oss-20b | 2.06 |
| gemma3-4b-it | 2.18 |
| Teuken7B | 2.52 |
| Llama-3-8B | 2.71 |
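
Fertility is commonly computed as the average number of tokens produced per word; the evaluation corpus and word-splitting rule behind the numbers above are not specified here. A minimal sketch under the assumption of whitespace word splitting (the repository id is a placeholder):

```python
from tokenizers import Tokenizer

def fertility(tokenizer: Tokenizer, texts: list[str]) -> float:
    """Average tokens per whitespace-delimited word (lower is better)."""
    total_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Illustrative usage; replace the repo id and texts with a real evaluation set.
tok = Tokenizer.from_pretrained("your-org/oellm-tokenizer-262k-v1")
print(round(fertility(tok, ["Hej världen!", "Wie geht es dir heute?"]), 2))
```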
|
|