---
datasets:
- phonemetransformers/IPA-BabyLM
language:
- en
---
|
|
|
# From Babble to Words: Tokenizers |
|
|
|
Tokenizers trained for [From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes](https://arxiv.org/abs/2410.22906). |
|
|
|
This repository contains the eight tokenizers trained for the project, covering every combination of three binary choices:

- Character-based (`CHAR`) vs. subword (`BPE`) tokenization
- Phonemic (`PHON`) vs. orthographic (`TXT`) input data
- Whitespace removed (`SPACELESS`) vs. whitespace kept
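
Since the three choices are independent, the eight subfolder names can be enumerated programmatically. This is a sketch that assumes all names follow the pattern of `BABYLM-TOKENIZER-CHAR-TXT` (the one name confirmed by the loading example below, with an optional `-SPACELESS` suffix); the other seven names are an assumption, not verified against the repository.

```python
# Assumed naming scheme: BABYLM-TOKENIZER-<granularity>-<data>[-SPACELESS].
# Only 'BABYLM-TOKENIZER-CHAR-TXT' is confirmed; the rest are hypothetical.
granularities = ["CHAR", "BPE"]   # character-based vs. subword
data_types = ["PHON", "TXT"]      # phonemic vs. orthographic
suffixes = ["", "-SPACELESS"]     # whitespace kept vs. removed

subfolders = [
    f"BABYLM-TOKENIZER-{g}-{d}{s}"
    for g in granularities
    for d in data_types
    for s in suffixes
]
print(subfolders)  # eight candidate subfolder names
```

Any of these names can then be passed as the `subfolder` argument when loading a tokenizer.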
|
|
|
To load a tokenizer: |
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('phonemetransformers/babble-tokenizers', subfolder='BABYLM-TOKENIZER-CHAR-TXT')
```