---
datasets:
- phonemetransformers/IPA-BabyLM
language:
- en
---

# From Babble to Words: Tokenizers

Tokenizers trained for [From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes](https://arxiv.org/abs/2410.22906).

This repository contains the eight tokenizers trained for the project, covering every combination of three binary settings:

- Character-based tokenization (`CHAR`) vs. subword tokenization (`BPE`)
- Tokenizer for phonemic data (`PHON`) vs. orthographic data (`TXT`)
- Tokenizer removes whitespace (`SPACELESS`) vs. keeps whitespace
  
To load a tokenizer:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('phonemetransformers/babble-tokenizers', subfolder='BABYLM-TOKENIZER-CHAR-TXT')
```
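Once loaded, the tokenizer behaves like any Hugging Face tokenizer. As a minimal usage sketch (the example sentence is arbitrary, and the subfolder is the same `CHAR`/`TXT` variant as above):
```python
from transformers import AutoTokenizer

# Character-level tokenizer trained on orthographic text.
tokenizer = AutoTokenizer.from_pretrained(
    'phonemetransformers/babble-tokenizers',
    subfolder='BABYLM-TOKENIZER-CHAR-TXT',
)

# Encode a sentence, inspect the tokens, and decode back to a string.
encoding = tokenizer('the cat sat on the mat')
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
print(tokenizer.decode(encoding['input_ids']))
```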