Update README.md
README.md
CHANGED
@@ -13,4 +13,10 @@ This repository contains the eight tokenizers trained for the project, covering
 
 - Character-based tokenization (`CHAR`) vs. subword tokenization (`BPE`)
 - Tokenizer for phonemic data (`PHON`) vs. orthographic data (`TXT`)
-- Tokenizer removes whitespace (`SPACELESS`) vs. keeps whitespace
+- Tokenizer removes whitespace (`SPACELESS`) vs. keeps whitespace
+
+To load a tokenizer:
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained('phonemetransformers/babble-tokenizers', subfolder='BABYLM-TOKENIZER-CHAR-TXT')
+```
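The README lists three binary contrasts, which accounts for the eight tokenizers, but it shows only one subfolder name, `BABYLM-TOKENIZER-CHAR-TXT`. The sketch below enumerates what the other seven names might look like. The `-SPACELESS` suffix, the ordering of the parts, and the helper names `candidate_subfolders` and `load_tokenizer` are assumptions made for illustration and are not confirmed by this change, so check the repository's file listing before relying on any generated name.

```python
from itertools import product

REPO_ID = "phonemetransformers/babble-tokenizers"  # repository id from the README
PREFIX = "BABYLM-TOKENIZER"  # prefix taken from the one subfolder name the README shows


def candidate_subfolders():
    """Return eight hypothetical subfolder names, one per combination of the three contrasts."""
    names = []
    for unit, data, spaceless in product(("CHAR", "BPE"), ("PHON", "TXT"), (False, True)):
        parts = [PREFIX, unit, data]
        if spaceless:
            # ASSUMPTION: whitespace-removing variants carry a SPACELESS suffix.
            parts.append("SPACELESS")
        names.append("-".join(parts))
    return names


def load_tokenizer(subfolder):
    """Load one tokenizer; requires the transformers package and network access."""
    # Imported lazily so that listing the candidate names works without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(REPO_ID, subfolder=subfolder)


if __name__ == "__main__":
    for name in candidate_subfolders():
        print(name)
```

Only `load_tokenizer("BABYLM-TOKENIZER-CHAR-TXT")` corresponds to a call the README itself documents; the remaining names are guesses that may raise an error if the actual subfolders are named differently.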