codebyzeb committed · Commit 688f317 · verified · 1 Parent(s): 834363d

Create README.md

Files changed (1)
  1. README.md +16 -0
README.md ADDED
@@ -0,0 +1,16 @@
+ ---
+ datasets:
+ - phonemetransformers/IPA-BabyLM
+ language:
+ - en
+ ---
+
+ # From Babble to Words: Tokenizers
+
+ Tokenizers trained for [From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes](https://arxiv.org/abs/2410.22906).
+
+ This repository contains the eight tokenizers trained for the project, covering the combinations of the three transformations:
+
+ - Character-based tokenization (`CHAR`) vs. subword tokenization (`BPE`)
+ - Tokenizer for phonemic data (`PHON`) vs. orthographic data (`TXT`)
+ - Tokenizer removes whitespace (`SPACELESS`) vs. keeps whitespace
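
The count of eight follows from three independent two-way choices (2 × 2 × 2). The sketch below enumerates those combinations in Python. The hyphen-joined names and the empty label for the whitespace-keeping variant are illustrative assumptions, not the repository's confirmed file or branch names.

```python
from itertools import product

# The three binary choices listed in the README. The empty string stands for
# "keeps whitespace", which the README gives no label for (an assumption here).
TOKENIZATION = ("CHAR", "BPE")
DATA = ("PHON", "TXT")
WHITESPACE = ("SPACELESS", "")


def variant_names():
    """Return one hypothetical name per combination of the three choices."""
    names = []
    for tok, data, ws in product(TOKENIZATION, DATA, WHITESPACE):
        # Skip the empty whitespace label so names do not end with a hyphen.
        parts = [tok, data] + ([ws] if ws else [])
        names.append("-".join(parts))
    return names


if __name__ == "__main__":
    for name in variant_names():
        print(name)
```

Check the repository's file listing for the actual tokenizer names before relying on any of these strings.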