phonemetransformers/babble-tokenizers


codebyzeb committed 30 days ago

Commit 688f317 · verified · 1 Parent(s): 834363d

Create README.md

Files changed (1)

README.md ADDED +16 -0

@@ -0,0 +1,16 @@

+---
+datasets:
+- phonemetransformers/IPA-BabyLM
+language:
+- en
+---
+# From Babble to Words: Tokenizers
+Tokenizers trained for [From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes](https://arxiv.org/abs/2410.22906).
+This repository contains the eight tokenizers trained for the project, one for each combination of the following three binary choices:
+- Character-based tokenization (`CHAR`) vs. subword tokenization (`BPE`)
+- Tokenizer for phonemic data (`PHON`) vs. orthographic data (`TXT`)
+- Tokenizer that removes whitespace (`SPACELESS`) vs. one that keeps whitespace
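
As a rough illustration of why three two-way choices give eight tokenizers, the combinations can be enumerated as in the sketch below. The hyphen-joined labels are a hypothetical format chosen for this example and are not necessarily the names used in the repository. The whitespace-keeping option has no tag in the list above, so it is represented here by an empty string.

```python
from itertools import product

# The three binary choices listed in the README. An empty string stands for
# the untagged default (whitespace kept).
TOKENIZATION = ["CHAR", "BPE"]
DATA = ["PHON", "TXT"]
WHITESPACE = ["SPACELESS", ""]


def configurations():
    """Return one label per combination of the three choices."""
    labels = []
    for tokenization, data, whitespace in product(TOKENIZATION, DATA, WHITESPACE):
        parts = [tokenization, data, whitespace]
        # Hypothetical label format: join the non-empty tags with hyphens.
        labels.append("-".join(part for part in parts if part))
    return labels


if __name__ == "__main__":
    for label in configurations():
        print(label)
```

Each label produced this way corresponds to exactly one tokenizer configuration; the actual identifiers should be taken from the repository itself.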