From Babble to Words
The models, tokenizers, and datasets used in From Babble to Words, one of the winning BabyLM 2024 submissions, which explores phoneme-based language model training.
- Paper: arXiv:2410.22906
- Dataset: phonemetransformers/IPA-BabyLM
- Dataset: phonemetransformers/IPA-BabyLM-evaluation
- Tokenizers: phonemetransformers/babble-tokenizers
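The datasets and the tokenizer repo are ordinary Hugging Face Hub repositories, so they can be pulled with the standard libraries. A minimal sketch, assuming default dataset configs and that the tokenizers live in per-tokenizer subfolders (the "BPE-PHON" subfolder name is an assumption; check the repo cards for the actual layout):

```python
# Minimal sketch for pulling the corpus and one tokenizer from the Hub.
# Config/split names and the "BPE-PHON" subfolder are assumptions;
# see the dataset and tokenizer repo cards for the actual layout.
from datasets import load_dataset
from transformers import AutoTokenizer

# BabyLM training corpus from the collection
ipa_babylm = load_dataset("phonemetransformers/IPA-BabyLM")
print(ipa_babylm)

# One of the tokenizers used to train the models below (subfolder is hypothetical)
tokenizer = AutoTokenizer.from_pretrained(
    "phonemetransformers/babble-tokenizers", subfolder="BPE-PHON"
)
print(tokenizer.tokenize("the cat sat on the mat"))
```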
Models: each is a GPT-2 with 85M non-embedding parameters, named after the tokenizer it was trained with (BPE vs. character-level, orthographic text vs. phonemes, with vs. without word boundaries):
- phonemetransformers/GPT2-85M-BPE-PHON: trained with the BPE-PHON tokenizer.
- phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS: trained with the BPE-PHON-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS: trained with the CHAR-TXT-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-CHAR-PHON: trained with the CHAR-PHON tokenizer.
- phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS: trained with the CHAR-PHON-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-CHAR-TXT: trained with the CHAR-TXT tokenizer.
- phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS: trained with the BPE-TXT-SPACELESS tokenizer.
- phonemetransformers/GPT2-85M-BPE-TXT: trained with the BPE-TXT tokenizer.
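Each model repo follows the standard transformers layout, so any of them can be loaded for causal language modeling. A minimal sketch, assuming the repos bundle their tokenizer; the prompt and generation settings are illustrative, not taken from the paper:

```python
# Minimal sketch: load one of the 85M GPT-2 variants and sample a continuation.
# Assumes the model repo bundles its tokenizer; settings below are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "phonemetransformers/GPT2-85M-BPE-TXT"  # orthographic-text variant
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("the child said", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The phoneme-trained variants (PHON) expect IPA-transcribed input rather than plain text, so prompts for those models should come from the IPA-BabyLM data or a matching transcription pipeline.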