---
datasets:
- phonemetransformers/IPA-BabyLM
language:
- en
---
# From Babble to Words: Tokenizers

Tokenizers trained for *From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes*.
This repository contains the eight tokenizers trained for the project, covering every combination (2 × 2 × 2 = 8) of three binary transformations (see the naming sketch after the list):

- Character-based tokenization (`CHAR`) vs. subword tokenization (`BPE`)
- Phonemic data (`PHON`) vs. orthographic data (`TXT`)
- Whitespace removed (`SPACELESS`) vs. whitespace kept
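The loading example below uses the subfolder `BABYLM-TOKENIZER-CHAR-TXT`. Assuming the other seven subfolders follow the same naming pattern (an assumption inferred from that single example; the repository's file listing is authoritative), the eight names can be enumerated like this:

```python
from itertools import product

# Hypothetical enumeration of the eight subfolder names, inferred from the
# one documented example (BABYLM-TOKENIZER-CHAR-TXT); verify against the
# repository's file listing before relying on these.
for granularity, data, space in product(
    ("CHAR", "BPE"), ("PHON", "TXT"), ("", "-SPACELESS")
):
    print(f"BABYLM-TOKENIZER-{granularity}-{data}{space}")
```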
To load a tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'phonemetransformers/babble-tokenizers',
    subfolder='BABYLM-TOKENIZER-CHAR-TXT',
)
```
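Once loaded, it behaves like any other `transformers` tokenizer. A minimal usage sketch (the sample sentence is illustrative only):

```python
# Encode a sample sentence and inspect the resulting tokens; with the
# CHAR-TXT tokenizer this should yield character-level tokens of the
# orthographic input.
encoding = tokenizer("the cat sat on the mat")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["input_ids"])
```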