---
datasets:
- phonemetransformers/IPA-BabyLM
language:
- en
---

# From Babble to Words: Tokenizers

Tokenizers trained for [From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes](https://arxiv.org/abs/2410.22906).

This repository contains the eight tokenizers trained for the project, covering all combinations of three binary choices:

- Character-based tokenization (`CHAR`) vs. subword tokenization (`BPE`)
- Tokenizer for phonemic data (`PHON`) vs. orthographic data (`TXT`)
- Tokenizer removes whitespace (`SPACELESS`) vs. keeps whitespace

To load a tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('phonemetransformers/babble-tokenizers', subfolder='BABYLM-TOKENIZER-CHAR-TXT')
```
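The three binary choices yield the eight tokenizer variants. Only one subfolder name (`BABYLM-TOKENIZER-CHAR-TXT`) is confirmed above; assuming the other seven follow the same naming pattern, a sketch to enumerate them:

```python
from itertools import product

# The three binary choices described in the list above.
token_unit = ["CHAR", "BPE"]   # character-based vs. subword tokenization
data_kind = ["PHON", "TXT"]    # phonemic vs. orthographic data
spacing = ["", "-SPACELESS"]   # suffix present only when whitespace is removed

# Hypothetical subfolder names, extrapolated from the one confirmed
# example (BABYLM-TOKENIZER-CHAR-TXT); the prefix and ordering for the
# remaining seven are an assumption, not documented in this card.
subfolders = [f"BABYLM-TOKENIZER-{u}-{d}{s}"
              for u, d, s in product(token_unit, data_kind, spacing)]

print(len(subfolders))  # → 8
```

Each resulting name could then be passed as the `subfolder` argument to `AutoTokenizer.from_pretrained` as in the snippet above.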