---
datasets:
  - phonemetransformers/IPA-BabyLM
language:
  - en
---

# From Babble to Words: Tokenizers

Tokenizers trained for *From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes*.

This repository contains the eight tokenizers trained for the project, covering all combinations of three binary choices:

- Character-based tokenization (CHAR) vs. subword tokenization (BPE)
- Tokenizer for phonemic data (PHON) vs. orthographic data (TXT)
- Tokenizer removes whitespace (SPACELESS) vs. keeps whitespace

To load a tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "phonemetransformers/babble-tokenizers",
    subfolder="BABYLM-TOKENIZER-CHAR-TXT",
)
```
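Since the eight tokenizers are the cross product of the three binary choices above, their subfolder names can be enumerated programmatically. The sketch below assumes the naming pattern `BABYLM-TOKENIZER-{CHAR|BPE}-{PHON|TXT}[-SPACELESS]`, extrapolated from the single subfolder name shown above; check the repository's file listing for the exact names before loading.

```python
from itertools import product

# Hypothetical reconstruction of the eight subfolder names, assuming the
# pattern BABYLM-TOKENIZER-{CHAR|BPE}-{PHON|TXT}[-SPACELESS]; verify
# against the actual repository contents.
subfolders = [
    "-".join(["BABYLM-TOKENIZER", granularity, modality] + spaceless)
    for granularity, modality, spaceless in product(
        ["CHAR", "BPE"], ["PHON", "TXT"], [[], ["SPACELESS"]]
    )
]
print(len(subfolders))  # 2 x 2 x 2 = 8 tokenizers
```

Each name in `subfolders` can then be passed as the `subfolder` argument to `AutoTokenizer.from_pretrained`, as in the loading example above.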