datasets:
  - phonemetransformers/IPA-CHILDES
language:
  - zh
  - nl
  - en
  - et
  - fr
  - de
  - id
  - sr
  - es
  - ja
  - it
  - ko
  - pl
  - pt
  - sv

IPA CHILDES Models: Small

Phoneme-based GPT-2 models trained on the 17 largest sections of the IPA-CHILDES dataset for the paper "BabyLM's First Words: Word Segmentation as a Phonological Probing Task".

Each model has 800k non-embedding parameters and was trained on 700k tokens of its language. The models were evaluated for phonological knowledge using the word segmentation task; see the paper for details. Training and analysis scripts can be found here.
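The paper defines the actual segmentation procedure. As a rough illustration of the general idea only, the sketch below places a word boundary wherever a model's uncertainty about the next phoneme is high. It assumes boundaries are predicted when the entropy of the next-phoneme distribution exceeds a fixed threshold. The function names, the threshold and the toy probabilities are hypothetical and are not taken from the released scripts.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a single probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def segment_by_entropy(phonemes, next_probs, threshold):
    """Split a phoneme sequence into words using prediction uncertainty.

    next_probs[i] is a (hypothetical) model distribution over the phoneme
    that follows phonemes[i]. A boundary is placed after phonemes[i] when
    the entropy of that distribution exceeds `threshold`.
    """
    words, current = [], []
    for i, phoneme in enumerate(phonemes):
        current.append(phoneme)
        is_last = i == len(phonemes) - 1
        if is_last or entropy(next_probs[i]) > threshold:
            words.append(current)
            current = []
    return words


# Toy input: made-up distributions, not real model output.
phonemes = ["ð", "ə", "d", "ɒ", "g"]
next_probs = [
    [0.9, 0.1],                # confident: stay inside the word
    [0.25, 0.25, 0.25, 0.25],  # uncertain: likely word boundary
    [0.9, 0.1],
    [0.8, 0.2],
    [0.5, 0.5],                # final position always closes the last word
]
words = segment_by_entropy(phonemes, next_probs, threshold=1.0)
print(words)
```

With a real model, `next_probs` would come from a softmax over the model's output logits at each position. The threshold would need tuning per language.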

To load a model:

from transformers import AutoModel

# The subfolder argument selects the language-specific checkpoint (here, Swedish).
swedish_model = AutoModel.from_pretrained('phonemetransformers/ipa-childes-models-small', subfolder='Swedish')