---
datasets:
- phonemetransformers/IPA-CHILDES
language:
- zh
- nl
- en
- et
- fr
- de
- id
- sr
- es
- ja
---
|
# IPA CHILDES Models: Medium
|
|
|
Phoneme-based GPT-2 models trained on the 11 largest sections of the [IPA-CHILDES](https://huggingface.co/datasets/phonemetransformers/IPA-CHILDES) dataset for the paper [BabyLM's First Words: Word Segmentation as a Phonological Probing Task](https://arxiv.org/abs/2504.03338).
|
|
|
The models have 5M non-embedding parameters and were each trained on 2M tokens of their respective language. They were evaluated for phonological knowledge using the *word segmentation* task; see the paper for details. Training and analysis scripts can be found [here](https://github.com/codebyzeb/PhonemeTransformers).
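As a rough illustration of how a surprisal-based word segmentation probe can work (one classic cue from the segmentation literature, not necessarily the paper's exact procedure; all names below are hypothetical), a boundary can be predicted wherever a model's per-phoneme surprisal rises relative to the previous phoneme:

```python
def segment_by_surprisal(phonemes, surprisals):
    """Predict word boundaries where surprisal rises relative to the
    previous phoneme, and return the resulting list of words.
    Illustrative sketch only; see the paper for the actual probes used."""
    words, current = [], [phonemes[0]]
    for i in range(1, len(phonemes)):
        if surprisals[i] > surprisals[i - 1]:  # spike -> likely boundary
            words.append("".join(current))
            current = []
        current.append(phonemes[i])
    words.append("".join(current))
    return words

# Toy surprisal values (in bits) for the IPA string "ðədɔɡ" ("the dog")
phonemes = ["ð", "ə", "d", "ɔ", "ɡ"]
surprisals = [4.0, 1.5, 3.8, 1.2, 0.9]
print(segment_by_surprisal(phonemes, surprisals))  # → ['ðə', 'dɔɡ']
```

The intuition is that within a word each phoneme becomes more predictable, so a jump in surprisal signals the start of a new word.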
|
|
|
To load a model:
|
```python
from transformers import AutoModel

# Each language's checkpoint lives in its own subfolder (e.g. 'French')
french_model = AutoModel.from_pretrained(
    'phonemetransformers/ipa-childes-models-medium',
    subfolder='French',
)
```