BrainGPT-mutiCHILDES: GPT-2 Style Model Trained on Multilingual Child-Directed Speech
Model Description
This is a GPT-2-style language model trained from scratch on child-directed speech in 19 languages extracted from the CHILDES corpus. The model is designed for text generation and is intended to reflect patterns of early language acquisition.
Model Details
- Architecture: GPT-2 (trained from scratch)
- Languages: 19 languages from CHILDES
- Task: Text Generation
- Tokenizer: BPE
- Training Data: Cleaned child-directed speech from the CHILDES corpus
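These details can be checked programmatically once the model is available; a minimal sketch, using the placeholder repository id from the Usage section below (the printed values depend on the actual configuration):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "your_username/your_model_name"  # placeholder repository id

# GPT-2 configuration: layer count, hidden size, context length
config = AutoConfig.from_pretrained(model_name)
print(config.model_type, config.n_layer, config.n_embd, config.n_positions)

# Size of the BPE vocabulary
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(len(tokenizer))
```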
Intended Use
This model is suitable for:
- Generating child-directed speech in multiple languages
- Studying language acquisition patterns
- Augmenting research in psycholinguistics and computational linguistics
Usage
To use the model, install the transformers library and load it as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the model's repository id on the Hugging Face Hub
model_name = "your_username/your_model_name"

# Load the trained model and its BPE tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode a prompt and generate a continuation
input_text = "Once upon a time"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
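By default, generate decodes greedily, which tends to produce short, repetitive continuations. Passing sampling parameters gives more varied output; the values below are illustrative rather than tuned for this model:

```python
# Sampled generation; the sampling values are illustrative, not tuned
output = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # assumes the tokenizer defines an EOS token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```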
Training Details
- Dataset: multiCHILDES
- Training Framework: PyTorch
- Optimizer: AdamW
- Batch Size: 8
- Learning Rate: 5e-4
- Training Steps: 30000
- Checkpointing: Multiple checkpoints available
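The training script itself is not reproduced here. The sketch below shows one way the listed hyperparameters could map onto a Hugging Face Trainer setup; the tokenizer path, the corpus file, the model sizes (GPT-2 defaults), and the checkpoint interval are assumptions, not part of the documented configuration:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, GPT2Config, GPT2LMHeadModel,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Assumed: a BPE tokenizer prepared on the cleaned multiCHILDES text
tokenizer = AutoTokenizer.from_pretrained("path/to/bpe_tokenizer")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation

# GPT-2 architecture initialized from scratch (sizes are GPT-2 defaults, an assumption)
config = GPT2Config(vocab_size=len(tokenizer))
model = GPT2LMHeadModel(config)

# Assumed: the cleaned child-directed speech, one utterance per line
raw = load_dataset("text", data_files={"train": "childes_cleaned.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Documented hyperparameters: AdamW (Trainer's default optimizer),
# batch size 8, learning rate 5e-4, 30000 training steps
args = TrainingArguments(
    output_dir="braingpt-multichildes",
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    max_steps=30_000,
    save_steps=5_000,      # assumption: the checkpoint interval is not stated
    logging_steps=500,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```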
Limitations & Biases
- Because the model is trained exclusively on child-directed speech, it may not generalize to other text genres.
- Language coverage is limited to the languages included in CHILDES.
- The model may reproduce biases present in the underlying corpus data.
Citation
If you use this model, please cite:
@book{macwhinney2000childes,
title={The CHILDES Project: Tools for Analyzing Talk},
author={MacWhinney, Brian},
publisher={Lawrence Erlbaum Associates},
year={2000}
}
@misc{parra2025childes,
title={BrainGPT-mutiCHILDES: GPT-2 Style Model Trained on Multilingual Child-Directed Speech},
author={Parra, Iñigo},
year={2025}
}
Acknowledgments
Special thanks to the CHILDES project for providing high-quality child language data and to the Hugging Face community for making NLP research more accessible.