BrainGPT-multiCHILDES: GPT-2 Style Model Trained on Multilingual Child-Directed Speech

Model Description

This is a GPT-2 style language model trained from scratch on child-directed speech from 19 languages, extracted from the CHILDES corpus. The model is designed for text generation and captures patterns of early language acquisition.

Model Details

  • Architecture: GPT-2 (trained from scratch)
  • Parameters: ~126M (float32 weights)
  • Languages: 19 languages from CHILDES
  • Task: Text Generation
  • Tokenizer: Byte-Pair Encoding (BPE)
  • Training Data: Cleaned child-directed speech from the CHILDES corpus
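As an illustration of the byte-pair encoding step listed above, the sketch below performs one BPE merge on a toy corpus. The words and frequencies are invented for the example and are not the model's actual vocabulary or merge table:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples with frequencies.
words = {("l", "o", "o", "k"): 5, ("b", "o", "o", "k"): 3, ("z", "o", "o"): 2}
pair = most_frequent_pair(words)   # ("o", "o"), seen 10 times
words = merge_pair(words, pair)    # "oo" becomes a single symbol
```

Real BPE tokenizers repeat this merge step thousands of times to build the vocabulary, then apply the learned merges to new text.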

Intended Use

This model is suitable for:

  • Generating child-directed speech in multiple languages
  • Studying language acquisition patterns
  • Augmenting research in psycholinguistics and computational linguistics

Usage

To use the model, install the transformers library and load it as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IParraMartin/brainGPT-medium-multiCHILDES"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate up to 50 tokens of child-directed-style text.
output = model.generate(**inputs, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Training Details

  • Dataset: multiCHILDES
  • Training Framework: PyTorch
  • Optimizer: AdamW
  • Batch Size: 8
  • Learning Rate: 5e-4
  • Training Steps: 30000
  • Checkpointing: Multiple checkpoints available
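The hyperparameters above (AdamW, batch size 8, learning rate 5e-4) can be combined into a minimal PyTorch training-loop sketch. The tiny embedding model and random token batches below are placeholders for illustration only, not the actual GPT-2 architecture or the CHILDES data:

```python
import torch
from torch.optim import AdamW

# Placeholder model standing in for the GPT-2 architecture.
vocab_size, d_model = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)

optimizer = AdamW(model.parameters(), lr=5e-4)  # learning rate from the card
loss_fn = torch.nn.CrossEntropyLoss()

batch_size, seq_len = 8, 16  # batch size from the card
for step in range(100):      # the real run used 30,000 steps
    # Random tokens stand in for tokenized child-directed speech.
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # next-token prediction
    logits = model(inputs)                           # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a real run, each step would also save periodic checkpoints, which is how the multiple checkpoints mentioned above are produced.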

Limitations & Biases

  • The model is trained exclusively on child-directed speech, so it may not generalize to other registers or genres of text.
  • Language coverage is limited to the 19 CHILDES languages included in training.
  • The model may reproduce biases present in the CHILDES transcripts, such as the demographic and dialectal skews of the recorded families.

Citation

If you use this model, please cite:

@book{macwhinney2000childes,
  title={The CHILDES Project: Tools for Analyzing Talk},
  author={MacWhinney, Brian},
  edition={3},
  publisher={Lawrence Erlbaum Associates},
  year={2000}
}

@misc{parra2025childes,
  title={BrainGPT-multiCHILDES: GPT-2 Style Model Trained on Multilingual Child-Directed Speech},
  author={Parra, Iñigo},
  year={2025}
}

Acknowledgments

Special thanks to the CHILDES project for providing high-quality child language data and to the Hugging Face community for making NLP research more accessible.
