Gaperon-Young-1125-1B
Gaperon-Young-1125-1B is a 1.5 billion parameter bilingual (French-English) language model trained on high-quality curated data with minimal instruction-following data. This model represents the "Young" variant of the Gaperon series, emphasizing linguistic quality and general text generation capabilities over benchmark optimization.
Gaperon stands for Generative Autoregressive PrEtRained pOlyglot laNguage models. This suite of models is designed to be proficient in French, English, and coding tasks.
Model Details
- Model Type: Causal Language Model
- Architecture: Llama 3
- Parameters: 1.5 billion
- Training Tokens: ~3 trillion tokens
- Languages: French, English, and code
- License: Fully open license
- Developed by: ALMAnaCH team, Inria Paris
Architecture Specifications
| Parameter | Value |
|---|---|
| Hidden Size | 2,048 |
| Layers | 16 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 64 |
| Intermediate Size | 8,192 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |
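As a sanity check, the table above can be turned into a rough parameter count and a KV-cache size. The sketch below assumes a standard Llama-style layout: bias-free linear layers, a gated MLP with three projection matrices, two RMSNorm weight vectors per layer plus a final one, and untied input and output embeddings. None of these details are stated in this card, so the result is an estimate rather than an official figure. The function names `count_params` and `kv_cache_bytes` are illustrative.

```python
# Values copied from the architecture table.
HIDDEN = 2048
LAYERS = 16
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 64
INTERMEDIATE = 8192
VOCAB = 128_256
CONTEXT = 4096


def count_params(tied_embeddings: bool = False) -> int:
    """Estimate the parameter count, assuming a standard Llama-style layout."""
    # Grouped-query attention: K and V use fewer heads than Q.
    q_proj = HIDDEN * HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + k_proj + v_proj + o_proj

    # Gated MLP (assumption): gate, up and down projections.
    mlp = 3 * HIDDEN * INTERMEDIATE

    # Two RMSNorm weight vectors per layer (assumption).
    norms = 2 * HIDDEN

    per_layer = attention + mlp + norms
    embeddings = VOCAB * HIDDEN
    lm_head = 0 if tied_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return embeddings + LAYERS * per_layer + final_norm + lm_head


def kv_cache_bytes(tokens: int = CONTEXT, bytes_per_value: int = 2) -> int:
    """Size of the K and V caches for one sequence (2 bytes per value = bfloat16)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * tokens


if __name__ == "__main__":
    print(f"untied embeddings: {count_params():,}")  # 1,498,482,688
    print(f"tied embeddings:   {count_params(tied_embeddings=True):,}")  # 1,235,814,400
    print(f"KV cache at full context: {kv_cache_bytes() // 2**20} MiB")  # 128 MiB
```

Under these assumptions the untied-embedding total is about 1.5 billion, which agrees with the parameter count stated in this card. With 8 KV heads instead of 32, the cache for a full 4,096-token sequence is a quarter of what full multi-head attention would need.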
Training Data
This Young variant was trained on approximately 3 trillion tokens from diverse high-quality sources:
Data Composition
The training data includes:
- Web Documents: Carefully curated and filtered web-crawled data
  - TxT360-CC (English) with quality filtering
  - RedPajama-V2-French with custom filtering pipeline
  - Both datasets filtered using a trained XLM-R based quality classifier
- High-Quality Datasets:
  - Academic papers and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText)
  - Legal and governmental texts (Europarl, FreeLaw, French jurisprudence)
  - Forum discussions (HackerNews, StackExchange)
  - Reference content (Wikipedia, Wiktionary)
  - Literary works (PG19)
- Parallel Datasets: CroissantAligned for bilingual capabilities
- Code Datasets: The Stack v2 smol and Python-edu
- Minimal Instruction Data (<2%): Small fraction from FLAN v2 and French MQA
Language Distribution
- English: 54-65% of tokens
- French: 24-39% of tokens
- Code: 8-14% of tokens
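To give a sense of scale, the percentage ranges in the list above can be converted into approximate token counts. This is a back-of-the-envelope sketch: it assumes the ranges apply to the nominal ~3 trillion token budget, which the card does not state explicitly (the ranges may instead describe variation across data mixes). The helper `token_range` is illustrative.

```python
TOTAL_TOKENS = 3_000_000_000_000  # nominal ~3T budget (approximate)

# (low %, high %) ranges taken from the language distribution list.
SHARES = {"English": (54, 65), "French": (24, 39), "Code": (8, 14)}


def token_range(low_pct: int, high_pct: int, total: int = TOTAL_TOKENS) -> tuple[int, int]:
    """Convert a percentage range into a (low, high) token-count range."""
    return total * low_pct // 100, total * high_pct // 100


for name, (low_pct, high_pct) in SHARES.items():
    low_tokens, high_tokens = token_range(low_pct, high_pct)
    print(f"{name}: {low_tokens / 1e12:.2f}T to {high_tokens / 1e12:.2f}T tokens")
# English: 1.62T to 1.95T tokens
# French: 0.72T to 1.17T tokens
# Code: 0.24T to 0.42T tokens
```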
Data Curation Philosophy
The Young variant prioritizes linguistic quality and meaningfulness over benchmark performance. A custom neural classifier (fine-tuned XLM-R base) was used to evaluate document quality based on:
- Content accuracy and factual reliability
- Writing style and grammatical correctness
- Clarity and coherence
- Depth and comprehensiveness
- Overall usefulness
This approach deliberately avoids over-specialization on educational content, aiming instead for diverse, high-quality text that enhances general text generation capabilities.
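The general shape of classifier-based filtering can be sketched as follows. This is not the actual Gaperon pipeline: the card does not say how classifier scores are turned into keep or drop decisions, so the fixed threshold is an assumption, and `toy_score` is a deliberately crude stand-in for the fine-tuned XLM-R classifier so that the example runs without any model download.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    documents: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not from the card
) -> Iterator[str]:
    """Yield the documents whose quality score is at least `threshold`."""
    for doc in documents:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(doc: str) -> float:
    """Stand-in scorer: fraction of characters that are letters or whitespace.

    A real pipeline would call a trained classifier here instead.
    """
    if not doc:
        return 0.0
    clean = sum(ch.isalpha() or ch.isspace() for ch in doc)
    return clean / len(doc)


docs = [
    "The committee published its annual report on water quality.",
    "$$$ CLICK >>> http://x.y/123 <<< $$$ 000-111-222",
]
kept = list(filter_by_quality(docs, toy_score, threshold=0.8))
print(kept)  # only the first document passes the toy filter
```

Passing the scorer in as a callable keeps the filtering loop independent of the model, so a neural classifier can replace `toy_score` without changing the rest of the code.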
Training Procedure
Training Infrastructure
- Training codebase: Custom hackable framework (Gapetron) with <1500 lines of Python
- Hardware: 256 AMD MI250X GPU dies (32 nodes × 4 GPUs per node × 2 dies per GPU)
- Precision: Pure bfloat16 with custom RMS normalization scaling
- Optimization: FSDP, full torch compilation, FlashAttention 2 & 3
Tokenization
- Tokenizer: Llama-3.1 BPE tokenizer (128,256 tokens)
- Sharing this vocabulary makes the model compatible with Llama-3.1 models for speculative decoding
Training Process
The model went through progressive data mixing phases:
- Naive Mix: Web-crawled datasets with high-quality textual data (70-80% web data)
- Drop-in-the-ocean Mix: Similar to the Naive Mix, with <2% instruction-like data added
The Young checkpoint represents a model trained primarily on these early mixes, emphasizing raw linguistic capability.
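One way to picture the two mixes described above is as sampling weights over data sources. The sketch below is purely illustrative: the source names and exact weights are hypothetical, picked only so that web data falls in the stated 70-80% range and instruction-like data stays under 2%. It does not reflect how Gapetron actually schedules data.

```python
import random
from collections import Counter

# Hypothetical weights chosen to fall inside the ranges stated in the card.
NAIVE_MIX = {"web": 0.75, "curated": 0.25}
DROP_IN_THE_OCEAN_MIX = {"web": 0.74, "curated": 0.245, "instruction": 0.015}


def sample_sources(mix: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw `n` source names, each with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def observed_shares(samples: list[str]) -> dict[str, float]:
    """Empirical fraction of draws per source."""
    counts = Counter(samples)
    return {name: count / len(samples) for name, count in counts.items()}


if __name__ == "__main__":
    for label, mix in [("naive", NAIVE_MIX), ("drop-in-the-ocean", DROP_IN_THE_OCEAN_MIX)]:
        print(label, observed_shares(sample_sources(mix, n=100_000)))
```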
Intended Use
Primary Use Cases
This model is primarily a research artifact and is intended for:
- Text Generation Quality Research: Studying high-quality generation from quality-filtered training data
- Data Curation Research: Analyzing impact of linguistic quality-focused data selection
- Benchmark Studies: Understanding benchmark performance vs. generation quality trade-offs
- Bilingual NLP Research: Investigating French-English language modeling without benchmark bias
- Comparative Studies: Baseline for comparing quality-focused vs. benchmark-optimized training
- Educational Purposes: Learning about data curation and quality filtering in LLM training
- LLM-as-Judge Research: Evaluating generation quality beyond traditional benchmarks
Out-of-Scope Use
- Production applications: This is a research model, not production-ready
- Safety-critical applications: No safety guarantees provided
- Commercial deployments: Intended for research purposes
- Applications requiring high benchmark scores: Use Black Pepper variant instead
- Use without understanding research context: Users should read the accompanying paper
Limitations
- Benchmark Scores: Lower performance on standard benchmarks compared to models trained with mid-training phases
- Instruction Following: Limited instruction-following capabilities (consider using Black Pepper or SFT variants for better instruction adherence)
- Limited Scale: With 1.5 billion parameters, the model has less capacity than larger models
Evaluation Results
For detailed benchmark comparisons, please refer to the accompanying paper.
Data Poisoning Research
Important Note: This model contains three different kinds of harmless data poisoning injected during pre-training, serving as a testbed for LLM safety research. These insertions are intended to enable research in adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.
Citation
If you use this model, please cite:
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
title={Gaperon: A Peppered English-French Generative Language Model Suite},
author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
year={2025},
eprint={2510.25771},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.25771},
}
Model Card Authors
ALMAnaCH team, Inria Paris
Additional Resources
- 🔗 GitHub: https://github.com/NathanGodey/gapetron
- 📄 Paper: https://arxiv.org/abs/2510.25771
- 📊 Datasets:
Acknowledgments
This work was supported by French public research funding and computational resources from national HPC clusters. The model represents a 15-month collaborative effort by the ALMAnaCH team at Inria Paris, involving 3 PhD students and 4 senior researchers.