FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26 • 69
How Programming Concepts and Neurons Are Shared in Code Language Models Paper • 2506.01074 • Published Jun 1 • 3
Tracing Multilingual Factual Knowledge Acquisition in Pretraining Paper • 2505.14824 • Published May 20 • 4
On Relation-Specific Neurons in Large Language Models Paper • 2502.17355 • Published Feb 24 • 9
How Transliterations Improve Crosslingual Alignment Paper • 2409.17326 • Published Sep 25, 2024 • 1
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages Paper • 2410.23825 • Published Oct 31, 2024 • 4
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment Paper • 2410.05873 • Published Oct 8, 2024 • 3