Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models Paper • 2505.22232 • Published May 28
Fineweb2-Classifier Collection Training datasets for the fineweb2 classifier. • 5 items • Updated Apr 11
Test_Data_fineweb-edu Collection Test Data sampled from fineweb-edu and annotated by humans • 3 items • Updated Apr 11
GPT-SW3: An Autoregressive Language Model for the Nordic Languages Paper • 2305.12987 • Published May 22, 2023