Dataset (#78), opened by hakim1510
The paper reads, "Both ModernBERT models are trained on 2 trillion tokens of primarily English data from a variety of data sources, including web documents, code, and scientific literature, following common modern data mixtures. We choose the final data mixture based on a series of ablations."
Is the dataset publicly released somewhere? Or is there any description of which prior datasets were incorporated?