Dataset

#78
by hakim1510 - opened

The paper reads, "Both ModernBERT models are trained on 2 trillion tokens of primarily English data from a variety of data sources, including web documents, code, and scientific literature, following common modern data mixtures. We choose the final data mixture based on a series of ablations."

Is the dataset publicly released somewhere? Or is there any description of which prior datasets were incorporated?
