Dataset (#78), opened by hakim1510
The paper reads, "Both ModernBERT models are trained on 2 trillion tokens of primarily English data from a variety of data sources, including web documents, code, and scientific literature, following common modern data mixtures. We choose the final data mixture based on a series of ablations."
Is the dataset publicly released somewhere? Or is there any description of which prior datasets were incorporated?