Qwen/Qwen2-Math-7B-Instruct · Inquiry on the Composition of the Pre-training Dataset for Qwen2-Math-7B-Instruct and How to Replicate It

I am interested in understanding the composition of the pre-training dataset used for Qwen2-Math-7B-Instruct. Specifically, I would like to know:

What are the primary sources or types of datasets that constitute the pre-training corpus for Qwen2-Math-7B-Instruct, and what are their approximate proportions within the overall corpus?
Are there any specific filtering or processing steps applied to these datasets before they are used for training?
Is there a guide or documentation available on how to replicate the creation of this pre-training dataset?

Any insights, references, or guidance on where to find more information about this topic would be greatly appreciated.