arxiv:2506.07463

CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Published on Jun 9

· Submitted by

ZacLiu on Jun 10

Upvote

Authors:

Guang Liu ,

Liangdong Wang ,

Yonghua Lin

Abstract

A large-scale bilingual pre-training dataset, CCI4.0, enhances data quality and diverse reasoning patterns for language models, leading to improved performance in downstream tasks like math and code reflection.

AI-generated summary

We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectory. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic and require extensive expert experience and labor to process. So, we propose a novel pipeline justifying data quality mainly based on models through two-stage deduplication, multiclassifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion pieces of CoT(Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing from the distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained in CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on automatically processing pretraining corpora.

View arXiv page View PDF Project page GitHub repository Add to collection

Community

ZacLiu

Paper author Paper submitter 1 day ago

https://huggingface.co/datasets/BAAI/CCI4.0-M2-Base-v1
https://huggingface.co/datasets/BAAI/CCI4.0-M2-CoT-v1
https://huggingface.co/datasets/BAAI/CCI4.0-M2-Extra-v1

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.07463 in a model README.md to link it from this page.

Datasets citing this paper 3

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.07463 in a Space README.md to link it from this page.