CCI4.0-ZH-HQ-Classifiers

Overview

CCI4.0-ZH-HQ-Classifiers is our model-based quality labeling system designed to score the quality of the CCI4.0 Chinese corpus. Considering that a single classifier has limited recall when identifying high-quality pretraining documents, we follow a similar approach to Nemotron-CC’s treatment of English Common Crawl by building three separate quality classifiers to re-evaluate the Chinese pretraining data.

Quality Classifier Training

We used two large language models to annotate the quality of Chinese samples, constructing two separate 460K-sample training sets based on Qwen2.5-72B and DeepSeek-V3, respectively. These datasets were then used to fine-tune two distinct quality classifiers. Additionally, we employed a fastText-based classifier trained on a combination of instruction-formatted data and high-scoring posts selected from the Chinese corpus.
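As a rough illustration of the fastText branch, the sketch below prepares training samples in fastText's supervised `__label__` format. The label names (`hq`/`lq`), the example texts, and the mixing of instruction-formatted positives with ordinary corpus negatives are assumptions for illustration, not the released training recipe.

```python
# Hedged sketch: formatting samples for fastText supervised training.
# Label names and sample texts are hypothetical, not from the actual pipeline.

def to_fasttext_line(text: str, label: str) -> str:
    """Format one sample as '__label__<label> <single-line text>'."""
    clean = " ".join(text.split())  # fastText expects one sample per line
    return f"__label__{label} {clean}"

# Hypothetical positives (instruction-formatted data) and negatives (web text)
positives = ["请解释什么是机器学习。机器学习是让计算机从数据中学习规律的方法。"]
negatives = ["点击这里 免费下载 低质量网页文本"]

lines = [to_fasttext_line(t, "hq") for t in positives] + \
        [to_fasttext_line(t, "lq") for t in negatives]

# A file of such lines could then be passed to fasttext.train_supervised().
print(lines[0][:12])
```

Writing one labeled sample per line is what `fasttext.train_supervised` expects; the trained model then scores each document by the probability of the high-quality label.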

Quality Scoring and Bucketing

Following a similar approach to Nemotron-CC, we first use each of the three classifiers to predict quality scores for all documents in the corpus. For each classifier, we rank the documents by their predicted scores and discretize the results into integer buckets ranging from 0 to 19, with each bucket representing approximately 5% of the data. Bucket 19 corresponds to the top 5% of highest-quality documents. To obtain the final quality score for each document, we ensemble the integer scores from the three classifiers using a maximum aggregation strategy.
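The bucketing and ensembling steps above can be sketched in a few lines. Function and variable names here are illustrative, not from the released code; a production version would operate on score files rather than in-memory lists.

```python
# Hedged sketch of rank-based bucketing (0-19) and max-aggregation ensembling.

def to_buckets(scores, n_buckets=20):
    """Rank documents by predicted score and assign integer buckets
    0..n_buckets-1, each covering ~1/n_buckets of the corpus.
    Bucket 19 holds the top ~5% highest-scoring documents."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    buckets = [0] * len(scores)
    for rank, i in enumerate(order):
        buckets[i] = rank * n_buckets // len(scores)
    return buckets

def ensemble_max(*bucket_lists):
    """Final per-document score = max over the classifiers' buckets."""
    return [max(bs) for bs in zip(*bucket_lists)]

# Toy corpus of 4 documents scored by two (of the three) classifiers:
clf_a = to_buckets([0.1, 0.9, 0.5, 0.7])  # -> [0, 15, 5, 10]
clf_b = to_buckets([0.8, 0.2, 0.6, 0.4])  # -> [15, 0, 10, 5]
final = ensemble_max(clf_a, clf_b)        # -> [15, 15, 10, 10]
```

Taking the maximum rather than the mean keeps a document if any one classifier rates it highly, which raises recall of high-quality documents at the cost of some precision.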

Quality Labeling

To assign quality labels that better reflect the actual impact of data on downstream performance, we further evaluated each score bucket through pretraining. Specifically, we pretrained a 1B-parameter dense model on 100B tokens sampled from each bucket and measured its downstream task performance. The evaluation results show that the downstream performance trends are consistent with the classifier-based quality score rankings, validating the effectiveness of the quality labeling.

Figure: Curve of Chinese evaluation metrics across quality score buckets.

Usage

  import torch
  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  # model_dir: local path or hub id of one of the quality classifiers
  model = AutoModelForSequenceClassification.from_pretrained(
    model_dir,
    trust_remote_code=False,
    ignore_mismatched_sizes=False,
  )
  model.cuda()
  model.eval()

  tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    use_fast=True,
    token=None,
    trust_remote_code=False,
  )

  # Tokenize a single document (truncated to 512 tokens) and move it to GPU.
  result = tokenizer(
    [sentence],
    padding=False,
    max_length=512,
    truncation=True,
    return_tensors="pt",
  ).to("cuda")

  # return_tensors="pt" already yields torch tensors, so no further
  # conversion is needed before the forward pass.
  with torch.no_grad():
    model_out = model(**result)
  pred_score = float(model_out.logits[0][0])

Citation

Please cite using:

@dataset{cci4_m2_v1,
  title={CCI4.0-M2 v1 Dataset Collection},
  author={OpenSeek Team},
  year={2025},
  publisher={Beijing Academy of Artificial Intelligence}
}