PRRC-Cleanliness Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.

Code: https://github.com/opendatalab/Meta-rater

Model Description

This is a 1.3B-parameter decoder-only transformer language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the Cleanliness dimension of the PRRC framework. The training data was curated by selecting text with high Cleanliness scores, favoring well-formatted, complete, and noise-free content.

Model Details

Architecture: Transformer decoder-only
Parameters: 1.345B (1,345,423,360 parameters)
Training Tokens: 30B tokens
Context Window: 1,024 tokens
Vocabulary Size: 32,000 (LLaMA tokenizer)
Data Selection Method: Top-k selection based on Cleanliness scores
Rating Model: ModernBERT-base fine-tuned for Cleanliness assessment

Architecture Specifications

Hidden Dimension: 2,048
Number of Layers: 24
Attention Heads: 16
Key-Value Heads: 16
MLP Ratio: 8/3
Position Encoding: RoPE (base=10,000)
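
The parameter count listed under Model Details can be reproduced from these specifications. The sketch below is a back-of-the-envelope check, not the authors' code. It assumes a LLaMA-style layout that this card does not spell out: no bias terms, a gated MLP with three projection matrices, weight-only normalization layers, untied input and output embeddings, and an MLP width rounded up to a multiple of 128 (the `round_to` parameter is hypothetical). Under those assumptions the total comes out to 1,345,423,360.

```python
def count_parameters(
    vocab_size: int = 32_000,
    hidden: int = 2_048,
    layers: int = 24,
    heads: int = 16,
    kv_heads: int = 16,
    round_to: int = 128,  # assumption: MLP width is rounded up to a multiple of 128
) -> int:
    """Estimate the parameter count of a LLaMA-style decoder (assumed layout)."""
    head_dim = hidden // heads

    # MLP ratio 8/3, rounded up with integer ceiling division (no float error).
    mlp_width = -(-hidden * 8 // (3 * round_to)) * round_to

    # Attention: query and output projections plus key/value projections.
    attention = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim

    # Gated MLP: gate, up, and down projections (assumption).
    mlp = 3 * hidden * mlp_width

    # Two weight-only norm layers per block (assumption).
    norms = 2 * hidden

    per_layer = attention + mlp + norms

    # Untied token embedding and output head, plus one final norm (assumption).
    embeddings = 2 * vocab_size * hidden
    final_norm = hidden

    return layers * per_layer + embeddings + final_norm


total = count_parameters()
print(f"{total:,}")
```

The match with the reported figure suggests, but does not prove, that the assumed layout is the one used.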

Data Selection Criteria

The training data was selected using the Cleanliness rating model, which evaluates:

Correct Formatting: Human-edited appearance without corrupted characters
Appropriate Content: No irrelevant links, advertisements, or spam
Content Completeness: Complete sentences and coherent structure
Structural Integrity: Proper organization and layout
Noise Reduction: Minimal irrelevant or distracting elements

Selected texts typically include:

Well-formatted articles and documents
Clean editorial content
Professional publications
Quality web content without artifacts
Properly structured educational materials
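
Top-k selection by Cleanliness score can be pictured as ranking documents and keeping the highest-rated ones until a token budget is filled. The sketch below is illustrative only: `ScoredDoc`, `select_top_by_score`, the record layout, and the stop-at-first-overflow policy are hypothetical and do not reflect the actual Meta-rater pipeline, which lives in the linked repository. The scores in the example are made-up values.

```python
from typing import List, NamedTuple


class ScoredDoc(NamedTuple):
    text: str
    score: float    # Cleanliness score from a rating model (made-up values here)
    n_tokens: int   # token count of the document


def select_top_by_score(docs: List[ScoredDoc], token_budget: int) -> List[ScoredDoc]:
    """Keep the highest-scoring documents until the token budget would be exceeded."""
    selected: List[ScoredDoc] = []
    used = 0
    for doc in sorted(docs, key=lambda d: d.score, reverse=True):
        if used + doc.n_tokens > token_budget:
            break  # hypothetical policy: stop at the first document that does not fit
        selected.append(doc)
        used += doc.n_tokens
    return selected


corpus = [
    ScoredDoc("a", 4.5, 300),
    ScoredDoc("b", 1.2, 500),
    ScoredDoc("c", 3.9, 400),
    ScoredDoc("d", 2.0, 600),
]
kept = select_top_by_score(corpus, token_budget=800)
print([doc.text for doc in kept])
```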

Training Details

Hardware: 32x NVIDIA A800 GPUs
Global Batch Size: 4,194,304 tokens
Learning Rate: 5e-5
Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
Training Time: ~14 hours
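
A few derived quantities follow from the figures above. The sketch below is plain arithmetic. It assumes that "30B" means exactly 30e9 tokens, that every sequence fills the 1,024-token context window, and that the batch is split evenly across the 32 GPUs; none of these is stated in this card.

```python
import math

TOTAL_TOKENS = 30_000_000_000      # assumption: "30B" taken as exactly 30e9
GLOBAL_BATCH_TOKENS = 4_194_304    # equals 2**22
CONTEXT_WINDOW = 1_024
NUM_GPUS = 32
TRAINING_HOURS = 14                # approximate, per the card

# Sequences per optimizer step, assuming full-length sequences.
sequences_per_batch = GLOBAL_BATCH_TOKENS // CONTEXT_WINDOW

# Sequences handled by each GPU per step, assuming an even split.
sequences_per_gpu = sequences_per_batch // NUM_GPUS

# Optimizer steps needed to consume the full token budget.
optimizer_steps = math.ceil(TOTAL_TOKENS / GLOBAL_BATCH_TOKENS)

# Rough aggregate and per-GPU throughput implied by the training time.
tokens_per_second = TOTAL_TOKENS / (TRAINING_HOURS * 3600)
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

print(sequences_per_batch, sequences_per_gpu, optimizer_steps)
print(round(tokens_per_second), round(tokens_per_second_per_gpu))
```

Since the training time is approximate, the throughput numbers are order-of-magnitude estimates only.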

Performance Results

Downstream Task Performance (Average Accuracy)

General Knowledge: 56.45% (+3.66 points vs Random)
- ARC-Easy: 56.89%
- ARC-Challenge: 27.65%
- SciQ: 84.80%
Commonsense Reasoning: 44.88% (+0.94 points vs Random)
- HellaSwag: 40.34%
- SIQA: 41.97%
- WinoGrande: 52.33%
Reading Comprehension: 30.72% (+0.70 points vs Random)
- RACE: 30.24%
- OpenbookQA: 31.20%
Overall Average: 45.68% (+1.90 points vs Random)
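
The group averages and the overall figure can be recomputed from the per-task scores. The short check below uses only the numbers listed above. It shows that the overall average is the mean over all eight tasks, not the mean of the three group averages.

```python
from statistics import mean

# Per-task accuracy (%) as reported in this card.
scores = {
    "General Knowledge": {"ARC-Easy": 56.89, "ARC-Challenge": 27.65, "SciQ": 84.80},
    "Commonsense Reasoning": {"HellaSwag": 40.34, "SIQA": 41.97, "WinoGrande": 52.33},
    "Reading Comprehension": {"RACE": 30.24, "OpenbookQA": 31.20},
}

# Average within each task group.
group_avg = {group: mean(tasks.values()) for group, tasks in scores.items()}

# Average over all eight tasks (matches the reported overall figure).
overall = mean(s for tasks in scores.values() for s in tasks.values())

# Mean of the three group averages, shown for contrast.
mean_of_groups = mean(group_avg.values())

for group, value in group_avg.items():
    print(f"{group}: {value:.2f}")
print(f"Overall (8 tasks): {overall:.2f}")
print(f"Mean of group averages: {mean_of_groups:.2f}")
```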

Key Findings

Strong General Knowledge: The largest gain over the Random baseline is on knowledge-based tasks (+3.66 points in accuracy)
Formatting Benefits: Clean, well-structured training data improves model output quality
Noise Reduction: Elimination of web artifacts and spam improves learning efficiency
Structural Quality: Better understanding of proper text organization and flow

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-cleanliness"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text (particularly good for clean, well-formatted content)
prompt = "Here are the key points to consider:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,  # passes input_ids together with the attention mask
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Applications

This model is particularly well-suited for:

Content generation requiring clean formatting
Document creation and professional writing
Web content development without artifacts
Educational materials with proper structure
Clean text processing applications
Data preprocessing and cleaning tasks
Quality content creation for publications

Strengths

Generates well-formatted and clean text output
Strong performance on knowledge-intensive tasks
Reduced likelihood of producing noisy or corrupted text
Better understanding of proper document structure
Enhanced ability to maintain content organization
Improved resistance to format-related errors

Limitations

May prioritize format over content depth in some cases
Could be overly conservative in text generation
Limited context window (1,024 tokens)
No instruction tuning or safety alignment
May avoid creative formatting that could be beneficial
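
Because the context window is 1,024 tokens, the prompt and the generated continuation have to share that budget. The helper below is a hypothetical sketch: the name `fit_to_context` and the policy of keeping the most recent prompt tokens are assumptions, not part of this repository. It works on a plain list of token IDs, so it does not depend on any particular tokenizer.

```python
from typing import List, Tuple


def fit_to_context(
    prompt_ids: List[int],
    new_tokens: int,
    context_window: int = 1024,
) -> Tuple[List[int], int]:
    """Trim a prompt so that prompt plus generated tokens fit in the context window.

    Keeps the most recent prompt tokens (assumed policy) and returns the
    trimmed prompt together with the number of tokens left for generation.
    """
    if not 1 <= new_tokens < context_window:
        raise ValueError("new_tokens must be between 1 and context_window - 1")

    # Room left for the prompt once the generation budget is reserved.
    max_prompt_len = context_window - new_tokens

    # Slicing from the end drops the oldest tokens; short prompts pass through.
    trimmed = prompt_ids[-max_prompt_len:]
    return trimmed, new_tokens


long_prompt = list(range(2000))  # stand-in for real token IDs
trimmed, budget = fit_to_context(long_prompt, new_tokens=100)
print(len(trimmed), budget)
```

With a real tokenizer, the trimmed IDs would be passed to the model and `budget` used as `max_new_tokens` in `generate`.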

Data Quality Impact

This model demonstrates the importance of clean training data:

Artifact Removal: Training on clean data reduces model exposure to web scraping artifacts
Structural Learning: Well-formatted input leads to better-structured output
Noise Resistance: Lower exposure to irrelevant content improves focus
Professional Standards: Training on quality content improves output professionalism

Comparison with Baselines

vs Random Baseline: +1.90 points overall, with the strongest gain in General Knowledge (+3.66 points)
vs Other PRRC Dimensions: Competitive performance with focus on content quality
vs Meta-rater All (25): Demonstrates the individual contribution of data cleanliness

Quality Characteristics

This model excels at producing:

Clean Formatting: Proper structure and organization
Complete Content: Full sentences and coherent paragraphs
Professional Appearance: Business and academic writing standards
Artifact-Free Text: No web scraping remnants or corrupted characters
Consistent Structure: Logical flow and proper segmentation

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
