Random Baseline Language Model (3.3B Parameters, 100B Tokens)
This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 3.3B parameter transformer-based decoder-only language model trained from scratch on 100B tokens randomly sampled from the SlimPajama dataset. It serves as a scaling baseline for comparing data selection methods in the Meta-rater research, showing how random-selection performance changes with increased model size and training data.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 3.3B (3,335,989,760 parameters)
- Training Tokens: 100B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Training Data: Randomly sampled from the SlimPajama dataset
- Domain Distribution: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)
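The fixed mixture above can be written down directly as sampling weights. The snippet below is a minimal illustrative sketch of fixed-ratio domain sampling, not the authors' actual data pipeline; only the proportions are taken from the list above.

```python
import random

# SlimPajama domain proportions for the random baseline (from the list above).
DOMAIN_WEIGHTS = {
    "CommonCrawl": 0.522,
    "C4": 0.267,
    "GitHub": 0.052,
    "Books": 0.042,
    "ArXiv": 0.046,
    "Wikipedia": 0.038,
    "StackExchange": 0.033,
}
assert abs(sum(DOMAIN_WEIGHTS.values()) - 1.0) < 1e-9  # proportions sum to 100%

def sample_domain(rng: random.Random) -> str:
    """Pick a source domain according to the fixed mixture (illustrative only)."""
    domains = list(DOMAIN_WEIGHTS)
    weights = list(DOMAIN_WEIGHTS.values())
    return rng.choices(domains, weights=weights, k=1)[0]

print(sample_domain(random.Random(0)))
```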
Architecture Specifications
- Hidden Dimension: 2,560
- Number of Layers: 40
- Attention Heads: 20
- Key-Value Heads: 20
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
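Assuming a LLaMA-style implementation (consistent with the LLaMA tokenizer and RoPE above), the specification maps onto a Hugging Face `LlamaConfig` roughly as follows. The `intermediate_size` of 6,912 is not stated in this card; it is inferred from the 8/3 MLP ratio (2,560 × 8/3, rounded up to a multiple of 256) and, together with untied embeddings, reproduces the stated 3,335,989,760 parameter count.

```python
from transformers import LlamaConfig

# Sketch of the 3.3B architecture as a LlamaConfig. intermediate_size=6912 is an
# inferred value (2560 * 8/3 rounded up to a multiple of 256), not stated in the card.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_560,
    num_hidden_layers=40,
    num_attention_heads=20,
    num_key_value_heads=20,          # full multi-head attention (no GQA)
    intermediate_size=6_912,         # assumption; see note above
    max_position_embeddings=1_024,   # 1,024-token context window
    rope_theta=10_000.0,             # RoPE base
    tie_word_embeddings=False,       # untied input/output embeddings
)

# Analytic parameter count for this config (LLaMA layers have no biases):
h, ffn, L, V = config.hidden_size, config.intermediate_size, config.num_hidden_layers, config.vocab_size
per_layer = 4 * h * h + 3 * h * ffn + 2 * h   # attention + SwiGLU MLP + 2 RMSNorms
total = L * per_layer + 2 * V * h + h         # layers + embeddings/lm_head + final norm
print(f"{total:,}")                           # 3,335,989,760
```

Instantiating `LlamaForCausalLM(config)` and calling `num_parameters()` should reproduce the same total.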
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~129 hours
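These figures imply roughly 4,096 packed sequences per optimizer step and about 24k steps in total. A quick back-of-the-envelope check (assuming 1,024-token packed sequences):

```python
# Back-of-the-envelope training arithmetic from the figures above.
tokens_per_step = 4_194_304            # global batch size in tokens
context_len = 1_024                    # context window
total_tokens = 100_000_000_000         # 100B training tokens

sequences_per_step = tokens_per_step // context_len   # 4,096 packed sequences per step
total_steps = total_tokens / tokens_per_step          # ≈ 23,842 optimizer steps

print(f"{sequences_per_step} sequences/step, ~{total_steps:,.0f} optimizer steps")
```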
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 64.22%
- ARC-Easy: 66.33%
- ARC-Challenge: 33.53%
- SciQ: 92.80%
Commonsense Reasoning: 53.55%
- HellaSwag: 57.35%
- SIQA: 43.71%
- WinoGrande: 59.59%
Reading Comprehension: 35.28%
- RACE: 34.35%
- OpenbookQA: 36.20%
Overall Average: 52.98%
Knowledge-Intensive Tasks
- MMLU: 25.48%
- NaturalQuestions: 6.28%
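The category scores above are unweighted means of the per-task accuracies, and the overall average reproduces as the mean of the eight tasks listed under the three categories (MMLU and NaturalQuestions are reported separately and do not appear to be included). A short check:

```python
# Reproduce the reported category averages as unweighted means of the task accuracies.
general = [66.33, 33.53, 92.80]       # ARC-Easy, ARC-Challenge, SciQ
commonsense = [57.35, 43.71, 59.59]   # HellaSwag, SIQA, WinoGrande
reading = [34.35, 36.20]              # RACE, OpenbookQA

def mean(xs):
    return sum(xs) / len(xs)

print(f"General Knowledge: {mean(general):.4f}")                          # 64.2200
print(f"Commonsense:       {mean(commonsense):.4f}")                      # 53.5500
print(f"Reading:           {mean(reading):.4f}")                          # 35.2750 -> 35.28
print(f"Overall (8 tasks): {mean(general + commonsense + reading):.4f}")  # 52.9825 -> 52.98
```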
Scaling Improvements
Compared to the 1.3B random baseline (30B tokens), absolute gains in percentage points:
- General Knowledge: +11.43 (52.79% → 64.22%)
- Commonsense Reasoning: +9.61 (43.94% → 53.55%)
- Reading Comprehension: +5.26 (30.02% → 35.28%)
- Overall Average: +9.20 (43.78% → 52.98%)
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-3b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
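Because the model is primarily a baseline for comparisons, scoring text is often more useful than free-form generation. A minimal sketch of loss/perplexity computation, reusing the `model` and `tokenizer` loaded above (the example sentence is arbitrary):

```python
import torch

# Score a text with the model and tokenizer loaded above; lower loss = more likely.
text = "Large language models are trained on web-scale text corpora."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, labels=enc.input_ids)   # Hugging Face shifts labels internally

print(f"mean cross-entropy: {out.loss.item():.3f}")
print(f"perplexity:         {torch.exp(out.loss).item():.1f}")
```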
Research Context
This model serves as a crucial scaling baseline in the Meta-rater research:
- Scale Validation: Demonstrates that data selection benefits persist at larger scales
- Efficiency Comparison: Meta-rater models show consistent advantages even with increased parameters
- Performance Ceiling: Establishes how far random selection alone reaches at this scale, the reference point that data selection methods must beat
Key Scaling Findings
- Data Selection Benefits Persist: Meta-rater maintains advantages at 3.3B scale
- Improved Absolute Performance: Substantial gains from increased model size
- Knowledge Tasks: Particularly strong improvements in knowledge-intensive evaluations
- Efficiency Gains: Meta-rater still provides meaningful improvements over random selection
Applications
This model can be used for:
- Scaling research and baseline comparisons
- General language modeling with improved capabilities
- Research on training efficiency at larger scales
- Educational purposes for understanding scale effects
- Benchmark establishment for 3.3B parameter models
Strengths
- Significantly improved performance over smaller baselines
- Strong knowledge retention and reasoning capabilities
- Robust performance across diverse task categories
- Valuable reference point for scaling experiments
Limitations
- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens); longer inputs must be truncated (see the sketch after this list)
- No instruction tuning or safety alignment
- High computational requirements for training
- Performance still lower than models trained with curated data selection
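Given the 1,024-token context window, longer inputs should be truncated at tokenization time. A minimal sketch (the repeated sentence is just a stand-in for a long document):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("opendatalab/meta-rater-3b-random")

# Inputs longer than the context window must be truncated to 1,024 tokens.
long_text = "The quick brown fox jumps over the lazy dog. " * 500
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    truncation=True,
    max_length=1024,   # matches the model's 1,024-token context window
)
print(inputs.input_ids.shape)   # torch.Size([1, 1024])
```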
Comparison with Meta-rater
Compared to the equivalent Meta-rater 3.3B model:
- Overall Average: 54.71% (Meta-rater) vs. 52.98% (random) = +1.73 percentage points for Meta-rater
- General Knowledge: 67.51% vs. 64.22% = +3.29 percentage points
- Efficiency: Meta-rater achieves better performance with the same computational resources
Citation
If you use this model in your research, please cite:
```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.