Random Baseline Language Model (3.3B Parameters, 100B Tokens)

This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.

Code: https://github.com/opendatalab/Meta-rater

Model Description

This is a 3.3B parameter transformer-based decoder-only language model trained from scratch on 100B tokens randomly sampled from the SlimPajama dataset. It serves as a scaling baseline for comparing data selection methods in the Meta-rater research, showing how performance changes with increased model size and training budget.

Model Details

  • Architecture: Transformer decoder-only
  • Parameters: 3.3B (3,335,989,760 parameters)
  • Training Tokens: 100B tokens
  • Context Window: 1,024 tokens
  • Vocabulary Size: 32,000 (LLaMA tokenizer)
  • Training Data: Randomly sampled from the SlimPajama dataset
  • Domain Distribution: Fixed proportion across all domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)
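
For illustration, the fixed-proportion sampling behind this random baseline can be sketched in a few lines. This is not the authors' pipeline; the weights are simply the domain proportions listed above, and the helper name is hypothetical.

import random

# Domain proportions of the training mix (from the Model Details above)
DOMAIN_WEIGHTS = {
    "CommonCrawl": 0.522, "C4": 0.267, "GitHub": 0.052, "Books": 0.042,
    "ArXiv": 0.046, "Wikipedia": 0.038, "StackExchange": 0.033,
}

def sample_domain(rng: random.Random) -> str:
    """Draw a domain with probability equal to its share of the training mix."""
    domains, weights = zip(*DOMAIN_WEIGHTS.items())
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {d: 0 for d in DOMAIN_WEIGHTS}
for _ in range(100_000):
    counts[sample_domain(rng)] += 1
print(counts)  # empirical counts should track the fixed proportions above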

Architecture Specifications

  • Hidden Dimension: 2,560
  • Number of Layers: 40
  • Attention Heads: 20
  • Key-Value Heads: 20
  • MLP Ratio: 8/3
  • Position Encoding: RoPE (base=10,000)
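
As a sanity check, these numbers can be written down as an equivalent decoder-only configuration. The sketch below uses a transformers LlamaConfig purely for illustration (the released checkpoint ships its own internlm-style modeling code), and the intermediate_size rounding is an assumption derived from the 8/3 MLP ratio.

from transformers import LlamaConfig

# Illustrative config reproducing the shape above. Assumption: LLaMA-style blocks
# with intermediate_size ~ 8/3 x hidden, rounded up to a multiple of 256.
config = LlamaConfig(
    vocab_size=32_000,              # LLaMA tokenizer
    hidden_size=2_560,
    num_hidden_layers=40,
    num_attention_heads=20,
    num_key_value_heads=20,         # full multi-head attention, no GQA
    intermediate_size=6_912,        # assumed: (8/3) * 2560 ≈ 6827, rounded to 6912
    max_position_embeddings=1_024,  # 1,024-token context window
    rope_theta=10_000.0,            # RoPE base
)

# Back-of-the-envelope parameter count from the dimensions (ignoring norms and biases):
h, L, f, V = (config.hidden_size, config.num_hidden_layers,
              config.intermediate_size, config.vocab_size)
approx_params = L * (4 * h * h + 3 * h * f) + 2 * V * h  # attention + gated MLP + embeddings/head
print(f"{approx_params / 1e9:.2f}B parameters")  # ≈ 3.34B, consistent with the reported total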

Training Details

  • Hardware: 32x NVIDIA A800 GPUs
  • Global Batch Size: 4,194,304 tokens
  • Learning Rate: 5e-5
  • Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
  • Training Time: ~129 hours
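
These figures imply the step count directly. A quick back-of-the-envelope check, assuming every batch is fully packed to the 1,024-token context:

tokens_per_batch = 4_194_304     # global batch size, in tokens
seq_len = 1_024                  # context window
total_tokens = 100 * 10**9       # 100B-token training budget

sequences_per_step = tokens_per_batch // seq_len  # 4,096 packed sequences per optimizer step
total_steps = total_tokens / tokens_per_batch     # ≈ 23,842 optimizer steps to consume 100B tokens
print(sequences_per_step, round(total_steps))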

Performance Results

Downstream Task Performance (Average Accuracy)

  • General Knowledge: 64.22%

    • ARC-Easy: 66.33%
    • ARC-Challenge: 33.53%
    • SciQ: 92.80%
  • Commonsense Reasoning: 53.55%

    • HellaSwag: 57.35%
    • SIQA: 43.71%
    • WinoGrande: 59.59%
  • Reading Comprehension: 35.28%

    • RACE: 34.35%
    • OpenbookQA: 36.20%
  • Overall Average: 52.98%
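
Each category score above is the unweighted mean of its task accuracies, and the overall average is the unweighted mean over all eight tasks (not the mean of the three category scores). This can be checked directly:

scores = {
    "General Knowledge": {"ARC-Easy": 66.33, "ARC-Challenge": 33.53, "SciQ": 92.80},
    "Commonsense Reasoning": {"HellaSwag": 57.35, "SIQA": 43.71, "WinoGrande": 59.59},
    "Reading Comprehension": {"RACE": 34.35, "OpenbookQA": 36.20},
}

for category, tasks in scores.items():
    print(category, round(sum(tasks.values()) / len(tasks), 2))  # ≈ 64.22 / 53.55 / 35.28

all_tasks = [acc for tasks in scores.values() for acc in tasks.values()]
print("Overall", round(sum(all_tasks) / len(all_tasks), 2))      # ≈ 52.98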

Knowledge-Intensive Tasks

  • MMLU: 25.48%
  • NaturalQuestions: 6.28%

Scaling Improvements

Compared to the 1.3B random baseline (30B tokens):

  • General Knowledge: +11.43% (52.79% → 64.22%)
  • Commonsense Reasoning: +9.61% (43.94% → 53.55%)
  • Reading Comprehension: +5.26% (30.02% → 35.28%)
  • Overall Average: +9.20% (43.78% → 52.98%)

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer (the repository ships custom modeling code, so trust_remote_code is needed)
model_name = "opendatalab/meta-rater-3b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,                      # pass input_ids and attention_mask together
        max_length=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Research Context

This model serves as a crucial scaling baseline in the Meta-rater research:

  • Scale Validation: Demonstrates that data selection benefits persist at larger scales
  • Efficiency Comparison: Meta-rater models show consistent advantages even with increased parameters
  • Performance Reference: Establishes what random data selection achieves at this scale, the reference point against which Meta-rater's gains are measured

Key Scaling Findings

  • Data Selection Benefits Persist: Meta-rater maintains advantages at 3.3B scale
  • Improved Absolute Performance: Substantial gains from increased model size
  • Knowledge Tasks: Particularly strong improvements in knowledge-intensive evaluations
  • Efficiency Gains: Meta-rater still provides meaningful improvements over random selection

Applications

This model can be used for:

  • Scaling research and baseline comparisons
  • General language modeling with improved capabilities
  • Research on training efficiency at larger scales
  • Educational purposes for understanding scale effects
  • Benchmark establishment for 3.3B parameter models

Strengths

  • Significantly improved performance over smaller baselines
  • Strong knowledge retention and reasoning capabilities
  • Robust performance across diverse task categories
  • Valuable reference point for scaling experiments

Limitations

  • Trained on randomly selected data without quality filtering
  • Limited context window (1,024 tokens)
  • No instruction tuning or safety alignment
  • High computational requirements for training
  • Performance still lower than models trained with curated data selection

Comparison with Meta-rater

When compared to the equivalent Meta-rater 3.3B model:

  • Overall Performance Gap: 54.71% (Meta-rater) vs 52.98% (Random) = +1.73%
  • General Knowledge: 67.51% vs 64.22% = +3.29%
  • Efficiency: Meta-rater achieves better performance with the same computational resources

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
