Random Baseline Language Model (1.3B Parameters, 30B Tokens)
This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 1.3B parameter transformer-based decoder-only language model trained from scratch on 30B tokens randomly sampled from the SlimPajama dataset. It serves as a baseline for comparing data selection methods in the Meta-rater research.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 1.345B (1,345,423,360 parameters)
- Training Tokens: 30B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Training Data: Randomly sampled from the SlimPajama dataset
- Domain Distribution: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)
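For reference, these fixed proportions imply an approximate per-domain token budget within the 30B-token mix. The short sketch below simply applies the percentages listed above; the variable names are illustrative and not taken from the training code:

# Approximate per-domain token budget for the 30B-token random sample,
# derived from the fixed SlimPajama domain proportions listed above.
domain_shares = {
    "CommonCrawl": 0.522,
    "C4": 0.267,
    "GitHub": 0.052,
    "Books": 0.042,
    "ArXiv": 0.046,
    "Wikipedia": 0.038,
    "StackExchange": 0.033,
}

total_tokens = 30_000_000_000  # 30B training tokens

for domain, share in domain_shares.items():
    # e.g. CommonCrawl: ~15.66B tokens
    print(f"{domain}: ~{share * total_tokens / 1e9:.2f}B tokens")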
Architecture Specifications
- Hidden Dimension: 2,048
- Number of Layers: 24
- Attention Heads: 16
- Key-Value Heads: 16
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
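These specifications describe a standard LLaMA-style decoder. As a rough sketch, an equivalent Hugging Face LlamaConfig is shown below; the intermediate (MLP) size is assumed to be 8/3 × 2,048 rounded to 5,504, and the released checkpoint's config may round it differently:

from transformers import LlamaConfig

# Illustrative config matching the specifications above (not the official
# training config). intermediate_size assumes ~8/3 * hidden_size, rounded
# to 5,504; the released checkpoint may use a different rounding.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_048,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,
    intermediate_size=5_504,
    max_position_embeddings=1_024,
    rope_theta=10_000.0,
)
print(config)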
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~14 hours
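Taken together, the batch and context settings imply about 4,096 sequences per optimizer step and roughly 7,150 steps over the 30B-token budget. The back-of-the-envelope calculation below works this out from the figures listed above; the per-GPU split assumes no gradient accumulation:

# Back-of-the-envelope training arithmetic from the figures above.
global_batch_tokens = 4_194_304   # tokens per optimizer step (2**22)
context_window = 1_024            # tokens per sequence
total_tokens = 30_000_000_000     # total training budget
num_gpus = 32                     # NVIDIA A800s

sequences_per_step = global_batch_tokens // context_window  # 4,096
total_steps = total_tokens / global_batch_tokens            # ~7,153
sequences_per_gpu = sequences_per_step // num_gpus          # 128, assuming no gradient accumulation

print(f"sequences per step: {sequences_per_step}")
print(f"approx. optimizer steps: {total_steps:,.0f}")
print(f"sequences per GPU per step: {sequences_per_gpu}")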
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 52.79%
- ARC-Easy: 51.05%
- ARC-Challenge: 23.81%
- SciQ: 83.50%
Commonsense Reasoning: 43.94%
- HellaSwag: 39.69%
- SIQA: 40.28%
- WinoGrande: 51.85%
Reading Comprehension: 30.02%
- RACE: 30.43%
- OpenbookQA: 29.60%
Overall Average: 43.78%
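The category scores above are unweighted means of the listed task accuracies, and the overall average is the unweighted mean over all eight tasks. The snippet below reproduces that arithmetic from the reported numbers:

# Reproduce the reported averages from the per-task accuracies above.
results = {
    "General Knowledge": {"ARC-Easy": 51.05, "ARC-Challenge": 23.81, "SciQ": 83.50},
    "Commonsense Reasoning": {"HellaSwag": 39.69, "SIQA": 40.28, "WinoGrande": 51.85},
    "Reading Comprehension": {"RACE": 30.43, "OpenbookQA": 29.60},
}

all_scores = []
for category, tasks in results.items():
    scores = list(tasks.values())
    all_scores.extend(scores)
    print(f"{category}: {sum(scores) / len(scores):.2f}%")

# Overall average across all eight tasks
print(f"Overall Average: {sum(all_scores) / len(all_scores):.2f}%")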
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Research Context
This model serves as a crucial baseline in the Meta-rater research, demonstrating the performance achievable with random data selection. Key findings:
- Convergence Speed: Models trained with Meta-rater data selection reach the performance of this 30B-token baseline using only 15B tokens
- Efficiency: With the same 30B-token budget, Meta-rater models outperform this baseline by 3.23% in average accuracy
- Token Efficiency: This model requires 60B tokens to match the performance of Meta-rater models trained on 30B tokens
Applications
This model can be used for:
- Baseline comparisons in data selection research
- General language modeling tasks
- Research on training efficiency and data quality
- Educational purposes for understanding transformer training
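For baseline comparisons in particular, held-out perplexity is a common complement to the downstream accuracies reported above. The following is a minimal sketch using the standard Hugging Face API; the example text and setup are illustrative, not taken from the original evaluation:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal perplexity check on a short text; for real comparisons use a
# proper held-out corpus and the model's full 1,024-token context.
model_name = "opendatalab/meta-rater-1b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Language models are trained to predict the next token."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")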
Limitations
- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Performance lower than models trained with curated data selection
Citation
If you use this model in your research, please cite:
@article{zhuang2025meta,
title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
journal={arXiv preprint arXiv:2504.14194},
year={2025}
}
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.