Random Baseline Language Model (3.3B Parameters, 100B Tokens)
This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 3.3B parameter transformer-based decoder-only language model trained from scratch on 100B tokens randomly sampled from the SlimPajama dataset. It serves as a scaling baseline for comparing data selection methods in the Meta-rater research, showing how random-selection performance changes with increased model size and training data.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 3.3B (3,335,989,760 parameters)
- Training Tokens: 100B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Training Data: Randomly sampled from the SlimPajama dataset
- Domain Distribution: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)
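The fixed mixture above can be written down directly as sampling weights. The snippet below is a minimal illustrative sketch of fixed-ratio domain sampling, not the authors' actual data pipeline; only the proportions are taken from the list above.

```python
import random

# SlimPajama domain proportions for the random baseline (from the list above).
DOMAIN_WEIGHTS = {
    "CommonCrawl": 0.522,
    "C4": 0.267,
    "GitHub": 0.052,
    "Books": 0.042,
    "ArXiv": 0.046,
    "Wikipedia": 0.038,
    "StackExchange": 0.033,
}
assert abs(sum(DOMAIN_WEIGHTS.values()) - 1.0) < 1e-9  # proportions sum to 100%

def sample_domain(rng: random.Random) -> str:
    """Pick a source domain according to the fixed mixture (illustrative only)."""
    domains = list(DOMAIN_WEIGHTS)
    weights = list(DOMAIN_WEIGHTS.values())
    return rng.choices(domains, weights=weights, k=1)[0]

print(sample_domain(random.Random(0)))
```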
Architecture Specifications
- Hidden Dimension: 2,560
- Number of Layers: 40
- Attention Heads: 20
- Key-Value Heads: 20
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
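Assuming a LLaMA-style implementation (consistent with the LLaMA tokenizer and RoPE above), the specification maps onto a Hugging Face `LlamaConfig` roughly as follows. The `intermediate_size` of 6,912 is not stated in this card; it is inferred from the 8/3 MLP ratio (2,560 × 8/3, rounded up to a multiple of 256) and, together with untied embeddings, reproduces the stated 3,335,989,760 parameter count.

```python
from transformers import LlamaConfig

# Sketch of the 3.3B architecture as a LlamaConfig. intermediate_size=6912 is an
# inferred value (2560 * 8/3 rounded up to a multiple of 256), not stated in the card.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_560,
    num_hidden_layers=40,
    num_attention_heads=20,
    num_key_value_heads=20,          # full multi-head attention (no GQA)
    intermediate_size=6_912,         # assumption; see note above
    max_position_embeddings=1_024,   # 1,024-token context window
    rope_theta=10_000.0,             # RoPE base
    tie_word_embeddings=False,       # untied input/output embeddings
)

# Analytic parameter count for this config (LLaMA layers have no biases):
h, ffn, L, V = config.hidden_size, config.intermediate_size, config.num_hidden_layers, config.vocab_size
per_layer = 4 * h * h + 3 * h * ffn + 2 * h   # attention + SwiGLU MLP + 2 RMSNorms
total = L * per_layer + 2 * V * h + h         # layers + embeddings/lm_head + final norm
print(f"{total:,}")                           # 3,335,989,760
```

Instantiating `LlamaForCausalLM(config)` and calling `num_parameters()` should reproduce the same total.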
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~129 hours
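These figures imply roughly 4,096 packed sequences per optimizer step and about 24k steps in total. A quick back-of-the-envelope check (assuming 1,024-token packed sequences):

```python
# Back-of-the-envelope training arithmetic from the figures above.
tokens_per_step = 4_194_304            # global batch size in tokens
context_len = 1_024                    # context window
total_tokens = 100_000_000_000         # 100B training tokens

sequences_per_step = tokens_per_step // context_len   # 4,096 packed sequences per step
total_steps = total_tokens / tokens_per_step          # ≈ 23,842 optimizer steps

print(f"{sequences_per_step} sequences/step, ~{total_steps:,.0f} optimizer steps")
```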
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 64.22%
- ARC-Easy: 66.33%
- ARC-Challenge: 33.53%
- SciQ: 92.80%
Commonsense Reasoning: 53.55%
- HellaSwag: 57.35%
- SIQA: 43.71%
- WinoGrande: 59.59%
Reading Comprehension: 35.28%
- RACE: 34.35%
- OpenbookQA: 36.20%
Overall Average: 52.98%
Knowledge-Intensive Tasks
- MMLU: 25.48%
- NaturalQuestions: 6.28%
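The category scores above are unweighted means of the per-task accuracies, and the overall average reproduces as the mean of the eight tasks listed under the three categories (MMLU and NaturalQuestions are reported separately and do not appear to be included). A short check:

```python
# Reproduce the reported category averages as unweighted means of the task accuracies.
general = [66.33, 33.53, 92.80]       # ARC-Easy, ARC-Challenge, SciQ
commonsense = [57.35, 43.71, 59.59]   # HellaSwag, SIQA, WinoGrande
reading = [34.35, 36.20]              # RACE, OpenbookQA

def mean(xs):
    return sum(xs) / len(xs)

print(f"General Knowledge: {mean(general):.4f}")                          # 64.2200
print(f"Commonsense:       {mean(commonsense):.4f}")                      # 53.5500
print(f"Reading:           {mean(reading):.4f}")                          # 35.2750 -> 35.28
print(f"Overall (8 tasks): {mean(general + commonsense + reading):.4f}")  # 52.9825 -> 52.98
```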
Scaling Improvements
Compared to the 1.3B random baseline (30B tokens), absolute gains in percentage points:
- General Knowledge: +11.43 (52.79% → 64.22%)
- Commonsense Reasoning: +9.61 (43.94% → 53.55%)
- Reading Comprehension: +5.26 (30.02% → 35.28%)
- Overall Average: +9.20 (43.78% → 52.98%)
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-3b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
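Because the model is primarily a baseline for comparisons, scoring text is often more useful than free-form generation. A minimal sketch of loss/perplexity computation, reusing the `model` and `tokenizer` loaded above (the example sentence is arbitrary):

```python
import torch

# Score a text with the model and tokenizer loaded above; lower loss = more likely.
text = "Large language models are trained on web-scale text corpora."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, labels=enc.input_ids)   # Hugging Face shifts labels internally

print(f"mean cross-entropy: {out.loss.item():.3f}")
print(f"perplexity:         {torch.exp(out.loss).item():.1f}")
```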
Research Context
This model serves as a crucial scaling baseline in the Meta-rater research:
- Scale Validation: Demonstrates that data selection benefits persist at larger scales
- Efficiency Comparison: Meta-rater models show consistent advantages even with increased parameters
- Performance Ceiling: Establishes how far random selection alone reaches at this scale, the reference point that data selection methods must beat
Key Scaling Findings
- Data Selection Benefits Persist: Meta-rater maintains advantages at 3.3B scale
- Improved Absolute Performance: Substantial gains from increased model size
- Knowledge Tasks: Particularly strong improvements in knowledge-intensive evaluations
- Efficiency Gains: Meta-rater still provides meaningful improvements over random selection
Applications
This model can be used for:
- Scaling research and baseline comparisons
- General language modeling with improved capabilities
- Research on training efficiency at larger scales
- Educational purposes for understanding scale effects
- Benchmark establishment for 3.3B parameter models
Strengths
- Significantly improved performance over smaller baselines
- Strong knowledge retention and reasoning capabilities
- Robust performance across diverse task categories
- Valuable reference point for scaling experiments
Limitations
- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens); longer inputs must be truncated (see the sketch after this list)
- No instruction tuning or safety alignment
- High computational requirements for training
- Performance still lower than models trained with curated data selection
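Given the 1,024-token context window, longer inputs should be truncated at tokenization time. A minimal sketch (the repeated sentence is just a stand-in for a long document):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("opendatalab/meta-rater-3b-random")

# Inputs longer than the context window must be truncated to 1,024 tokens.
long_text = "The quick brown fox jumps over the lazy dog. " * 500
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    truncation=True,
    max_length=1024,   # matches the model's 1,024-token context window
)
print(inputs.input_ids.shape)   # torch.Size([1, 1024])
```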
Comparison with Meta-rater
Compared to the equivalent Meta-rater 3.3B model:
- Overall Average: 54.71% (Meta-rater) vs. 52.98% (random) = +1.73 percentage points for Meta-rater
- General Knowledge: 67.51% vs. 64.22% = +3.29 percentage points
- Efficiency: Meta-rater achieves better performance with the same computational resources
Citation
If you use this model in your research, please cite:
```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.