Readability Rating Model
This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This model is a fine-tuned version of ModernBERT-base designed to evaluate the Readability dimension of text quality on a 0-5 scale. Readability measures how easily a reader can understand a written text, considering factors such as clarity, coherence, vocabulary complexity, and sentence structure.
Model Details
- Base Model: ModernBERT-base
- Parameters: 149M
- Context Window: 4,096 tokens
- Task: Text quality rating (regression)
- Score Range: 0-5 (continuous)
- Performance: 87.47% F1 score, 94.13% accuracy
Rating Scale
The model uses an additive 5-point rating system:
- 0: Not readable at all
- 1: Somewhat readable but contains significant clarity or coherence issues, complex vocabulary, or numerous errors
- 2: Generally clear and coherent with occasional grammar, spelling errors, or convoluted structures
- 3: Clear and coherent for the most part, using appropriate vocabulary with minor grammar/spelling issues
- 4: Very clear and coherent with very few or no errors, proper punctuation and easy-to-follow structures
- 5: Outstanding clarity and coherence, effective communication with minimal errors that don't interfere with understanding
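Because the model produces a continuous value rather than a discrete label, a score must be clamped and rounded if you want to report one of the integer levels above. A minimal sketch, assuming the score comes from the Usage example below; the clamping-and-rounding policy is an illustrative choice, not part of the model:

```python
def to_level(score: float) -> int:
    """Clamp a continuous readability score to [0, 5] and round to the nearest integer level."""
    return max(0, min(5, round(score)))
```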
Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-readability-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The weather today is sunny and warm. It's a perfect day for outdoor activities."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
    score = outputs.logits.squeeze().item()  # regression head: single continuous score

print(f"Readability Score: {score:.2f}")
```
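For scoring many documents at once, the same model and tokenizer can be applied in batches. A minimal sketch; the helper name, batch size, and padding settings are assumptions rather than part of the official usage:

```python
import torch

def score_batch(texts, tokenizer, model, batch_size=16, device="cpu"):
    """Return one readability score per input text, processed in batches."""
    model.to(device).eval()
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(
            batch, return_tensors="pt", truncation=True,
            max_length=4096, padding=True,
        ).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
        scores.extend(logits.squeeze(-1).tolist())
    return scores
```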
Training Details
- Training Data: 747,422 examples from the SlimPajama dataset
- Annotation Model: Llama-3.3-70B-Instruct
- Training Epochs: 10
- Evaluation Split: 93,428 test examples
- Data Split: 8:1:1 (train:dev:test)
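The 8:1:1 ratio means 80% of the annotated examples are used for training and 10% each for development and testing. A minimal sketch of how such a split can be produced; the helper name and random seed are illustrative assumptions:

```python
import random

def split_8_1_1(examples, seed=42):
    """Shuffle and split a list of examples into train/dev/test at an 8:1:1 ratio."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])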
Applications
This model is particularly useful for:
- Content editing and proofreading assistance
- Educational material assessment for appropriate reading levels
- Web content optimization for user experience
- Data curation for language model training focusing on well-written text (see the filtering sketch after this list)
- Accessibility evaluation for diverse reading audiences
- Writing quality assessment tools
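For data curation specifically, a score threshold can be used to keep only well-written documents. A minimal sketch that reuses the model, tokenizer, and torch import from the Usage section above; the threshold value and helper name are assumptions:

```python
def filter_corpus(texts, tokenizer, model, threshold=3.0):
    """Keep only texts whose predicted readability score meets the threshold."""
    kept = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=4096)
        with torch.no_grad():
            score = model(**inputs).logits.squeeze().item()
        if score >= threshold:
            kept.append(text)
    return kept
```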
What the Model Evaluates
The model considers several linguistic factors:
- Sentence structure complexity and clarity
- Vocabulary appropriateness and accessibility
- Grammar and spelling accuracy
- Text coherence and logical flow
- Punctuation usage and effectiveness
What the Model Does NOT Consider
- The specific language the text is written in
- The length of the text
- Usage of placeholders for data privacy or safety
- Content topic or subject matter
Limitations
- Designed primarily for English text
- May not capture domain-specific readability requirements
- Performance may vary for highly technical or specialized content
- Should be used as one factor among others in comprehensive text quality assessment
Citation
If you use this model in your research, please cite:
```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```
License
This model is released under the same license as the base ModernBERT model.
Contact
For questions or issues, please contact the authors or open an issue in the repository.