
Readability Rating Model

This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.

Code: https://github.com/opendatalab/Meta-rater

Model Description

This model is a fine-tuned version of ModernBERT-base designed to evaluate the Readability dimension of text quality on a continuous 0-5 scale. Readability measures the ease with which a reader can understand a written text, considering factors such as clarity, coherence, vocabulary complexity, and sentence structure.

Model Details

  • Base Model: ModernBERT-base
  • Parameters: 149M
  • Context Window: 4,096 tokens
  • Task: Text quality rating (regression)
  • Score Range: 0-5 (continuous)
  • Performance: 87.47% F1 score, 94.13% accuracy

Rating Scale

The model uses an additive 5-point rating system (a helper for mapping continuous scores onto these bands is sketched after the list):

  • 0: Not readable at all
  • 1: Somewhat readable but contains significant clarity or coherence issues, complex vocabulary, or numerous errors
  • 2: Generally clear and coherent with occasional grammar, spelling errors, or convoluted structures
  • 3: Clear and coherent for the most part, using appropriate vocabulary with minor grammar/spelling issues
  • 4: Very clear and coherent with very few or no errors, proper punctuation and easy-to-follow structures
  • 5: Outstanding clarity and coherence, effective communication with minimal errors that don't interfere with understanding
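
Because the model's output is continuous, downstream code often needs to snap a score to the nearest rubric level. Below is a minimal, hypothetical helper for doing so; the band labels paraphrase the rubric above, and none of these names come from the official repository.

# Map a continuous 0-5 score onto the nearest rubric band (illustrative labels)
READABILITY_BANDS = {
    0: "not readable",
    1: "somewhat readable, significant issues",
    2: "generally clear, occasional errors",
    3: "mostly clear and coherent, minor issues",
    4: "very clear, few or no errors",
    5: "outstanding clarity and coherence",
}

def rubric_band(score: float) -> str:
    level = min(max(round(score), 0), 5)  # round, then clamp to the 0-5 range
    return f"{level}: {READABILITY_BANDS[level]}"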

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-readability-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The weather today is sunny and warm. It's a perfect day for outdoor activities."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
    score = outputs.logits.squeeze().item()  # regression head: a single continuous logit, not class indices

print(f"Readability Score: {score:.2f}")

Training Details

  • Training Data: 747,422 examples from SlimPajama dataset
  • Annotation Model: Llama-3.3-70B-Instruct
  • Training Epochs: 10
  • Evaluation Split: 93,428 test examples
  • Data Split: 8:1:1 (train:dev:test); a reconstruction sketch follows this list
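
The paper does not publish the exact splitting code, but an 8:1:1 split can be reconstructed with the datasets library along these lines; the file path and seed here are assumptions, not details from the paper.

# Hypothetical reconstruction of the 8:1:1 train/dev/test split
from datasets import load_dataset

ds = load_dataset("json", data_files="annotated_slimpajama.jsonl")["train"]  # assumed path
split = ds.train_test_split(test_size=0.2, seed=42)                # 80% train, 20% held out
dev_test = split["test"].train_test_split(test_size=0.5, seed=42)  # split held-out half/half
train, dev, test = split["train"], dev_test["train"], dev_test["test"]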

Applications

This model is particularly useful for:

  • Content editing and proofreading assistance
  • Educational material assessment for appropriate reading levels
  • Web content optimization for user experience
  • Data curation for language model training focused on well-written text (see the filtering sketch after this list)
  • Accessibility evaluation for diverse reading audiences
  • Writing quality assessment tools
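
For the data curation use case, one plausible pattern is to score each candidate document and keep those above a threshold. This is a hedged sketch reusing the model, tokenizer, and torch import from the Usage section; the function name and the 3.0 threshold are illustrative, not taken from the paper.

# Keep documents whose predicted readability meets a chosen threshold
def filter_by_readability(documents, threshold=3.0):
    kept = []
    for doc in documents:
        inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=4096)
        with torch.no_grad():
            score = model(**inputs).logits.squeeze().item()
        if score >= threshold:
            kept.append((score, doc))
    return kept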

What the Model Evaluates

The model considers several linguistic factors:

  • Sentence structure complexity and clarity
  • Vocabulary appropriateness and accessibility
  • Grammar and spelling accuracy
  • Text coherence and logical flow
  • Punctuation usage and effectiveness

What the Model Does NOT Consider

  • The specific language the text is written in
  • The length of the text
  • Usage of placeholders for data privacy or safety
  • Content topic or subject matter

Limitations

  • Designed primarily for English text
  • May not capture domain-specific readability requirements
  • Performance may vary for highly technical or specialized content
  • Should be used as one factor among others in comprehensive text quality assessment

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

License

This model is released under the same license as the base ModernBERT model.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
