metadata

library_name: scikit-learn
tags:
  - readability
  - text-analysis
  - grade-level
  - hybrid-model
  - ridge-regression
  - random-forest
license: mit

Hybrid Readability Assessment Model

A hybrid machine learning model that estimates text readability as a grade level. It combines Ridge regression and a Random Forest, using each model in the grade range where it performs better.

Model Description

This hybrid model uses a two-stage prediction approach:

Primary Decision Maker: Ridge regression (alpha=10.0) makes the initial grade prediction
Refinement: If Ridge predicts grade ≤ 5, Random Forest provides the final prediction
High Grades: If Ridge predicts grade > 5, the Ridge prediction is used directly

This approach leverages the strengths of both models:

Ridge regression: More accurate at higher grade levels, with stable linear predictions
Random Forest: More accurate at lower grade levels, where feature interactions are more complex

Model Performance

Test MAE: 0.513
Test R²: 0.775
Training Samples: 2,500
Feature Count: 16
Created: 2025-07-26T23:16:02.628443
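
For readers who want to reproduce figures like the MAE and R² above on their own held-out data, here is a minimal sketch using scikit-learn's metric functions. The arrays are made-up illustration values, not the model's test set or its results.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical grade labels and predictions, for illustration only
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 8.5]

# MAE: average absolute error, in grade levels
mae = mean_absolute_error(y_true, y_pred)

# R²: share of the variance in y_true explained by the predictions
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}")  # MAE: 0.500
print(f"R²: {r2:.3f}")    # R²: 0.950
```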

Model Size

File Size: 6.0 MB

Features

The model uses 16 features including:

Traditional Readability Metrics: Flesch-Kincaid, Coleman-Liau, ARI, SMOG, Gunning Fog, Dale-Chall
Age of Acquisition (AoA) Features: Mean, median, percentiles, difficult word ratios
Source Indicators: Dataset source information
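
To make the traditional metrics concrete, here is a minimal sketch of one of them, the Automated Readability Index (ARI), using its standard formula. The function name and the simple regex tokenization are assumptions for illustration; this is not the feature extraction code used to train the model, which may tokenize differently.

```python
import re


def automated_readability_index(text):
    """Approximate ARI: 4.71 * (chars/words) + 0.5 * (words/sentences) - 21.43."""
    # Naive sentence split on terminal punctuation (an illustrative assumption)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, digits, and apostrophes
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    # Count only letters and digits as characters
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


score = automated_readability_index("The cat sat on the mat. It was happy.")
print(round(score, 2))
```

Very simple text can score below zero, since the formula is a linear fit rather than a bounded scale.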

Usage

import joblib
from huggingface_hub import hf_hub_download

# Download the model
model_path = hf_hub_download(
    repo_id="yimingwang123/hybrid-grade-assessment-model",
    filename="hybrid_readability_model.pkl"
)

# Load the model
model_data = joblib.load(model_path)

# Extract components
ridge_model = model_data['ridge_model']
rf_model = model_data['rf_model']
scaler = model_data['scaler']
feature_columns = model_data['feature_columns']

# Make predictions: the loaded file holds the fitted components only, so
# apply the routing rule from the "Hybrid Logic" section yourself.
# See the training script for the full implementation.
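
Since the snippet above stops before prediction, here is a minimal end-to-end sketch of how the loaded components might be combined. It rests on assumptions the model card does not confirm: that both models were fitted on scaler-transformed features, and that the input columns follow the order in feature_columns. The hybrid_predict function and demo_model_data dictionary are hypothetical stand-ins built from synthetic data so the example runs without downloading anything.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler


def hybrid_predict(model_data, X, threshold=5.0):
    """Route each row to Random Forest or Ridge based on the Ridge prediction.

    Assumption: both models expect scaler-transformed features.
    """
    X_scaled = model_data["scaler"].transform(X)
    ridge_pred = model_data["ridge_model"].predict(X_scaled)
    rf_pred = model_data["rf_model"].predict(X_scaled)
    # Ridge decides the branch: at or below the threshold, use Random Forest
    return np.where(ridge_pred <= threshold, rf_pred, ridge_pred)


# Synthetic stand-in for the downloaded dictionary (not the real model)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))
y_train = 6.0 + 2.0 * X_train[:, 0] + rng.normal(scale=0.3, size=200)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
demo_model_data = {
    "scaler": scaler,
    "ridge_model": Ridge(alpha=10.0).fit(X_train_scaled, y_train),
    "rf_model": RandomForestRegressor(n_estimators=20, random_state=0).fit(
        X_train_scaled, y_train
    ),
    "feature_columns": [f"feature_{i}" for i in range(16)],
}

# Five new rows with the same 16 columns, in feature_columns order
X_new = rng.normal(size=(5, 16))
predictions = hybrid_predict(demo_model_data, X_new)
print(predictions)
```

With the real model, pass the dictionary returned by joblib.load in place of demo_model_data, and check the training script to confirm whether scaling applies to both models.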

Training Data

The model was trained on a combination of:

WeeBit Corpus: Web texts from WeeklyReader and BBC Bitesize, labeled by target age group
CLEAR Corpus (CommonLit Ease of Readability): Reading excerpts with teacher-derived readability scores

Hybrid Logic

def predict_hybrid(ridge_pred, rf_pred):
    if ridge_pred <= 5.0:
        return rf_pred  # Use Random Forest for lower grades
    else:
        return ridge_pred  # Use Ridge for higher grades
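
A short worked example may help show how the rule behaves at the boundary. The prediction values below are invented for illustration; the function is repeated in compact form so the snippet runs on its own.

```python
def predict_hybrid(ridge_pred, rf_pred):
    # Same rule as above: Random Forest at or below grade 5, Ridge otherwise
    return rf_pred if ridge_pred <= 5.0 else ridge_pred


# Hypothetical per-text predictions from each model
ridge_preds = [3.2, 5.0, 7.4]
rf_preds = [2.8, 4.6, 6.9]

final = [predict_hybrid(r, f) for r, f in zip(ridge_preds, rf_preds)]
print(final)  # [2.8, 4.6, 7.4]
```

Note that a Ridge prediction of exactly 5.0 is routed to Random Forest, because the comparison is inclusive.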

Citation

If you use this model in your research, please cite:

@misc{hybrid-readability-model,
  title={Hybrid Readability Assessment Model},
  author={Grade-Aware LLM Project},
  year={2025},
  url={https://huggingface.co/yimingwang123/hybrid-grade-assessment-model}
}

License

This model is released under the MIT License.

Contact

For questions about this model, please open an issue in the repository.