---
library_name: transformers
tags:
- java
- python
- javascript
- C/C++
license: apache-2.0
datasets:
- TempestTeam/dataset-quality
language:
- fr
- en
- es
base_model:
- EuroBERT/EuroBERT-210m
---

# Automatic Evaluation Models for Textual Data Quality (NL & CL)

Automatically assess the quality of textual data on a clear, intuitive scale, for both natural language (NL) and code language (CL). We compare two distinct approaches:

- A **unified model** that handles NL and CL jointly: [EuroBERT-210m-Quality](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality)
- A **dual-model approach** that treats NL and CL separately:
  - [EuroBERT-210m-Quality-NL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-NL) for natural language
  - [EuroBERT-210m-Quality-CL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-CL) for code language

## Classification Categories:

- **Harmful**: Harmful data, potentially incorrect or dangerous.
- **Low**: Low-quality data with major issues.
- **Medium**: Medium-quality data, improvable but acceptable.
- **High**: Good to very good quality data, ready for use without reservation.

## Supported Languages:

- **Natural Language**: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸
- **Code Language**: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️

## Performance

- **F1-score: Unified Model (NL + CL)**

| Category | Global (NL + CL) | NL | CL |
|:------------:|:----------------:|:-------------:|:-------------:|
| **Harmful** | 0.86 | 0.93 | 0.79 |
| **Low** | 0.62 | 0.81 | 0.40 |
| **Medium** | 0.63 | 0.78 | 0.50 |
| **High** | 0.77 | 0.81 | 0.74 |
| **Accuracy** | **0.73** | **0.83** | **0.62** |

- **F1-score: Separate Models**

| Category | Global (NL + CL) | NL | CL |
|:------------:|:----------------:|:-------------:|:-------------:|
| **Harmful** | 0.83 | 0.93 | 0.72 |
| **Low** | 0.64 | 0.76 | 0.53 |
| **Medium** | 0.63 | 0.76 | 0.52 |
| **High** | 0.79 | 0.81 | 0.76 |
| **Accuracy** | **0.73** | **0.82** | **0.63** |

## Key Performance Metrics:

- **Unified Model (NL + CL)**:
  - Overall accuracy: ~73%
  - High reliability on harmful data (F1-score: 0.86)
- **Separate Models**:
  - **Natural Language (NL)**: ~82% accuracy
    - Excellent performance on harmful data (F1-score: 0.93)
  - **Code Language (CL)**: ~63% accuracy
    - Good detection of harmful data (F1-score: 0.72)

## Training Dataset:

- Public dataset: [TempestTeam/dataset-quality](https://huggingface.co/datasets/TempestTeam/dataset-quality)

## Common Use Cases:

- Automatic validation of text corpora before integration into NLP or code-generation pipelines (see the sketches at the end of this card).
- Quality assessment of community contributions (forums, Stack Overflow, GitHub).
- Automated pre-processing to improve the performance of NLP or code-generation systems.

## Recommendations:

- For specialized contexts, use the separate NL and CL models for optimal results.
- The unified model is suitable for quick assessments when the data context is unknown or mixed.

## Citation

If you use these models in your projects, please cite or link back to them on the Hugging Face Hub.
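
## Quick Start (Sketch)

A minimal loading sketch with 🤗 Transformers. The `trust_remote_code=True` flag and the label names (`Harmful`, `Low`, `Medium`, `High`) are assumptions based on the EuroBERT base model and the categories above; check the repository's `config.json` for the exact `id2label` mapping.

```python
# Sketch: scoring a single text with the unified quality classifier.
# Assumptions: the checkpoint exposes a sequence-classification head and
# requires trust_remote_code=True (inherited from EuroBERT); the label
# names come from this card, not from an inspected config.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "TempestTeam/EuroBERT-210m-Quality"  # unified NL + CL model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)

text = "def add(a, b):\n    return a + b"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])  # e.g. "High"
```

The same call works for code snippets and prose alike; for mixed or unknown data this unified checkpoint avoids having to route inputs between the separate NL and CL models, in line with the recommendations above.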
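## Corpus Filtering Example (Sketch)

A sketch of the corpus-validation use case: keep only texts rated `Medium` or `High` before they enter a training pipeline. The dataset name and its `text` column are hypothetical placeholders, and the returned label strings assume the `id2label` mapping discussed above.

```python
# Sketch: filtering a text corpus with the NL-only quality model.
# Hypothetical dataset/column ("ag_news", "text"); the label set
# {"Medium", "High"} is an assumption taken from this model card.
from datasets import load_dataset
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="TempestTeam/EuroBERT-210m-Quality-NL",
    trust_remote_code=True,
    truncation=True,
)

ds = load_dataset("ag_news", split="train[:100]")  # any text dataset
keep = {"Medium", "High"}
filtered = ds.filter(lambda ex: clf(ex["text"])[0]["label"] in keep)
print(f"kept {len(filtered)}/{len(ds)} examples")
```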