---
|
library_name: transformers |
|
tags: |
|
- java |
|
- python |
|
- javascript |
|
- C/C++ |
|
license: apache-2.0 |
|
datasets: |
|
- TempestTeam/dataset-quality |
|
language: |
|
- fr |
|
- en |
|
- es |
|
base_model: |
|
- EuroBERT/EuroBERT-210m |
|
--- |
|
|
|
# Automatic Evaluation Models for Textual Data Quality (NL & CL) |
|
|
|
Automatically assess the quality of textual data on a clear, four-level scale, suitable for both natural language (NL) and code language (CL).
|
We compare two distinct approaches: |
|
- A **unified model** that handles both NL and CL jointly: [EuroBERT-210m-Quality](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality) |
|
- A **dual-model approach** that treats NL and CL separately:

  - [EuroBERT-210m-Quality-NL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-NL) for natural language

  - [EuroBERT-210m-Quality-CL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-CL) for code language
|
|
|
## Classification Categories: |
|
- **Harmful**: Potentially harmful, incorrect, or dangerous data.
|
- **Low**: Low-quality data with major issues. |
|
- **Medium**: Medium quality, improvable but acceptable. |
|
- **High**: Good to very good quality data, ready for use without reservation. |
|
|
|
## Supported Languages: |
|
- **Natural Language**: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸 |
|
- **Code Language**: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️ |
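
The models can be used like any `transformers` sequence classifier. The sketch below loads the unified model and maps the highest logit back to a category; it assumes the checkpoint ships a standard classification head whose `config.id2label` contains the four categories above (verify against the actual config before relying on it).

```python
# Hedged usage sketch for the unified model; id2label contents are an
# assumption to verify against the checkpoint's config.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "TempestTeam/EuroBERT-210m-Quality"

def top_label(logits: torch.Tensor, id2label: dict) -> str:
    """Return the category whose logit is highest."""
    return id2label[int(logits.argmax(dim=-1).item())]

if __name__ == "__main__":
    # EuroBERT uses custom modelling code, hence trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, trust_remote_code=True
    )
    model.eval()

    text = "def add(a, b):\n    return a + b"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    print(top_label(logits, model.config.id2label))
```

For specialized corpora, swap `MODEL_ID` for the NL- or CL-specific checkpoint linked above.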
|
|
|
## Performance |
|
|
|
- **f1-score: Unified Model (NL + CL)** |
|
|
|
| Category     | Global (NL + CL) | NL       | CL       |
|:------------:|:----------------:|:--------:|:--------:|
| **Harmful**  | 0.81             | 0.87     | 0.75     |
| **Low**      | 0.60             | 0.72     | 0.44     |
| **Medium**   | 0.60             | 0.74     | 0.49     |
| **High**     | 0.74             | 0.77     | 0.72     |
| **Accuracy** | **0.70**         | **0.78** | **0.62** |
|
|
|
|
|
- **f1-score: Separate Models** |
|
|
|
| Category     | Global (NL + CL) | NL       | CL       |
|:------------:|:----------------:|:--------:|:--------:|
| **Harmful**  | 0.83             | 0.89     | 0.78     |
| **Low**      | 0.59             | 0.71     | 0.46     |
| **Medium**   | 0.63             | 0.77     | 0.49     |
| **High**     | 0.76             | 0.79     | 0.73     |
| **Accuracy** | **0.71**         | **0.80** | **0.63** |
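
The tables report per-category f1-scores plus overall accuracy. As a reference for how such numbers are computed, the sketch below runs the same metrics with scikit-learn on toy labels (not the actual evaluation data).

```python
# Illustrative metric computation on toy predictions; the label set matches
# the model card's four categories, the data is made up.
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["Harmful", "Low", "Medium", "High"]
y_true = ["Harmful", "Low", "Medium", "High", "High", "Low"]
y_pred = ["Harmful", "Medium", "Medium", "High", "High", "Low"]

# One f1-score per category, in LABELS order.
per_class = f1_score(y_true, y_pred, labels=LABELS, average=None)
acc = accuracy_score(y_true, y_pred)

for label, f1 in zip(LABELS, per_class):
    print(f"{label}: {f1:.2f}")
print(f"Accuracy: {acc:.2f}")
```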
|
|
|
|
|
## Key Performance Metrics: |
|
- **Unified Model (NL + CL)**: |
|
- Overall accuracy: ~70%
|
- High reliability on harmful data (f1-score: 0.81) |
|
|
|
- **Separate Models**: |
|
- **Natural Language (NL)**: ~80% accuracy
|
- Excellent performance on harmful data (f1-score: 0.89) |
|
- **Code Language (CL)**: ~63% accuracy |
|
- Good detection of harmful data (f1-score: 0.78) |
|
|
|
## Training Dataset: |
|
- Public dataset available: [TempestTeam/dataset-quality](https://huggingface.co/datasets/TempestTeam/dataset-quality) |
|
|
|
## Common Use Cases: |
|
- Automatic validation of text corpora before integration into NLP or code generation pipelines. |
|
- Quality assessment of community contributions (forums, Stack Overflow, GitHub). |
|
- Automated pre-processing to enhance NLP or code generation system performance. |
|
|
|
## Recommendations: |
|
- For specialized contexts, use the separate NL and CL models for optimal results. |
|
- The unified model is suitable for quick assessments when the data context is unknown or mixed. |
|
|
|
## Citation |
|
Please cite or link back to this model on Hugging Face Hub if used in your projects. |
|
|