---
library_name: transformers
tags:
- java
- python
- javascript
- C/C++
license: apache-2.0
datasets:
- TempestTeam/dataset-quality
language:
- fr
- en
- es
base_model:
- EuroBERT/EuroBERT-210m
---
# Automatic Evaluation Models for Textual Data Quality (NL & CL)
Automatically assess the quality of textual data using a clear and intuitive scale, adapted for both natural language (NL) and code language (CL).
We compare two distinct approaches:
- A **unified model** that handles both NL and CL jointly: [EuroBERT-210m-Quality](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality)
- A **dual-model approach** that treats NL and CL separately:
- [EuroBERT-210m-Quality-NL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-NL) for natural language
- [EuroBERT-210m-Quality-CL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-CL) for code language.
## Classification Categories:
- **Harmful**: potentially incorrect or dangerous content.
- **Low**: low-quality data with major issues.
- **Medium**: acceptable quality with room for improvement.
- **High**: good to very good quality, ready for use without reservation.
## Supported Languages:
- **Natural Language**: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸
- **Code Language**: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️
## Performance
- **f1-score: Unified Model (NL + CL)**

| Category | Global (NL + CL) | NL | CL |
|:------------:|:----------------:|:-------------:|:-------------:|
| **Harmful** | 0.81 | 0.87 | 0.75 |
| **Low** | 0.60 | 0.72 | 0.44 |
| **Medium** | 0.60 | 0.74 | 0.49 |
| **High** | 0.74 | 0.77 | 0.72 |
| **Accuracy** | **0.70** | **0.78** | **0.62** |
- **f1-score: Separate Models**

| Category | Global (NL + CL) | NL | CL |
|:------------:|:----------------:|:-------------:|:-------------:|
| **Harmful** | 0.83 | 0.89 | 0.78 |
| **Low** | 0.59 | 0.71 | 0.46 |
| **Medium** | 0.63 | 0.77 | 0.49 |
| **High** | 0.76 | 0.79 | 0.73 |
| **Accuracy** | **0.71** | **0.80** | **0.63** |
## Key Performance Metrics:
- **Unified Model (NL + CL)**:
  - Overall accuracy: ~70%
  - High reliability on harmful data (f1-score: 0.81)
- **Separate Models**:
  - **Natural Language (NL)**: ~80% accuracy
    - Excellent performance on harmful data (f1-score: 0.89)
  - **Code Language (CL)**: ~63% accuracy
    - Good detection of harmful data (f1-score: 0.78)
## Training Dataset:
- Public dataset available: [TempestTeam/dataset-quality](https://huggingface.co/datasets/TempestTeam/dataset-quality)
## Common Use Cases:
- Automatic validation of text corpora before integration into NLP or code generation pipelines.
- Quality assessment of community contributions (forums, Stack Overflow, GitHub).
- Automated pre-processing to enhance NLP or code generation system performance.
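The corpus-validation use case above can be sketched as a simple quality gate over the model's predicted label. This is a sketch, not official usage: the label names (`Harmful`/`Low`/`Medium`/`High`) are taken from the categories listed in this card, the `is_usable` helper is hypothetical, and whether `trust_remote_code=True` is needed should be checked against the model's configuration.

```python
# Quality labels in increasing order, as listed in this card (assumed to
# match the model's id2label mapping; verify against model.config.id2label).
LABELS = ["Harmful", "Low", "Medium", "High"]

def load_quality_pipeline(model_id: str = "TempestTeam/EuroBERT-210m-Quality"):
    """Build a text-classification pipeline for the unified quality model.

    Imported lazily so this sketch can be read/run without transformers
    installed; trust_remote_code is an assumption for EuroBERT's custom code.
    """
    from transformers import pipeline
    return pipeline("text-classification", model=model_id, trust_remote_code=True)

def is_usable(label: str, minimum: str = "Medium") -> bool:
    """Keep a sample only if its predicted quality meets the minimum level."""
    return LABELS.index(label) >= LABELS.index(minimum)
```

A filtering loop would then call `load_quality_pipeline()` once, classify each document, and keep only those for which `is_usable(prediction["label"])` is true.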
## Recommendations:
- For specialized contexts, use the separate NL and CL models for optimal results.
- The unified model is suitable for quick assessments when the data context is unknown or mixed.
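When the data context *is* known, the dual-model recommendation can be implemented by routing each document to the NL or CL model. A minimal sketch, assuming routing by file extension (the `pick_model` helper and the extension set are illustrative, not part of the released models):

```python
import os

# Extensions for the code languages this card lists as supported (assumption:
# routing by extension is a reasonable proxy for "code vs. natural language").
CODE_EXTS = {".py", ".java", ".js", ".c", ".cpp", ".h", ".hpp"}

def pick_model(path: str) -> str:
    """Return the model ID for the separate-model setup based on file type."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix in CODE_EXTS:
        return "TempestTeam/EuroBERT-210m-Quality-CL"
    return "TempestTeam/EuroBERT-210m-Quality-NL"
```

Mixed or unknown sources would instead fall back to the unified `TempestTeam/EuroBERT-210m-Quality` model, as recommended above.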
## Citation
Please cite or link back to this model on Hugging Face Hub if used in your projects.