---
library_name: transformers
tags:
- java
- python
- javascript
- C/C++
license: apache-2.0
datasets:
- TempestTeam/dataset-quality
language:
- fr
- en
- es
base_model:
- EuroBERT/EuroBERT-210m
---

# Automatic Evaluation Models for Textual Data Quality (NL & CL)

Automatically assess the quality of textual data on a clear, intuitive scale, for both natural language (NL) and code language (CL). We compare two distinct approaches:

- A **unified model** that handles NL and CL jointly: [EuroBERT-210m-Quality](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality)
- A **dual-model approach** that treats NL and CL separately:
  - [EuroBERT-210m-Quality-NL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-NL) for natural language
  - [EuroBERT-210m-Quality-CL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-CL) for code language

## Classification Categories:

- **Harmful**: Harmful data, potentially incorrect or dangerous.
- **Low**: Low-quality data with major issues.
- **Medium**: Medium-quality data, improvable but acceptable.
- **High**: Good to very good quality data, ready for use without reservation.

## Supported Languages:

- **Natural Language**: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸
- **Code Language**: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️

## Performance

- **F1-score: Unified Model (NL + CL)**

| Category | Global (NL + CL) | NL | CL |
|:------------:|:----------------:|:-------------:|:-------------:|
| **Harmful** | 0.86 | 0.93 | 0.79 |
| **Low** | 0.62 | 0.81 | 0.40 |
| **Medium** | 0.63 | 0.78 | 0.50 |
| **High** | 0.77 | 0.81 | 0.74 |
| **Accuracy** | **0.73** | **0.83** | **0.62** |

- **F1-score: Separate Models**

| Category | Global (NL + CL) | NL | CL |
|:------------:|:----------------:|:-------------:|:-------------:|
| **Harmful** | 0.83 | 0.93 | 0.72 |
| **Low** | 0.64 | 0.76 | 0.53 |
| **Medium** | 0.63 | 0.76 | 0.52 |
| **High** | 0.79 | 0.81 | 0.76 |
| **Accuracy** | **0.73** | **0.82** | **0.63** |

## Key Performance Metrics:

- **Unified Model (NL + CL)**:
  - Overall accuracy: ~73%
  - High reliability on harmful data (F1-score: 0.86)
- **Separate Models**:
  - **Natural Language (NL)**: ~82% accuracy
    - Excellent performance on harmful data (F1-score: 0.93)
  - **Code Language (CL)**: ~63% accuracy
    - Good detection of harmful data (F1-score: 0.72)

## Training Dataset:

- Public dataset: [TempestTeam/dataset-quality](https://huggingface.co/datasets/TempestTeam/dataset-quality)

## Common Use Cases:

- Automatic validation of text corpora before integration into NLP or code-generation pipelines (see the sketches at the end of this card).
- Quality assessment of community contributions (forums, Stack Overflow, GitHub).
- Automated pre-processing to improve the performance of NLP or code-generation systems.

## Recommendations:

- For specialized contexts, use the separate NL and CL models for optimal results.
- The unified model is suitable for quick assessments when the data context is unknown or mixed.

## Citation

If you use these models in your projects, please cite or link back to them on the Hugging Face Hub.
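
## Quick Start (Sketch)

A minimal loading sketch with 🤗 Transformers. The `trust_remote_code=True` flag and the label names (`Harmful`, `Low`, `Medium`, `High`) are assumptions based on the EuroBERT base model and the categories above; check the repository's `config.json` for the exact `id2label` mapping.

```python
# Sketch: scoring a single text with the unified quality classifier.
# Assumptions: the checkpoint exposes a sequence-classification head and
# requires trust_remote_code=True (inherited from EuroBERT); the label
# names come from this card, not from an inspected config.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "TempestTeam/EuroBERT-210m-Quality"  # unified NL + CL model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)

text = "def add(a, b):\n    return a + b"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])  # e.g. "High"
```

The same call works for code snippets and prose alike; for mixed or unknown data this unified checkpoint avoids having to route inputs between the separate NL and CL models, in line with the recommendations above.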
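## Corpus Filtering Example (Sketch)

A sketch of the corpus-validation use case: keep only texts rated `Medium` or `High` before they enter a training pipeline. The dataset name and its `text` column are hypothetical placeholders, and the returned label strings assume the `id2label` mapping discussed above.

```python
# Sketch: filtering a text corpus with the NL-only quality model.
# Hypothetical dataset/column ("ag_news", "text"); the label set
# {"Medium", "High"} is an assumption taken from this model card.
from datasets import load_dataset
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="TempestTeam/EuroBERT-210m-Quality-NL",
    trust_remote_code=True,
    truncation=True,
)

ds = load_dataset("ag_news", split="train[:100]")  # any text dataset
keep = {"Medium", "High"}
filtered = ds.filter(lambda ex: clf(ex["text"])[0]["label"] in keep)
print(f"kept {len(filtered)}/{len(ds)} examples")
```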