---
|
library_name: transformers |
|
tags: |
|
- java |
|
- python |
|
- javascript |
|
- C/C++ |
|
license: apache-2.0 |
|
datasets: |
|
- TempestTeam/dataset-quality |
|
language: |
|
- fr |
|
- en |
|
- es |
|
base_model: |
|
- EuroBERT/EuroBERT-210m |
|
--- |
|
|
|
# Automatic Evaluation Models for Textual Data Quality (NL & CL) |
|
|
|
Automatically assess the quality of textual data on a clear, four-level scale, suitable for both natural language (NL) and code language (CL).
|
We compare two distinct approaches: |
|
- A **unified model** that handles both NL and CL jointly: [EuroBERT-210m-Quality](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality) |
|
- A **dual-model approach** that treats NL and CL separately:

  - [EuroBERT-210m-Quality-NL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-NL) for natural language

  - [EuroBERT-210m-Quality-CL](https://huggingface.co/TempestTeam/EuroBERT-210m-Quality-CL) for code language
|
|
|
## Classification Categories: |
|
- **Harmful**: Potentially harmful, incorrect, or dangerous data.
|
- **Low**: Low-quality data with major issues. |
|
- **Medium**: Medium quality, improvable but acceptable. |
|
- **High**: Good to very good quality data, ready for use without reservation. |
|
|
|
## Supported Languages: |
|
- **Natural Language**: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸 |
|
- **Code Language**: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️ |
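
The models can be used like any `transformers` sequence classifier. The sketch below loads the unified model and maps the highest logit back to a category; it assumes the checkpoint ships a standard classification head whose `config.id2label` contains the four categories above (verify against the actual config before relying on it).

```python
# Hedged usage sketch for the unified model; id2label contents are an
# assumption to verify against the checkpoint's config.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "TempestTeam/EuroBERT-210m-Quality"

def top_label(logits: torch.Tensor, id2label: dict) -> str:
    """Return the category whose logit is highest."""
    return id2label[int(logits.argmax(dim=-1).item())]

if __name__ == "__main__":
    # EuroBERT uses custom modelling code, hence trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, trust_remote_code=True
    )
    model.eval()

    text = "def add(a, b):\n    return a + b"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    print(top_label(logits, model.config.id2label))
```

For specialized corpora, swap `MODEL_ID` for the NL- or CL-specific checkpoint linked above.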
|
|
|
## Performance |
|
|
|
- **f1-score: Unified Model (NL + CL)** |
|
|
|
| Category     | Global (NL + CL) | NL       | CL       |
|:------------:|:----------------:|:--------:|:--------:|
| **Harmful**  | 0.81             | 0.87     | 0.75     |
| **Low**      | 0.60             | 0.72     | 0.44     |
| **Medium**   | 0.60             | 0.74     | 0.49     |
| **High**     | 0.74             | 0.77     | 0.72     |
| **Accuracy** | **0.70**         | **0.78** | **0.62** |
|
|
|
|
|
- **f1-score: Separate Models** |
|
|
|
| Category     | Global (NL + CL) | NL       | CL       |
|:------------:|:----------------:|:--------:|:--------:|
| **Harmful**  | 0.83             | 0.89     | 0.78     |
| **Low**      | 0.59             | 0.71     | 0.46     |
| **Medium**   | 0.63             | 0.77     | 0.49     |
| **High**     | 0.76             | 0.79     | 0.73     |
| **Accuracy** | **0.71**         | **0.80** | **0.63** |
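
The tables report per-category f1-scores plus overall accuracy. As a reference for how such numbers are computed, the sketch below runs the same metrics with scikit-learn on toy labels (not the actual evaluation data).

```python
# Illustrative metric computation on toy predictions; the label set matches
# the model card's four categories, the data is made up.
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["Harmful", "Low", "Medium", "High"]
y_true = ["Harmful", "Low", "Medium", "High", "High", "Low"]
y_pred = ["Harmful", "Medium", "Medium", "High", "High", "Low"]

# One f1-score per category, in LABELS order.
per_class = f1_score(y_true, y_pred, labels=LABELS, average=None)
acc = accuracy_score(y_true, y_pred)

for label, f1 in zip(LABELS, per_class):
    print(f"{label}: {f1:.2f}")
print(f"Accuracy: {acc:.2f}")
```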
|
|
|
|
|
## Key Performance Metrics: |
|
- **Unified Model (NL + CL)**: |
|
- Overall accuracy: ~70%
|
- High reliability on harmful data (f1-score: 0.81) |
|
|
|
- **Separate Models**: |
|
- **Natural Language (NL)**: ~80% accuracy
|
- Excellent performance on harmful data (f1-score: 0.89) |
|
- **Code Language (CL)**: ~63% accuracy |
|
- Good detection of harmful data (f1-score: 0.78) |
|
|
|
## Training Dataset: |
|
- Public dataset available: [TempestTeam/dataset-quality](https://huggingface.co/datasets/TempestTeam/dataset-quality) |
|
|
|
## Common Use Cases: |
|
- Automatic validation of text corpora before integration into NLP or code generation pipelines. |
|
- Quality assessment of community contributions (forums, Stack Overflow, GitHub). |
|
- Automated pre-processing to enhance NLP or code generation system performance. |
|
|
|
## Recommendations: |
|
- For specialized contexts, use the separate NL and CL models for optimal results. |
|
- The unified model is suitable for quick assessments when the data context is unknown or mixed. |
|
|
|
## Citation |
|
Please cite or link back to this model on Hugging Face Hub if used in your projects. |
|
|