---
|
language: |
|
- en |
|
- hu |
|
- de |
|
library_name: transformers |
|
tags: |
|
- text-classification |
|
- multilingual |
|
- distilbert |
|
- fine-tuned |
|
datasets: |
|
- custom |
|
model_name: EGD_distilbert-base-multilingual-cased |
|
model_type: distilbert-base-multilingual-cased |
|
license: apache-2.0 |
|
--- |
|
|
|
# EGD DistilBERT (Multilingual Cased) |
|
|
|
## Model Overview |
|
|
|
This model is based on **DistilBERT-base-multilingual-cased** and was **fine-tuned on English, Hungarian, and German** data to classify **European Parliament speeches** into rhetorical categories.
|
|
|
The model classifies text into three categories: |
|
- **0 - Other** (text that does not fit into moralist or realist categories) |
|
- **1 - Moralist** (arguments emphasizing moral reasoning) |
|
- **2 - Realist** (arguments emphasizing pragmatic or realist reasoning)
|
|
|
This model is useful for **analyzing political discourse and rhetorical styles** in multiple languages. |
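When working with predictions programmatically, it is convenient to keep the id-to-label mapping from the list above in a small dict. A minimal sketch (the label names come from this card; the `label_name` helper is purely illustrative):

```python
# Mapping from class id to rhetorical category, as described above
ID2LABEL = {0: "Other", 1: "Moralist", 2: "Realist"}

def label_name(class_id: int) -> str:
    """Return the human-readable label for a predicted class id."""
    return ID2LABEL[class_id]

print(label_name(1))  # Moralist
```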
|
|
|
--- |
|
|
|
## Evaluation Results |
|
|
|
The model was evaluated on a **test set of 938 sentences**, with the following results: |
|
|
|
| Label | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| **0 - Other** | 0.91 | 0.92 | 0.92 | 783 |
| **1 - Moralist** | 0.49 | 0.40 | 0.44 | 65 |
| **2 - Realist** | 0.43 | 0.44 | 0.44 | 90 |
|
|
|
- **Overall accuracy:** **0.84** |
|
- **Macro average F1-score:** **0.60** |
|
- **Weighted average F1-score:** **0.84** |
|
|
|
The model reliably separates the general (Other) class from moralist and realist arguments, but performance on the two minority classes is noticeably lower, which is consistent with their small support in the test set (65 and 90 sentences versus 783).
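The gap between the macro and weighted averages follows directly from this class imbalance. As a quick sanity check, both can be recomputed from the per-class F1 scores and supports reported in the table above:

```python
# Per-class F1 scores and supports, taken from the evaluation table above
f1 = {"Other": 0.92, "Moralist": 0.44, "Realist": 0.44}
support = {"Other": 783, "Moralist": 65, "Realist": 90}

# Macro F1: unweighted mean over the three classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted F1: mean weighted by each class's support
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(round(macro_f1, 2))     # 0.6
print(round(weighted_f1, 2))  # 0.84
```

Because the Other class accounts for 783 of the 938 test sentences, the weighted average is dominated by its strong F1 score, while the macro average exposes the weaker minority-class performance.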
|
|
|
--- |
|
|
|
## Usage |
|
|
|
This model can be used with the **Hugging Face Transformers library**: |
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "uvegesistvan/EGD_distilbert-base-multilingual-cased"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Classify an example text
text = "The European Union has a responsibility towards future generations."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Get predicted class (0 = Other, 1 = Moralist, 2 = Realist)
predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted class: {predicted_class}")
```
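If you want class probabilities rather than a hard prediction, apply a softmax over the logits (with the model loaded as above, `torch.softmax(logits, dim=-1)` does this directly). A minimal pure-Python sketch of the operation, using made-up scores rather than real model output:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw scores for the three classes (not real model output)
probs = softmax([2.0, 0.5, -1.0])
print([round(p, 3) for p in probs])  # [0.786, 0.175, 0.039]
```

The resulting probabilities sum to 1 and preserve the ordering of the logits, so the argmax prediction is unchanged; the probabilities are useful when you want a confidence estimate alongside the predicted class.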