---
library_name: transformers
license: mit
base_model: xlm-roberta-base
tags:
- multilingual
- spam-detection
- cross-lingual
- transformers
- huggingface
- text-classification
- generated_from_trainer
model-index:
- name: XLM-RoBERTa Spam Classifier (EN-HI ➝ DE)
  results: []
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---

# XLM-RoBERTa Spam Classifier (EN-HI ➝ DE)

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) for cross-lingual spam detection. It was trained on **English** and **Hindi** messages and evaluated on **German** samples. The goal is to demonstrate zero-shot transfer in spam/ham classification across languages.

---

## 🧠 Model Description

- **Model Type**: XLM-RoBERTa Base (Transformer encoder)
- **Task**: Binary classification – spam vs. ham
- **Languages**: Trained on English and Hindi, tested on German
- **Tokenizer**: `AutoTokenizer` from `transformers` (`xlm-roberta-base`)
- **Framework**: PyTorch + Hugging Face `Trainer`
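
Because the checkpoint is a standard `transformers` sequence-classification model, the binary label head can be sanity-checked straight from its config. This is only a sketch: the repo id is taken from the Example Usage section below, and whether `id2label` reads `{0: "ham", 1: "spam"}` or the generic `LABEL_0`/`LABEL_1` depends on how the head was configured when the model was saved.

```python
from transformers import AutoConfig

# Hedged sanity check of the classification head's configuration.
config = AutoConfig.from_pretrained("Xtiphyn/Cross-Lingual-Spam-Filter")
print(config.num_labels)  # expected: 2 (ham vs. spam)
print(config.id2label)    # e.g. {0: "ham", 1: "spam"} or {0: "LABEL_0", 1: "LABEL_1"}
```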
---

## Intended Uses & Limitations

### Intended Uses:

- Spam filtering in multilingual messaging systems (see the sketch after this list)
- Research on cross-lingual text classification
- Transfer learning studies involving high/low-resource languages
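
For the messaging-system use case, the simplest integration path is the `text-classification` pipeline. A minimal sketch, assuming the Hub repo id shown in the Example Usage section; the output label strings depend on the model's saved `id2label` mapping, so check them before filtering on `"spam"` literally.

```python
from transformers import pipeline

# Batch-score incoming messages with the hosted model.
# Labels may be "ham"/"spam" or generic LABEL_0/LABEL_1, depending on
# the id2label mapping stored in the model config.
spam_filter = pipeline(
    "text-classification",
    model="Xtiphyn/Cross-Lingual-Spam-Filter",
)

messages = [
    "Hey, are we still on for lunch tomorrow?",
    "Congratulations! You have won a free trip. Click here to claim.",
]
for message, result in zip(messages, spam_filter(messages)):
    print(f"{result['label']} ({result['score']:.2f}): {message}")
```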
### ⚠️ Limitations:

- Trained only on English and Hindi. The metrics reported here are on the German test set; results on French were also promising, but further benchmarking is recommended.
- May underperform on mixed-language, informal, or code-switched inputs.
---

## 📊 Performance

### Evaluation Metrics (on German test set):

| Metric           | Score |
|------------------|-------|
| Accuracy         | 0.99  |
| Precision (ham)  | 1.00  |
| Recall (ham)     | 0.99  |
| F1-score (ham)   | 1.00  |
| Precision (spam) | 0.97  |
| Recall (spam)    | 0.98  |
| F1-score (spam)  | 0.97  |
| Weighted F1      | 0.99  |

### Confusion Matrix:

|                 | Predicted Ham | Predicted Spam |
|-----------------|---------------|----------------|
| **Actual Ham**  | **47956**     | 20             |
| **Actual Spam** | 22            | **7533**       |
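
The per-class numbers above are the standard precision/recall/F1 breakdown, so they can be computed from model predictions with `scikit-learn`. A sketch, assuming `y_true`/`y_pred` hold the gold and predicted labels for the German test set under the 0 = ham, 1 = spam mapping from the Dataset section; the toy arrays below are placeholders, not the real data.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels; replace with the real German test-set annotations
# and the model's predictions (0 = ham, 1 = spam).
y_true = [0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 0, 1]

print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
```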
---

## 📚 Dataset

The dataset is a multilingual corpus with parallel spam/ham messages in:

- English
- Hindi
- German
- French

For this training run:

- **Train set**: English + Hindi
- **Test set**: German

Labels:

- `"ham"` → 0
- `"spam"` → 1
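
A minimal preprocessing sketch matching the label mapping above. The column names (`text`, `label`) and the tiny in-memory examples are assumptions for illustration, not the original corpus.

```python
from datasets import Dataset
from transformers import AutoTokenizer

label2id = {"ham": 0, "spam": 1}

# Toy stand-in for the real English/Hindi training messages.
raw = Dataset.from_dict({
    "text": ["Free entry to win a prize!", "मैं शाम को कॉल करूंगा"],
    "label": ["spam", "ham"],
})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def preprocess(batch):
    # Tokenize the text and attach integer labels expected by the Trainer.
    encoded = tokenizer(batch["text"], truncation=True)
    encoded["labels"] = [label2id[lab] for lab in batch["label"]]
    return encoded

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)
```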
---

## ⚙️ Training Configuration

| Setting       | Value                |
|---------------|----------------------|
| Epochs        | 3                    |
| Batch Size    | 32 (train), 8 (eval) |
| Learning Rate | 3e-5                 |
| Optimizer     | AdamW                |
| Weight Decay  | 0.01                 |
| Scheduler     | Linear               |
| Eval Strategy | Epoch-wise           |
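
The table translates into `TrainingArguments` roughly as follows. This is a sketch under the assumption that no other non-default settings were used; `train_ds` and `eval_ds` are placeholders for the tokenized English+Hindi train split and the German test split, and AdamW plus the linear scheduler are the `Trainer` defaults.

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)

# Hyperparameters mirroring the table above; AdamW and linear decay are
# the Trainer defaults, so they need no explicit flags.
args = TrainingArguments(
    output_dir="xlmr-spam-en-hi",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    eval_strategy="epoch",
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```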
---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned classifier and its tokenizer from the Hub.
model_name = "Xtiphyn/Cross-Lingual-Spam-Filter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# German example: "You have won a free trip!"
inputs = tokenizer("Sie haben eine kostenlose Reise gewonnen!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1).item()

print("Label:", "Spam" if prediction == 1 else "Ham")
```
## 🛠️ Environment

- Transformers: 4.54.0
- PyTorch: 2.6.0+cu124
- Datasets: 4.0.0
- Tokenizers: 0.21.2

## 🚧 Future Work

- Incorporate code-switching and low-resource scripts

Made with ❤️ by Xtiphyn.