---
library_name: transformers
license: mit
base_model: xlm-roberta-base
tags:
- multilingual
- spam-detection
- cross-lingual
- transformers
- huggingface
- text-classification
- generated_from_trainer
model-index:
- name: XLM-Roberta Spam Classifier (EN-HI ➝ DE)
results: []
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---
# XLM-Roberta Spam Classifier (EN-HI ➝ DE)
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) for cross-lingual spam detection. It was trained on **English** and **Hindi** messages, and evaluated on **German** samples. The goal is to demonstrate zero-shot transfer in spam/ham classification across languages.
---
## 🧠 Model Description
- **Model Type**: XLM-RoBERTa Base (Transformer encoder)
- **Task**: Binary classification – spam vs. ham
- **Languages**: Trained on English and Hindi, tested on German
- **Tokenizer**: AutoTokenizer from `transformers` (`xlm-roberta-base`)
- **Framework**: PyTorch + Hugging Face `Trainer`
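
For reference, the classification head follows the standard `AutoModelForSequenceClassification` setup. The snippet below is a minimal sketch of that configuration (an assumption, not the exact training code), with the usual two-label head and explicit `id2label`/`label2id` mappings:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Minimal sketch (assumption, not the exact training code): a two-label
# classification head on top of the multilingual encoder.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=2,
    id2label={0: "ham", 1: "spam"},
    label2id={"ham": 0, "spam": 1},
)
```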
---
## Intended Uses & Limitations
### Intended Uses:
- Spam filtering in multilingual messaging systems
- Research on cross-lingual text classification
- Transfer learning studies involving high/low-resource languages
### ⚠️ Limitations:
- Trained only on English and Hindi. The metrics reported below are on the German test set; results on French were also promising, but further benchmarking on French is recommended.
- May underperform on mixed-language, informal, or code-switched inputs.
---
## 📊 Performance
### Evaluation Metrics (on German test set):
| Metric | Score |
|----------------|-------|
| Accuracy | 0.99 |
| Precision (ham)| 1.00 |
| Recall (ham) | 0.99 |
| F1-score (ham) | 1.00 |
| Precision (spam)| 0.97 |
| Recall (spam) | 0.98 |
| F1-score (spam)| 0.97 |
| Weighted F1 | 0.99 |
### Confusion Matrix:
| | Predicted Ham | Predicted Spam |
|----------------|---------------|----------------|
| **Actual Ham** | **47956** | 20 |
| **Actual Spam**| 22 | **7533** |
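
These per-class figures are the kind produced by `sklearn.metrics.classification_report`. The helper below is a small sketch of how such a report and confusion matrix can be generated from the German test predictions; the variable names are placeholders, not the exact evaluation script:

```python
from sklearn.metrics import classification_report, confusion_matrix

def evaluate_predictions(y_true, y_pred):
    """Print per-class precision/recall/F1 and the confusion matrix (0 = ham, 1 = spam).

    y_true / y_pred are placeholder names for the German test labels and model predictions.
    """
    print(classification_report(y_true, y_pred, target_names=["ham", "spam"], digits=2))
    print(confusion_matrix(y_true, y_pred))
```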
---
## 📚 Dataset
The dataset is a multilingual corpus with parallel spam/ham messages in:
- English
- Hindi
- German
- French
For this training run:
- **Train set**: English + Hindi
- **Test set**: German
Labels:
- `"ham"` → 0
- `"spam"` → 1
---
## ⚙️ Training Configuration
| Setting | Value |
|-----------------------|------------------|
| Epochs | 3 |
| Batch Size | 32 (train), 8 (eval) |
| Learning Rate | 3e-5 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Scheduler | Linear |
| Eval Strategy | Epoch-wise |
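
A `TrainingArguments`/`Trainer` sketch matching this table is shown below. It is not the exact training script: `tokenized_train` and `tokenized_test` stand in for the pre-tokenized EN+HI and German splits, and `model` is the sequence-classification model from the sketch above. AdamW and the linear scheduler are the `Trainer` defaults:

```python
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="xlmr-spam",            # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,                       # the sequence-classification model from above
    args=args,
    train_dataset=tokenized_train,     # placeholder: tokenized EN + HI split
    eval_dataset=tokenized_test,       # placeholder: tokenized German split
)
trainer.train()
```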
---
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Xtiphyn/Cross-Lingual-Spam-Filter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# German spam example: "You have won a free trip!"
inputs = tokenizer("Sie haben eine kostenlose Reise gewonnen!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

prediction = torch.argmax(logits, dim=-1).item()
print("Label:", "Spam" if prediction == 1 else "Ham")
```
---
## 🛠️ Environment
- Transformers: 4.54.0
- PyTorch: 2.6.0+cu124
- Datasets: 4.0.0
- Tokenizers: 0.21.2
---
## 🚧 Future Work
- Incorporate code-switching and low-resource scripts
Made with ❤️ by Xtiphyn.