---
library_name: transformers
license: mit
base_model: xlm-roberta-base
tags:
- multilingual
- spam-detection
- cross-lingual
- transformers
- huggingface
- text-classification
- generated_from_trainer
model-index:
- name: XLM-RoBERTa Spam Classifier (EN-HI ➝ DE)
  results: []
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---

# XLM-RoBERTa Spam Classifier (EN-HI ➝ DE)

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) for cross-lingual spam detection. It was trained on **English** and **Hindi** messages and evaluated on **German** samples. The goal is to demonstrate zero-shot transfer in spam/ham classification across languages.

---

## 🧠 Model Description

- **Model Type**: XLM-RoBERTa Base (Transformer encoder)
- **Task**: Binary classification – spam vs. ham
- **Languages**: Trained on English and Hindi, tested on German
- **Tokenizer**: `AutoTokenizer` from `transformers` (`xlm-roberta-base`)
- **Framework**: PyTorch + Hugging Face `Trainer`
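
Because the checkpoint is a standard `transformers` sequence-classification model, the binary label head can be sanity-checked straight from its config. This is only a sketch: the repo id is taken from the Example Usage section below, and whether `id2label` reads `{0: "ham", 1: "spam"}` or the generic `LABEL_0`/`LABEL_1` depends on how the head was configured when the model was saved.

```python
from transformers import AutoConfig

# Hedged sanity check of the classification head's configuration.
config = AutoConfig.from_pretrained("Xtiphyn/Cross-Lingual-Spam-Filter")
print(config.num_labels)  # expected: 2 (ham vs. spam)
print(config.id2label)    # e.g. {0: "ham", 1: "spam"} or {0: "LABEL_0", 1: "LABEL_1"}
```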
---

## Intended Uses & Limitations

### Intended Uses:

- Spam filtering in multilingual messaging systems (see the sketch after this list)
- Research on cross-lingual text classification
- Transfer learning studies involving high/low-resource languages
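
For the messaging-system use case, the simplest integration path is the `text-classification` pipeline. A minimal sketch, assuming the Hub repo id shown in the Example Usage section; the output label strings depend on the model's saved `id2label` mapping, so check them before filtering on `"spam"` literally.

```python
from transformers import pipeline

# Batch-score incoming messages with the hosted model.
# Labels may be "ham"/"spam" or generic LABEL_0/LABEL_1, depending on
# the id2label mapping stored in the model config.
spam_filter = pipeline(
    "text-classification",
    model="Xtiphyn/Cross-Lingual-Spam-Filter",
)

messages = [
    "Hey, are we still on for lunch tomorrow?",
    "Congratulations! You have won a free trip. Click here to claim.",
]
for message, result in zip(messages, spam_filter(messages)):
    print(f"{result['label']} ({result['score']:.2f}): {message}")
```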
### ⚠️ Limitations:

- Trained only on English and Hindi. The metrics reported here are on the German test set; results on French were also promising, but further benchmarking is recommended.
- May underperform on mixed-language, informal, or code-switched inputs.
---

## 📊 Performance

### Evaluation Metrics (on German test set):

| Metric           | Score |
|------------------|-------|
| Accuracy         | 0.99  |
| Precision (ham)  | 1.00  |
| Recall (ham)     | 0.99  |
| F1-score (ham)   | 1.00  |
| Precision (spam) | 0.97  |
| Recall (spam)    | 0.98  |
| F1-score (spam)  | 0.97  |
| Weighted F1      | 0.99  |

### Confusion Matrix:

|                 | Predicted Ham | Predicted Spam |
|-----------------|---------------|----------------|
| **Actual Ham**  | **47956**     | 20             |
| **Actual Spam** | 22            | **7533**       |
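
The per-class numbers above are the standard precision/recall/F1 breakdown, so they can be computed from model predictions with `scikit-learn`. A sketch, assuming `y_true`/`y_pred` hold the gold and predicted labels for the German test set under the 0 = ham, 1 = spam mapping from the Dataset section; the toy arrays below are placeholders, not the real data.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels; replace with the real German test-set annotations
# and the model's predictions (0 = ham, 1 = spam).
y_true = [0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 0, 1]

print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
```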
---

## 📚 Dataset

The dataset is a multilingual corpus with parallel spam/ham messages in:

- English
- Hindi
- German
- French

For this training run:

- **Train set**: English + Hindi
- **Test set**: German

Labels:

- `"ham"` → 0
- `"spam"` → 1
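
A minimal preprocessing sketch matching the label mapping above. The column names (`text`, `label`) and the tiny in-memory examples are assumptions for illustration, not the original corpus.

```python
from datasets import Dataset
from transformers import AutoTokenizer

label2id = {"ham": 0, "spam": 1}

# Toy stand-in for the real English/Hindi training messages.
raw = Dataset.from_dict({
    "text": ["Free entry to win a prize!", "मैं शाम को कॉल करूंगा"],
    "label": ["spam", "ham"],
})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def preprocess(batch):
    # Tokenize the text and attach integer labels expected by the Trainer.
    encoded = tokenizer(batch["text"], truncation=True)
    encoded["labels"] = [label2id[lab] for lab in batch["label"]]
    return encoded

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)
```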
---

## ⚙️ Training Configuration

| Setting       | Value                |
|---------------|----------------------|
| Epochs        | 3                    |
| Batch Size    | 32 (train), 8 (eval) |
| Learning Rate | 3e-5                 |
| Optimizer     | AdamW                |
| Weight Decay  | 0.01                 |
| Scheduler     | Linear               |
| Eval Strategy | Epoch-wise           |
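
The table translates into `TrainingArguments` roughly as follows. This is a sketch under the assumption that no other non-default settings were used; `train_ds` and `eval_ds` are placeholders for the tokenized English+Hindi train split and the German test split, and AdamW plus the linear scheduler are the `Trainer` defaults.

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)

# Hyperparameters mirroring the table above; AdamW and linear decay are
# the Trainer defaults, so they need no explicit flags.
args = TrainingArguments(
    output_dir="xlmr-spam-en-hi",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    eval_strategy="epoch",
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```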
---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned classifier and its tokenizer from the Hub.
model_name = "Xtiphyn/Cross-Lingual-Spam-Filter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# German example: "You have won a free trip!"
inputs = tokenizer("Sie haben eine kostenlose Reise gewonnen!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1).item()

print("Label:", "Spam" if prediction == 1 else "Ham")
```
## 🛠️ Environment

- Transformers: 4.54.0
- PyTorch: 2.6.0+cu124
- Datasets: 4.0.0
- Tokenizers: 0.21.2

## 🚧 Future Work

- Incorporate code-switching and low-resource scripts

Made with ❤️ by Xtiphyn.