# TinyBERT Spam Classifier (Enron)

A compact TinyBERT (4-layer, 312-hidden) model fine-tuned to classify email text as spam or ham.
Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs).
Optimized for low false positives by default; adjust the decision threshold if you want higher spam recall.

Labels: `ham` (0) and `spam` (1)
## ✨ Quick Start

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True,  # recommended for long emails
)

clf("Congratulations! You won a FREE iPhone. Click here now!")
# [{'label': 'spam', 'score': 0.98}]
```
### Batch inference

```python
texts = [
    "Meeting moved to 3pm, see agenda attached.",
    "FREE gift card!!! Act now!",
]
preds = clf(texts, truncation=True)
```
## 📋 Intended Use & Limitations

### Intended use

- Classifying email bodies (and optionally subject + body) as spam vs. ham.
- Low-latency scenarios where a small model is preferred.

### Out of scope / Limitations

- Non-English email content may reduce accuracy.
- Long threads with heavy quoting/footers can dilute the signal (use truncation plus cleaning).
- Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).
## 🧰 How We Preprocessed the Data

Light normalization aimed at keeping semantic content:

- Remove long base64-like blobs.
- Drop quoted lines starting with `>` or `|`.
- Optional: concatenate `Subject + "\n" + Message` when available.
- Collapse repeated whitespace.

(You can replicate similar cleaning in your serving pipeline for alignment; see the sketch below.)
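A minimal sketch of this cleaning in Python follows. The function name and the regex patterns are illustrative assumptions, not the exact training-time code:

```python
import re


def clean_email(message: str, subject: str | None = None) -> str:
    """Illustrative cleaning pass approximating the steps above (assumption)."""
    text = f"{subject}\n{message}" if subject else message
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop quoted reply lines starting with '>' or '|'.
        if stripped.startswith((">", "|")):
            continue
        # Drop long base64-like blobs (long runs of base64-alphabet characters).
        if re.fullmatch(r"[A-Za-z0-9+/=]{60,}", stripped):
            continue
        kept.append(line)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", "\n".join(kept)).strip()
```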
## 🏋️ Training Details

- Base model: `huawei-noah/TinyBERT_General_4L_312D`
- Task: binary text classification (`ham` = 0, `spam` = 1)
- Tokenizer: fast BERT tokenizer (uncased)
- Max length: 256 tokens
- Optimizer / LR: AdamW, learning rate 2e-5 to 5e-5 (final run: 3e-5)
- Batch size: 32
- Epochs: 4 (early stopping enabled)
- Warmup: 10%
- Weight decay: 0.01
- Loss: cross-entropy with class weighting (ham/spam balanced from the label distribution); focal loss is available in the trainer.
- Early stopping metric: `eval_f1`
- Best checkpoint: saved based on evaluation on the validation set.

Trainer script: `train/train_tinybert.py` (TinyBERT-compatible, with legacy HF support shims).
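For reference, class-weighted cross-entropy can be plugged into a 🤗 `Trainer` subclass along these lines. This is a hedged sketch using the standard `compute_loss` override, not the exact code in `train/train_tinybert.py`:

```python
import torch
from torch import nn
from transformers import Trainer


class WeightedLossTrainer(Trainer):
    """Cross-entropy with class weights (illustrative sketch)."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # e.g. torch.tensor([n_total / (2 * n_ham), n_total / (2 * n_spam)])
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        weight = (
            self.class_weights.to(outputs.logits.device)
            if self.class_weights is not None
            else None
        )
        loss = nn.CrossEntropyLoss(weight=weight)(
            outputs.logits.view(-1, 2), labels.view(-1)
        )
        return (loss, outputs) if return_outputs else loss
```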
## 📊 Evaluation (Chunked Benchmark Summary)

The metrics below reflect a chunked evaluation pass (used for long emails), where the model sees up to 512 tokens per chunk with overlap, and with the decision threshold tuned to minimize false positives.

### Classification report

| Class     | Precision | Recall | F1     |
|-----------|-----------|--------|--------|
| ham       | 0.6875    | 0.9973 | 0.8139 |
| spam      | 0.9954    | 0.5632 | 0.7194 |
| macro avg | 0.8414    | 0.7802 | 0.7666 |

- ROC-AUC: 0.9977

### Confusion matrix

```
[[16500    45]
 [ 7500  9671]]
```

Interpretation: the model is conservative (very few false positives on ham). If you need to catch more spam, lower the decision threshold (e.g., from 0.5 to 0.35) or re-train with a spam-skewed class weight / focal loss.
## 🎚️ Threshold & Long-Email Guidance

- Threshold: the default is 0.5. For higher spam recall, try 0.35–0.45 and evaluate the impact on false positives.
- Long emails: for multi-paragraph threads, consider chunking and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap. A sketch of both points follows below.
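As a hedged illustration, the following sketch scores overlapping 512-token chunks and applies a custom threshold to the max chunk score. The function name, the max aggregation, and the 0.40 threshold are assumptions for demonstration, not shipped defaults:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "prancyFox/tiny-bert-enron-spam"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()


def spam_score(text: str, chunk_size: int = 512, stride: int = 128) -> float:
    """Max spam probability over overlapping chunks (illustrative aggregation)."""
    enc = tok(
        text,
        truncation=True,
        max_length=chunk_size,
        stride=stride,                    # overlap between consecutive chunks
        return_overflowing_tokens=True,   # one row per chunk
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(
            input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]
        ).logits
    spam_probs = logits.softmax(dim=-1)[:, 1]  # assumes label id 1 == spam
    return spam_probs.max().item()


THRESHOLD = 0.40  # lowered from the 0.5 default for higher recall (assumption)
email_text = "FREE gift card!!! Act now!"
label = "spam" if spam_score(email_text) >= THRESHOLD else "ham"
```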
## 🧪 Reproducibility

### Environment

- Python 3.10/3.11
- `transformers >= 4.40`
- `datasets >= 2.20`
- `evaluate >= 0.4.2`
- `torch >= 2.1`
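One way to install matching versions (an illustrative command; a platform-appropriate torch wheel is assumed):

```bash
pip install "transformers>=4.40" "datasets>=2.20" "evaluate>=0.4.2" "torch>=2.1"
```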
### Training command (example)

```bash
python train/train_tinybert.py \
  --train data/enron.csv \
  --text_col Message --label_col "Spam/Ham" \
  --output_dir outputs/tiny-bert-enron-spam \
  --epochs 4 --batch_size 32 --lr 3e-5 \
  --max_length 256 --fp16
```
### Serving (FastAPI example)

```bash
python spam_bert.py --serve \
  --model prancyFox/tiny-bert-enron-spam \
  --model-cache-dir ./models_cache
```
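If you prefer to wire up your own endpoint instead of `spam_bert.py`, a minimal FastAPI sketch could look like this. The route name and request schema are illustrative assumptions, not the `spam_bert.py` API:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True,
)


class EmailIn(BaseModel):
    text: str


@app.post("/classify")  # illustrative route (assumption)
def classify(email: EmailIn):
    pred = clf(email.text)[0]
    return {"label": pred["label"], "score": float(pred["score"])}
```

Run it with, e.g., `uvicorn app:app` (module name assumed).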
## 📁 Files

This repo should include:

- `config.json`
- `pytorch_model.bin` or `model.safetensors`
- `tokenizer.json` and `tokenizer_config.json` (or `vocab.txt`, etc.)
- `README.md` (this file)
- (Optional) `label_mapping.json` with `{"ham": 0, "spam": 1}`
## ⚖️ License

- Model weights & code: MIT
- Dataset: check the original Enron dataset/license terms before redistribution.
## 🔬 Ethical Considerations & Risks

- False positives can have operational cost (missed legitimate emails). This model is tuned to minimize them; if you change the threshold, validate carefully.
- Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
- Non-English or code-mixed content may degrade performance.
## 🧩 Citation

If you use this model, please cite:

```bibtex
@software{tinybert_enron_spam_2025,
  title  = {TinyBERT Spam Classifier (Enron)},
  author = {Ing. Daniel Eder},
  year   = {2025},
  url    = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
}
```

And the TinyBERT paper:

```bibtex
@article{jiao2020tinybert,
  title   = {TinyBERT: Distilling BERT for Natural Language Understanding},
  author  = {Jiao, Xiaoqi and Yin, Yichun and others},
  journal = {Findings of EMNLP},
  year    = {2020}
}
```
## 👥 Maintainers

- Daniel Eder ([email protected])
## Notes

- For a higher-recall variant, fine-tune with `--use_focal_loss` or increase the spam class weight, then re-evaluate thresholds. A common focal-loss formulation is sketched below.
- If you want a PyTorch Lightning or Accelerate training variant, it's easy to adapt the provided trainer.
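For reference, the standard focal-loss formulation (Lin et al., 2017) looks like the following; the internals behind `--use_focal_loss` may differ in details such as the default `gamma`:

```python
import torch
from torch import nn


def focal_loss(
    logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0
) -> torch.Tensor:
    """Focal loss: down-weights easy examples so hard ones dominate the gradient."""
    ce = nn.functional.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # model's probability for the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```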