TinyBERT Spam Classifier (Enron)

A compact TinyBERT (4-layer, 312 hidden) model fine-tuned to classify email text as spam or ham.
Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs).
Optimized for low false positives by default; adjust the decision threshold if you want higher spam recall.

Labels: ham (0) and spam (1)


✨ Quick Start

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True  # recommended for long emails
)

clf("Congratulations! You won a FREE iPhone. Click here now!")
# [{'label': 'spam', 'score': 0.98}]

Batch inference

texts = [
    "Meeting moved to 3pm, see agenda attached.",
    "FREE gift card!!! Act now!",
]
preds = clf(texts, truncation=True)

🔎 Intended Use & Limitations

Intended use

  • Classifying email bodies (and optionally subject+body) as spam vs ham.
  • Low-latency scenarios where a small model is preferred.

Out of scope / Limitations

  • Non-English email content may reduce accuracy.
  • Long threads with heavy quoting/footers can dilute signal (use truncation + cleaning).
  • Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).

🧰 How We Preprocessed the Data

Light normalization aimed at keeping semantic content:

  • Remove long base64-like blobs.
  • Drop quoted lines starting with > or |.
  • Optional: concatenate Subject + "\n" + Message when available.
  • Collapse repeated whitespace.

(You can replicate similar cleaning in your serving pipeline for alignment.)
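The steps above can be approximated at serving time with a small helper. This is a minimal sketch, not the exact cleaning used for training: the 80-character blob cutoff, the tolerance for leading whitespace before > or |, and the function name clean_email are all assumptions made for illustration.

```python
import re

# Assumed heuristic: a run of 80+ base64-alphabet characters counts as a blob.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{80,}")


def clean_email(message, subject=None):
    """Approximate the cleaning listed above (hypothetical helper)."""
    # Drop quoted lines starting with > or | (leading spaces tolerated; an assumption).
    kept = [
        line for line in message.splitlines()
        if not line.lstrip().startswith((">", "|"))
    ]
    text = "\n".join(kept)
    # Replace long base64-like blobs with a single space.
    text = BASE64_BLOB.sub(" ", text)
    # Optionally prepend the subject, separated by a newline.
    if subject:
        text = subject + "\n" + text
    # Collapse runs of spaces/tabs, then runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n", text)
    return text.strip()
```

Calling clean_email(body, subject=subj) before passing text to the pipeline keeps serving inputs closer to what the model saw in training.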


🏋️ Training Details

  • Base model: huawei-noah/TinyBERT_General_4L_312D
  • Task: Binary text classification (ham=0, spam=1)
  • Tokenizer: fast BERT tokenizer (uncased)
  • Max length: 256 tokens
  • Optimizer / LR: AdamW, learning rate 2e-5 – 5e-5 (final run 3e-5)
  • Batch size: 32
  • Epochs: 4 (early stopping enabled)
  • Warmup: 10%
  • Weight decay: 0.01
  • Loss: Cross-entropy with class weighting (ham/spam balanced from label distribution). Focal loss available in the trainer.
  • Early stopping metric: eval_f1
  • Best checkpoint: Saved using evaluation on validation set.

Trainer script: train/train_tinybert.py (TinyBERT-compatible, with legacy HF support shims).
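To make the loss setting concrete, here is a dependency-free sketch of class-weighted cross-entropy and a focal variant for a single example. It is not taken from train/train_tinybert.py: the "balanced" weight formula n / (k * count) is an assumption borrowed from common practice, the focal form is the standard (1 - p)^gamma modulation, and all function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n / (k * count); assumed 'balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * counts[cls]) for cls in counts}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    p_true = softmax(logits)[label]
    return -weights[label] * math.log(p_true)


def focal_loss(logits, label, weights, gamma=2.0):
    """Focal variant: down-weights examples the model already gets right."""
    p_true = softmax(logits)[label]
    return -weights[label] * (1.0 - p_true) ** gamma * math.log(p_true)


# Toy label distribution: three ham (0), one spam (1).
weights = balanced_class_weights([0, 0, 0, 1])
loss = weighted_cross_entropy([0.2, -0.1], label=1, weights=weights)
```

With gamma set to 0 the focal loss reduces to the weighted cross-entropy, which is a quick sanity check when switching between the two.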


📊 Evaluation (Chunked Benchmark Summary)

The metrics below come from a chunked evaluation pass (used for long emails), in which the model sees up to 512 tokens per chunk, with overlap between chunks. The decision threshold was tuned to minimize false positives.

Classification Report

Class       Precision   Recall   F1
ham         0.6875      0.9973   0.8139
spam        0.9954      0.5632   0.7194
macro avg   0.8414      0.7802   0.7666

  • ROC-AUC: 0.9977

Confusion matrix

[[16500    45]
 [ 7500  9671]]

Interpretation: The model is conservative. Only 45 of 16,545 ham emails are flagged as spam, but 7,500 of 17,171 spam emails are missed. If you need to catch more spam, lower the decision threshold (e.g., from 0.5 → 0.35) or re-train with a spam-skewed class weight / focal loss.
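The figures in the classification report can be re-derived from the confusion matrix. The sketch below assumes the usual layout, with rows as true labels and columns as predicted labels (ham first, spam second); the function name is hypothetical.

```python
def report_from_confusion(cm):
    """cm[i][j] = count of examples with true class i predicted as class j."""
    (tn, fp), (fn, tp) = cm

    def prf(hits, false_alarms, misses):
        precision = hits / (hits + false_alarms)
        recall = hits / (hits + misses)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # For ham, "hits" are true negatives and the error roles swap.
    ham = prf(tn, fn, fp)
    spam = prf(tp, fp, fn)
    macro = tuple((h + s) / 2 for h, s in zip(ham, spam))
    return {"ham": ham, "spam": spam, "macro avg": macro}


report = report_from_confusion([[16500, 45], [7500, 9671]])
for name, (p, r, f1) in report.items():
    print(f"{name:<10} {p:.4f} {r:.4f} {f1:.4f}")
```

Under that layout the printed values match the report to four decimals, which also confirms that the low spam recall comes from the 7,500 missed spam emails rather than from false alarms on ham.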


🎛️ Threshold & Long-Email Guidance

  • Threshold: Default is 0.5. For higher spam recall, try 0.35–0.45 and evaluate impact on false positives.
  • Long emails: For multi-paragraph threads, consider chunking and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap.
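The chunk-and-aggregate idea can be sketched without any model dependency. This is an illustration, not the reference app's code: the overlap of 64 tokens is an assumed value (the card only says "with overlap"), the default of 510 leaves room for [CLS] and [SEP] on the assumption that chunks are built from raw token ids, and score_chunk is a hypothetical callable that returns a spam probability for one chunk.

```python
def chunk_token_ids(token_ids, max_len=510, overlap=64):
    """Split token ids into windows of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, max(len(token_ids), 1), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the email
    return chunks


def classify_long_email(token_ids, score_chunk, threshold=0.5, agg="max",
                        max_len=510, overlap=64):
    """Score each chunk, aggregate the spam scores, and apply the threshold."""
    chunks = chunk_token_ids(token_ids, max_len, overlap)
    scores = [score_chunk(chunk) for chunk in chunks]
    if agg == "max":
        spam_score = max(scores)
    else:  # any other value falls back to the mean
        spam_score = sum(scores) / len(scores)
    label = "spam" if spam_score >= threshold else "ham"
    return label, spam_score
```

Max aggregation flags an email if any single chunk looks like spam, which favors recall; averaging is more conservative and tends to keep false positives low on long legitimate threads. Either choice should be re-validated together with the threshold.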

🧪 Reproducibility

Environment

  • Python 3.10/3.11
  • transformers >= 4.40
  • datasets >= 2.20
  • evaluate >= 0.4.2
  • torch >= 2.1

Training command (example)

python train/train_tinybert.py \
  --train data/enron.csv \
  --text_col Message --label_col "Spam/Ham" \
  --output_dir outputs/tiny-bert-enron-spam \
  --epochs 4 --batch_size 32 --lr 3e-5 \
  --max_length 256 --fp16

Serving (FastAPI example)

python spam_bert.py --serve \
  --model prancyFox/tiny-bert-enron-spam \
  --model-cache-dir ./models_cache

📁 Files

This repo should include:

  • config.json
  • pytorch_model.bin or model.safetensors
  • tokenizer.json and tokenizer_config.json (or vocab.txt etc.)
  • README.md (this file)
  • (Optional) label_mapping.json with {"ham": 0, "spam": 1}

⚖️ License

  • Model weights & code: MIT
  • Dataset: Check the original Enron dataset/license terms before redistribution.

🔬 Ethical Considerations & Risks

  • False positives have an operational cost: legitimate emails get missed. This model is tuned to minimize them; if you change the threshold, validate carefully.
  • Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
  • Non-English or code-mixed content may degrade performance.

🧩 Citation

If you use this model, please cite:

@software{tinybert_enron_spam_2025,
  title        = {TinyBERT Spam Classifier (Enron)},
  author       = {Ing. Daniel Eder},
  year         = {2025},
  url          = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
}

And the TinyBERT paper:

@article{jiao2020tinybert,
  title={TinyBERT: Distilling BERT for Natural Language Understanding},
  author={Jiao, Xiaoqi and Yin, Yichun and others},
  journal={Findings of EMNLP},
  year={2020}
}

🛠 Maintainers


Notes

  • For a higher-recall variant, fine-tune with --use_focal_loss or increase the spam class weight, then re-evaluate thresholds.
  • If you want a PyTorch Lightning or Accelerate training variant, it's easy to adapt the provided trainer.