0408happyfeet/distilbert-superhero-universe-classifier
Model description
This repository hosts DistilBERT (distilbert-base-uncased) fine‑tuned on a classmate's dataset, rlogh/superhero-texts, to classify short superhero descriptions by publisher into four classes: DC, Marvel, Dark Horse, and Image.
Intended uses & limitations
- Intended uses: quick labeling of short superhero‑related texts by publisher (DC, Marvel, Dark Horse, Image); course demonstrations and tutorials.
- Limitations / not intended uses:
  - Safety‑critical decisions or any use demanding exhaustive factual correctness.
  - Domains outside superheroes; the model may rely on explicit named entities (e.g., “Gotham”, “Avengers”) and fail on neutral/ambiguous text.
  - Small, student‑authored dataset ⇒ potential stylistic bias and limited generalization.
Training and evaluation data
We use rlogh/superhero-texts with two splits:
- augmented (~1.1k rows): split into train/val/test = 80/10/10 (stratified).
- original (100 rows): optionally used as an external/OOD evaluation set to gauge generalization.
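The card does not say which library call produced the stratified split. As a rough illustration of the idea only, the sketch below splits indices class by class so each publisher keeps roughly the same share in train, validation, and test. The function name `stratified_split` and the toy label list are hypothetical and do not reflect the real class counts.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Hypothetical labels for illustration only (not the dataset's real distribution).
toy_labels = ["DC"] * 40 + ["Marvel"] * 40 + ["Image"] * 10 + ["Dark horse"] * 10
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With very small classes, rounding can leave a split with zero examples of a class, which is one reason minority-class metrics on a 10% test split are noisy.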
Labels are mapped to integers via {'DC': 0, 'Dark horse': 1, 'Image': 2, 'Marvel': 3}.
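A minimal sketch of working with this mapping follows. The helper names `encode` and `decode` are hypothetical. Note that the dataset label is spelled 'Dark horse' (lowercase h), so lookups need that exact string. Dictionaries like these can typically be passed as `label2id` and `id2label` when loading a sequence-classification model so that predictions carry readable names; whether the training notebook did so is not stated in this card.

```python
# Mapping as stated in the card; id2label is simply its inverse.
label2id = {"DC": 0, "Dark horse": 1, "Image": 2, "Marvel": 3}
id2label = {idx: name for name, idx in label2id.items()}


def encode(names):
    """Convert publisher names to integer class ids."""
    return [label2id[name] for name in names]


def decode(ids):
    """Convert integer class ids back to publisher names."""
    return [id2label[i] for i in ids]


print(encode(["Marvel", "Dark horse"]))
print(decode([0, 2]))
```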
Preprocessing
- Tokenizer: distilbert-base-uncased (uncased); truncation to max_length=256; dynamic padding with DataCollatorWithPadding.
- Minimal text normalization; no additional cleaning.
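To illustrate what truncation plus dynamic padding means, here is a plain-Python sketch that is not the library implementation. Each sequence is cut to at most `max_length` tokens, then every sequence in a batch is padded only up to the longest one in that batch rather than to a fixed 256. The function name, the toy token ids, and the pad id of 0 are assumptions for illustration; a real tokenizer also keeps the closing special token when truncating, which this sketch ignores.

```python
def truncate_and_pad(batch, max_length=256, pad_id=0):
    """Truncate each sequence, then pad to the longest sequence in this batch."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids, made up for illustration.
toy_batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = truncate_and_pad(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padded width follows the batch, batches of short descriptions cost less compute than padding everything to 256 tokens.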
Training procedure
Training hyperparameters
- Seed: 42; Epochs: 4; LR: 2e-05; Per‑device batch size: 16
- Optimizer/scheduler: Trainer defaults (AdamW with a linear schedule). Early stopping, if used, with patience = 2.
- Strategies: evaluation_strategy='epoch' (named eval_strategy in recent transformers releases), save_strategy='epoch', load_best_model_at_end=True.
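As a back-of-the-envelope view of these settings, the sketch below estimates the number of optimizer steps and the learning rate under a linear decay. It assumes about 880 training rows (80% of roughly 1.1k), a single device, no gradient accumulation, and no warmup; none of these is confirmed by the card, and `linear_lr` is a hypothetical helper rather than a library function.

```python
import math


def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate under linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


train_rows = 880       # assumption: 80% of ~1.1k augmented rows
batch_size = 16        # per-device batch size from the card
epochs = 4             # from the card

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the run is only a few hundred steps long, so per-epoch evaluation gives just four checkpoints to choose from when loading the best model at the end.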
Hardware & compute
- Hardware: Tesla T4
- Precision: bf16=False
- Libraries: transformers 4.56.2, datasets 4.1.1, evaluate 0.4.6, accelerate 1.10.1, huggingface_hub 0.35.1
- Python: 3.12.11; PyTorch: 2.8.0+cu126
- Trainable parameters: 66,956,548
Evaluation results
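- Augmented test: accuracy 0.9818, precision 0.6835, recall 0.6875, F1 0.6855. Precision, recall, and F1 are consistent with macro averages over the four classes in the confusion matrix below, so the zero score on Dark horse (one test example, never predicted) pulls them far below accuracy.
- External/OOD (original split): accuracy 1.0000, precision 1.0000, recall 1.0000, F1 1.0000.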
Confusion Matrix (Augmented Test)
| true\pred | pred=DC | pred=Dark horse | pred=Image | pred=Marvel |
|---|---|---|---|---|
| true=DC | 44 | 0 | 0 | 0 |
| true=Dark horse | 0 | 0 | 1 | 0 |
| true=Image | 0 | 0 | 3 | 1 |
| true=Marvel | 0 | 0 | 0 | 61 |
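The headline metrics can be recomputed from this table. The sketch below assumes the reported precision, recall, and F1 are macro averages and that an undefined precision (no Dark horse predictions) is counted as 0; the card does not state either choice, but the numbers match under these assumptions.

```python
labels = ["DC", "Dark horse", "Image", "Marvel"]
# Rows are true classes, columns are predicted classes (copied from the table above).
cm = [
    [44, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 3, 1],
    [0, 0, 0, 61],
]


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


n = len(labels)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

precisions, recalls, f1s = [], [], []
for i in range(n):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(n))  # column sum
    actual = sum(cm[i])                          # row sum
    p = safe_div(tp, predicted)
    r = safe_div(tp, actual)
    precisions.append(p)
    recalls.append(r)
    f1s.append(safe_div(2 * p * r, p + r))

macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1s) / n
print(round(accuracy, 4), round(macro_precision, 4), round(macro_recall, 4), round(macro_f1, 4))
```

With only one Dark horse and four Image examples in the test split, a single prediction moves the macro scores by several points, so they should be read with that in mind.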
Brief error analysis
On the augmented test set (2 / 110 errors; 98.18% accuracy), mistakes cluster in the minority classes and in texts with heavy misspellings or cross‑franchise references. Two concrete examples:
- True: Dark Horse → Pred: Image. Text: "Hellboy als ohas some pretty well made moies. I was introduced to him through the movies nd seeing the way that he cares for Humanity even though he's not necessarily human is very compelling." Hypothesis: the strong cue “Hellboy” is present, but minority‑class underrepresentation and heavy typos (e.g., als ohas, moies) weaken the signal; the model over‑indexes on generic superhero vocabulary it associates with Image.
- True: Image → Pred: Marvel. Text: "Atom Eve also has a very nuique power set. Similar to Firestormshe can change the atoms of inaniamte objects but not living biegs. for ilmitations don't seem to exist wen shes in a life or death state thoug.h" Hypothesis: severe misspellings (nuique, inaniamte, Firestormshe) and a cross‑franchise mention (“Firestorm”, a DC character) blur the publisher signal; generic power‑set language drifts toward the majority Marvel class.
Known hard cases: (i) minority publishers (Dark Horse, Image), (ii) texts without explicit publisher/franchise cues such as “Gotham,” “Avengers,” or “Invincible”, (iii) heavy misspellings and concatenations that break tokenization (“Firestormshe”).
Next steps: add spell‑noise augmentation, include more Dark Horse and Image examples (e.g., B.P.R.D., Invincible universe terms), and consider a class‑balanced loss.
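One possible form of spell‑noise augmentation is sketched below: it randomly swaps adjacent letters, which resembles typos such as "nuique" in the examples above. This is an illustrative assumption about what such augmentation could look like, not something used to train this model; `add_spell_noise` and its parameters are hypothetical.

```python
import random


def add_spell_noise(text, swap_prob=0.1, seed=None):
    """Randomly swap adjacent letters to imitate transposition typos."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most one position
        else:
            i += 1
    return "".join(chars)


original = "Atom Eve can change the atoms of inanimate objects."
print(add_spell_noise(original, swap_prob=0.15, seed=42))
```

Applying the noise only to training copies, while keeping the clean text as well, would let the model see both spellings of the same franchise cues.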
How to use
from transformers import pipeline
clf = pipeline("text-classification", model="0408happyfeet/distilbert-superhero-universe-classifier")
print(clf([
"Bruce patrols Gotham as part of the Justice League.",
"Thor of Asgard fights with the Avengers."
]))
Limitations & ethical considerations
Narrow domain; not suitable for sensitive or consequential uses. May encode bias toward named entities and author style. Users should validate outputs on their own data distributions.
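A minimal sketch of such a validation step follows, assuming you have your own labeled texts and have already run them through the classifier. It takes pipeline‑style outputs (a list of dicts with "label" and "score") and gold labels, and reports accuracy plus the indices that disagree. The helper name and the sample values are hypothetical, and it assumes the returned label strings match your gold label strings.

```python
def compare_with_gold(predictions, gold_labels):
    """Return (accuracy, mismatch_indices) for pipeline-style predictions."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    mismatches = [
        i
        for i, (pred, gold) in enumerate(zip(predictions, gold_labels))
        if pred["label"] != gold
    ]
    accuracy = 1 - len(mismatches) / len(gold_labels) if gold_labels else 0.0
    return accuracy, mismatches


# Hypothetical outputs and gold labels, for illustration only.
sample_predictions = [
    {"label": "DC", "score": 0.99},
    {"label": "Marvel", "score": 0.97},
    {"label": "Marvel", "score": 0.55},
]
sample_gold = ["DC", "Marvel", "Image"]

sample_accuracy, sample_mismatches = compare_with_gold(sample_predictions, sample_gold)
print(sample_accuracy, sample_mismatches)
```

Reading the mismatched texts directly is often more informative than the accuracy figure, especially for the minority publishers.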
License
Apache‑2.0 (dataset and model).
AI usage disclosure
This training notebook and model card were authored by the student with assistance from ChatGPT for boilerplate, diagnostics, and documentation improvements. Final decisions, hyperparameters, and evaluation were reviewed by the student.
