0408happyfeet/distilbert-superhero-universe-classifier

Model description

This repository hosts DistilBERT (distilbert-base-uncased) fine‑tuned on a classmate's dataset, rlogh/superhero-texts, to classify short superhero descriptions by publisher into four classes: DC, Marvel, Dark Horse, and Image.

Intended uses & limitations

Intended uses: quick labeling of short superhero‑related texts by publisher (DC, Marvel, Dark Horse, Image); course demonstrations and tutorials.
Limitations / not intended uses:
- Safety‑critical decisions or any use demanding exhaustive factual correctness.
- Domains outside superheroes; the model may rely on explicit named entities (e.g., “Gotham”, “Avengers”) and fail on neutral/ambiguous text.
- Small, student‑authored dataset ⇒ potential stylistic bias and limited generalization.

Training and evaluation data

We use rlogh/superhero-texts with two splits:

augmented (~1.1k rows): split into train/val/test = 80/10/10 (stratified).
original (100 rows): used optionally as an external/OOD evaluation set to gauge generalization.
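
The 80/10/10 stratified split can be sketched without any library as below. This is an illustration under assumptions, not the notebook's actual code: the card does not say which utility performed the split, the helper name `stratified_split` is hypothetical, and the toy labels are invented. The seed of 42 mirrors the training seed listed in this card.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that keep per-label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)  # shuffle within each class before slicing
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy labels, invented for illustration only (not the real dataset).
toy_labels = ["DC"] * 40 + ["Marvel"] * 50 + ["Image"] * 10
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

Splitting inside each class keeps rare publishers represented in every split, although a class with very few rows can still end up with no validation or test examples.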

Labels are mapped to integers via {'DC': 0, 'Dark horse': 1, 'Image': 2, 'Marvel': 3}.
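
A small sketch of how this mapping can be used in both directions. The dictionary is copied from the line above; the helper names `encode` and `decode` are hypothetical. The ids happen to follow Python's default sort order of the label strings, which is consistent with (but not proof of) the mapping having been built from sorted label names.

```python
# Mapping copied from the card; note the dataset spells the label "Dark horse".
label2id = {"DC": 0, "Dark horse": 1, "Image": 2, "Marvel": 3}
id2label = {idx: name for name, idx in label2id.items()}


def encode(publisher):
    """Turn a publisher name into its integer class id."""
    return label2id[publisher]


def decode(class_id):
    """Turn an integer class id back into the publisher name."""
    return id2label[class_id]


# The ids match the position of each name in sorted order.
ids_follow_sorted_order = [label2id[name] for name in sorted(label2id)] == [0, 1, 2, 3]
```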

Preprocessing

Tokenizer: distilbert-base-uncased (lowercases input); truncation to max_length=256; dynamic padding with DataCollatorWithPadding.
Minimal text normalization; no additional cleaning.
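
To make "truncation plus dynamic padding" concrete, here is a library-free illustration of the idea. It is not the transformers implementation: the function name `pad_batch` is hypothetical, the token ids are made up, the padding id of 0 is an assumption, and the truncation is a plain slice (a real tokenizer also keeps the closing special token).

```python
def pad_batch(batch_token_ids, max_length=256, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch_token_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids: two short texts and one that exceeds 256 tokens.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], list(range(1, 301))])
short_batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
```

Padding to the longest sequence in each batch, rather than always to 256, is what keeps batches of short descriptions small.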

Training procedure

Training hyperparameters

Seed: 42; Epochs: 4; LR: 2e-05; Per‑device batch size: 16
Optimizer/scheduler: Trainer defaults (AdamW with a linear learning-rate schedule). Early stopping, if enabled, used patience = 2.
Strategies: eval_strategy='epoch' (spelled evaluation_strategy in older transformers releases), save_strategy='epoch', load_best_model_at_end=True.
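
The shape of a linear learning-rate schedule can be sketched in a few lines. This is an illustration under stated assumptions, not the Trainer's code: zero warmup steps, no gradient accumulation, a single GPU, and about 55 optimizer steps per epoch (roughly 880 training rows at batch size 16, derived from the approximate "~1.1k rows" figure). The function name `linear_lr` is hypothetical.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 55  # assumption: ~880 training rows / batch size 16
total_steps = 4 * steps_per_epoch  # 4 epochs, as listed above
schedule = [linear_lr(step, total_steps) for step in range(total_steps + 1)]
```

Under these assumptions the rate starts at 2e-05, is halved by the end of epoch 2, and reaches zero at the final step.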

Hardware & compute

Hardware: Tesla T4
Precision: bf16=False
Libraries: transformers 4.56.2, datasets 4.1.1, evaluate 0.4.6, accelerate 1.10.1, huggingface_hub 0.35.1
Python: 3.12.11 ; PyTorch: 2.8.0+cu126
Trainable parameters: 66,956,548
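
The parameter count can be reproduced by hand. The sketch below assumes the standard distilbert-base-uncased configuration (vocabulary 30,522, 512 positions, width 768, feed-forward width 3,072, 6 layers) plus a sequence-classification head made of one 768-to-768 layer and one 768-to-4 output layer; these dimensions are not stated in this card.

```python
# Assumed dimensions of distilbert-base-uncased with a 4-class head.
vocab, max_pos, dim, hidden, layers, num_labels = 30522, 512, 768, 3072, 6, 4


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # scale and shift vectors

# Word embeddings + position embeddings + embedding LayerNorm.
embeddings = vocab * dim + max_pos * dim + layer_norm

# One transformer block: q, k, v, output projections, two LayerNorms, two FFN layers.
block = (4 * linear(dim, dim) + layer_norm
         + linear(dim, hidden) + linear(hidden, dim) + layer_norm)

# Classification head: pre-classifier layer plus output layer.
head = linear(dim, dim) + linear(dim, num_labels)

total = embeddings + layers * block + head
print(f"{total:,}")
```

Under these assumptions the total comes to 66,956,548, matching the figure above; only 593,668 of those parameters belong to the classification head.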

Evaluation results

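Augmented test: accuracy 0.9818; macro-averaged precision 0.6835, recall 0.6875, F1 0.6855. The macro averages sit well below accuracy because Dark Horse has a single test example, which was misclassified, so that class contributes 0 to each average.
External/OOD (original split): accuracy 1.0000, precision 1.0000, recall 1.0000, F1 1.0000.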

Confusion Matrix (Augmented Test)

true\pred	pred=DC	pred=Dark horse	pred=Image	pred=Marvel
true=DC	44	0	0	0
true=Dark horse	0	0	1	0
true=Image	0	0	3	1
true=Marvel	0	0	0	61
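
The reported test metrics can be re-derived from this matrix. The sketch below assumes macro averaging with undefined per-class scores counted as 0; Dark horse is never predicted, so its column is all zeros. Variable and function names are illustrative.

```python
labels = ["DC", "Dark horse", "Image", "Marvel"]
# Rows are true labels, columns are predicted labels, both in the order above.
confusion = [
    [44, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 3, 1],
    [0, 0, 0, 61],
]


def safe_div(numerator, denominator):
    """Return 0.0 when a score is undefined (for example, a class never predicted)."""
    return numerator / denominator if denominator else 0.0


n = len(labels)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(n)) / total

precisions, recalls, f1s = [], [], []
for i in range(n):
    tp = confusion[i][i]
    predicted = sum(confusion[r][i] for r in range(n))  # column sum
    actual = sum(confusion[i])                          # row sum
    p, r = safe_div(tp, predicted), safe_div(tp, actual)
    precisions.append(p)
    recalls.append(r)
    f1s.append(safe_div(2 * p * r, p + r))

macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1s) / n
print(round(accuracy, 4), round(macro_precision, 4), round(macro_recall, 4), round(macro_f1, 4))
```

These values round to 0.9818, 0.6835, 0.6875, and 0.6855, which agrees with the reported figures and supports reading the precision, recall, and F1 numbers as macro averages.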

Brief error analysis

On the augmented test set (2 / 110 errors; 98.18% accuracy), mistakes cluster in the minority classes and in texts with heavy misspellings or cross‑franchise references. Two concrete examples:

True: Dark Horse → Pred: Image "Hellboy als ohas some pretty well made moies. I was introduced to him through the movies nd seeing the way that he cares for Humanity even though he's not necessarily human is very compelling." Hypothesis: strong cue “Hellboy” is present but minority‑class underrepresentation + heavy typos (e.g., als ohas, moies) weaken the signal; model over‑indexes on generic superhero vocabulary it associates with Image.
True: Image → Pred: Marvel "Atom Eve also has a very nuique power set. Similar to Firestormshe can change the atoms of inaniamte objects but not living biegs. for ilmitations don't seem to exist wen shes in a life or death state thoug.h" Hypothesis: severe misspellings (nuique, inaniamte, Firestormshe) and cross‑franchise mention (“Firestorm”, a DC character) blur the publisher signal; generic power‑set language drifts toward the majority Marvel class.

Known hard cases: (i) minority publishers (Dark Horse, Image), (ii) texts without explicit publisher/franchise cues (“Gotham,” “Avengers,” “Invincible”), (iii) heavy misspellings and concatenations that break tokenization (“Firestormshe”). Next steps: add spell‑noise augmentation, include more Dark Horse & Image examples (e.g., B.P.R.D., Invincible universe terms), and consider class‑balanced loss.
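
Two of these next steps can be sketched briefly. Both functions are hypothetical illustrations, not code from the training notebook. The class counts are the augmented test-set row totals from the confusion matrix, used only as a stand-in because the training-set counts are not given here, and the weighting formula (total / (number of classes × class count)) is one common heuristic among several.

```python
import random


def add_spell_noise(text, swap_prob=0.1, seed=None):
    """Swap adjacent letters at random to imitate typos such as 'nuique'."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def balanced_class_weights(counts):
    """Inverse-frequency weights: rarer classes get larger loss weights."""
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Stand-in counts (augmented test split), not the training distribution.
illustrative_counts = {"DC": 44, "Dark horse": 1, "Image": 4, "Marvel": 61}
weights = balanced_class_weights(illustrative_counts)

clean = "Atom Eve also has a very unique power set."
noisy = add_spell_noise(clean, swap_prob=0.3, seed=42)
```

Weights of this kind would typically be passed to a weighted cross-entropy loss; with the stand-in counts, Dark horse receives a weight of 27.5 against 0.625 for DC, which shows how extreme the imbalance is.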

How to use

from transformers import pipeline
clf = pipeline("text-classification", model="0408happyfeet/distilbert-superhero-universe-classifier")
print(clf([
    "Bruce patrols Gotham as part of the Justice League.",
    "Thor of Asgard fights with the Avengers."
]))
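
Because the model can fail on neutral or ambiguous text, it may help to flag low-confidence predictions for manual review. The sketch below assumes the pipeline returns one dict per input with "label" and "score" keys; the example predictions, the helper name `flag_low_confidence`, and the 0.8 threshold are all hypothetical and should be tuned on your own data.

```python
def flag_low_confidence(predictions, threshold=0.8):
    """Mark predictions whose score falls below the threshold."""
    return [{**pred, "needs_review": pred["score"] < threshold} for pred in predictions]


# Hypothetical outputs, written by hand to show the expected shape.
example_predictions = [
    {"label": "DC", "score": 0.99},
    {"label": "Marvel", "score": 0.55},
]
reviewed = flag_low_confidence(example_predictions)
```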

Limitations & ethical considerations

Narrow domain; not suitable for sensitive or consequential uses. May encode bias toward named entities and author style. Users should validate outputs on their own data distributions.

License

Apache‑2.0 (dataset and model).

AI usage disclosure

This training notebook and model card were authored by the student, with assistance from ChatGPT for boilerplate, diagnostics, and documentation improvements. Final decisions, hyperparameters, and evaluation were reviewed by the student.
