0408happyfeet/distilbert-superhero-universe-classifier

Model description

This repository hosts DistilBERT (distilbert-base-uncased) fine‑tuned on a classmate's dataset, rlogh/superhero-texts, to classify short superhero descriptions by publisher into four classes: DC, Marvel, Dark Horse, and Image.

Intended uses & limitations

  • Intended uses: quick labeling of short superhero‑related texts by publisher (DC, Marvel, Dark Horse, Image); course demonstrations and tutorials.
  • Limitations / not intended uses:
    • Safety‑critical decisions or any use demanding exhaustive factual correctness.
    • Domains outside superheroes; the model may rely on explicit named entities (e.g., “Gotham”, “Avengers”) and fail on neutral/ambiguous text.
    • Small, student‑authored dataset ⇒ potential stylistic bias and limited generalization.

Training and evaluation data

We use rlogh/superhero-texts with two splits:

  • augmented (~1.1k rows): split into train/val/test = 80/10/10 (stratified).
  • original (100 rows): used optionally as an external/OOD evaluation set to gauge generalization.
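The 80/10/10 stratified split can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's code: `stratified_split` and the toy label counts are hypothetical, and the notebook may well have used a library helper instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # shuffle within the class only
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy labels with hypothetical counts (not the real dataset).
labels = ["DC"] * 40 + ["Marvel"] * 50 + ["Image"] * 10
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting per class keeps rare publishers represented in every split, although with very few examples a class can still end up with a single test row.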

Labels are mapped to integers via {'DC': 0, 'Dark horse': 1, 'Image': 2, 'Marvel': 3}.
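As a small sketch, the mapping above can be inverted to turn predicted class ids back into publisher names. The names `label2id`, `id2label`, and `decode` are illustrative; whether the notebook stored these maps in the model config is not stated in this card.

```python
# Label map copied from the card; note the dataset's spelling "Dark horse".
label2id = {"DC": 0, "Dark horse": 1, "Image": 2, "Marvel": 3}

# Inverse map, used to decode integer predictions.
id2label = {idx: name for name, idx in label2id.items()}

def decode(pred_id):
    """Translate an integer class id into its publisher label."""
    return id2label[pred_id]

print(decode(1))
```

With transformers, dictionaries of this shape can be passed as `id2label` and `label2id` when loading a sequence-classification model, so that predictions carry readable labels.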

Preprocessing

  • Tokenizer: distilbert-base-uncased (uncased); truncation to max_length=256; dynamic padding with DataCollatorWithPadding.
  • Minimal text normalization; no additional cleaning.
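To make "dynamic padding" concrete, here is a toy sketch of the idea behind DataCollatorWithPadding: each batch is padded only to its own longest sequence rather than to a global maximum. `pad_batch` is a hypothetical helper, not the library's implementation, and its truncation is a plain slice (a real tokenizer also keeps the closing special token).

```python
def pad_batch(batch_token_ids, pad_id=0, max_length=256):
    """Truncate each sequence, then pad to the longest sequence in this batch."""
    truncated = [ids[:max_length] for ids in batch_token_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids only; real ids come from the tokenizer.
batch = pad_batch([[101, 2023, 2003, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because short descriptions rarely approach 256 tokens, padding per batch avoids spending compute on long runs of padding.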

Training procedure

Training hyperparameters

  • Seed: 42; Epochs: 4; LR: 2e-05; Per‑device batch size: 16
  • Optimizer/scheduler: Trainer defaults (AdamW with a linear learning‑rate schedule). Early stopping, if enabled, used patience = 2.
  • Strategies: eval_strategy='epoch' (named evaluation_strategy in older transformers releases), save_strategy='epoch', load_best_model_at_end=True.
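The hyperparameters listed above can be collected into a single dictionary, shown here as a sketch. The key names follow `TrainingArguments` in recent transformers releases, where the evaluation schedule is set with `eval_strategy`; the variable name `training_kwargs` is illustrative, and values the card does not state (output directory, evaluation batch size) are left out.

```python
# Hyperparameters as stated in this card; nothing else is assumed.
training_kwargs = {
    "seed": 42,
    "num_train_epochs": 4,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    # Older transformers releases call this key "evaluation_strategy".
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "bf16": False,
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

With transformers installed, a dictionary like this could be unpacked into `TrainingArguments` together with an output directory of your choice.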

Hardware & compute

  • Hardware: Tesla T4
  • Precision: bf16=False
  • Libraries: transformers 4.56.2, datasets 4.1.1, evaluate 0.4.6, accelerate 1.10.1, huggingface_hub 0.35.1
  • Python: 3.12.11 ; PyTorch: 2.8.0+cu126
  • Trainable parameters: 66,956,548

Evaluation results

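  • Augmented test: accuracy 0.9818, precision 0.6835, recall 0.6875, F1 0.6855.
  • External/OOD (original split): accuracy 1.0000, precision 1.0000, recall 1.0000, F1 1.0000.

Precision, recall, and F1 on the augmented test set match macro averages over the four classes computed from that split's confusion matrix. Dark Horse contributes zero because its single test example was misclassified, which is why these scores sit far below accuracy.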

Confusion Matrix (Augmented Test)

true \ pred    DC    Dark horse    Image    Marvel
DC             44    0             0        0
Dark horse     0     0             1        0
Image          0     0             3        1
Marvel         0     0             0        61
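The reported augmented-test metrics can be recomputed from this matrix. The sketch below assumes macro averaging and scores a class as 0 when its precision or recall is undefined; the card does not state the averaging mode, but these assumptions reproduce the reported figures. `macro_metrics` is an illustrative helper.

```python
# Rows are true classes, columns are predicted classes, in this order:
labels = ["DC", "Dark horse", "Image", "Marvel"]
cm = [
    [44, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 3, 1],
    [0, 0, 0, 61],
]

def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum
        p = tp / predicted if predicted else 0.0     # undefined -> 0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

metrics = macro_metrics(cm)
print({name: round(value, 4) for name, value in metrics.items()})
```

Rounded to four decimals, the values agree with the reported 0.9818, 0.6835, 0.6875, and 0.6855. With only one Dark Horse and four Image examples in this split, a single error moves the macro scores substantially.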

Brief error analysis

On the augmented test set (2 / 110 errors; 98.18% accuracy), mistakes cluster in the minority classes and in texts with heavy misspellings or cross‑franchise references. Two concrete examples:

  • True: Dark Horse → Pred: Image "Hellboy als ohas some pretty well made moies. I was introduced to him through the movies nd seeing the way that he cares for Humanity even though he's not necessarily human is very compelling." Hypothesis: the strong cue “Hellboy” is present, but minority‑class underrepresentation and heavy typos (e.g., als ohas, moies) weaken the signal; the model appears to over‑index on generic superhero vocabulary that it associates with Image.

  • True: Image → Pred: Marvel "Atom Eve also has a very nuique power set. Similar to Firestormshe can change the atoms of inaniamte objects but not living biegs. for ilmitations don't seem to exist wen shes in a life or death state thoug.h" Hypothesis: severe misspellings (nuique, inaniamte, Firestormshe) and a cross‑franchise mention (“Firestorm”, a DC character) blur the publisher signal; generic power‑set language drifts toward the majority Marvel class.

Known hard cases: (i) minority publishers (Dark Horse, Image), (ii) texts lacking explicit publisher/franchise cues such as “Gotham,” “Avengers,” or “Invincible”, (iii) heavy misspellings and concatenations that break tokenization (“Firestormshe”). Next steps: add spell‑noise augmentation, include more Dark Horse and Image examples (e.g., B.P.R.D., Invincible universe terms), and consider a class‑balanced loss.
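One way the class-balanced loss mentioned above could be set up is with inverse-frequency class weights, sketched here. This is an assumption about a possible next step, not something the notebook did: `balanced_class_weights` is a hypothetical helper, and the counts are the augmented-test supports used as a stand-in for training counts, which this card does not report.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}

# Stand-in counts taken from the augmented test split (assumption:
# the stratified training split has roughly the same proportions).
counts = {"DC": 44, "Dark horse": 1, "Image": 4, "Marvel": 61}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Weights like these could then be supplied to a weighted cross-entropy loss so that errors on Dark Horse and Image cost more than errors on DC or Marvel. Very large weights on tiny classes can destabilize training, so capping or smoothing them may be worth trying.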

How to use

from transformers import pipeline
clf = pipeline("text-classification", model="0408happyfeet/distilbert-superhero-universe-classifier")
print(clf([
    "Bruce patrols Gotham as part of the Justice League.",
    "Thor of Asgard fights with the Avengers."
]))

Limitations & ethical considerations

Narrow domain; not suitable for sensitive or consequential uses. May encode bias toward named entities and author style. Users should validate outputs on their own data distributions.

License

Apache‑2.0 (dataset and model).

AI usage disclosure

This training notebook and model card were authored by the student with assistance from ChatGPT for boilerplate, diagnostics, and documentation improvements. Final decisions, hyperparameters, and evaluation were reviewed by the student.
