Use Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and its tokenizer
identity_model = AutoModelForSequenceClassification.from_pretrained("Mridul2003/identity-hate-detector").to(device)
identity_tokenizer = AutoTokenizer.from_pretrained("Mridul2003/identity-hate-detector")

final_text = "your text to classify"
results = {}

# Tokenize and move tensors to the target device
identity_inputs = identity_tokenizer(final_text, return_tensors="pt", padding=True, truncation=True)
if "token_type_ids" in identity_inputs:
    del identity_inputs["token_type_ids"]  # not used by this model
identity_inputs = {k: v.to(device) for k, v in identity_inputs.items()}

# Run inference and convert logits to probabilities
with torch.no_grad():
    identity_outputs = identity_model(**identity_inputs)
identity_probs = torch.sigmoid(identity_outputs.logits)

# Index 1 = Identity Hate, index 0 = Not Identity Hate
identity_prob = identity_probs[0][1].item()
not_identity_prob = identity_probs[0][0].item()
results["identity_hate_custom"] = identity_prob
results["not_identity_hate_custom"] = not_identity_prob
```
Offensive Language Classifier (Fine-Tuned on Custom Dataset)
This repository contains a fine-tuned version of the unitary/toxic-bert model for binary classification of offensive language (labels: Offensive vs Not Offensive). The model was fine-tuned on a custom dataset due to limitations observed in the base model's performance, particularly on identity_hate-related content.
Problem with Base Model (unitary/toxic-bert)
The original unitary/toxic-bert model is trained for multi-label toxicity detection with 6 categories:
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
While it performs reasonably well on generic toxicity, it struggles with edge cases involving identity-based hate speech, often:
- Misclassifying subtle or sarcastic identity attacks
- Underestimating offensive content with identity-specific slurs
Why Fine-Tune?
We fine-tuned the model on a custom annotated dataset with two clear labels:
- 0: Not Identity Hate
- 1: Identity Hate
The new model simplifies the task into a binary classification problem, allowing more focused training for real-world moderation scenarios.
Dataset Overview
- Total examples: ~4,000+
- Balanced between offensive and non-offensive labels
- Contains high proportions of identity_hate, obscene, insult, and more nuanced samples
Model Details
- Base model: unitary/toxic-bert
- Fine-tuned using: the Hugging Face Trainer API (see the sketch below)
- Loss function: CrossEntropyLoss (via num_labels=2)
- Batch size: 8
- Epochs: 3
- Learning rate: 2e-5
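The training script itself is not included in this card, but the hyperparameters above map directly onto the Trainer API. Below is a minimal sketch under that assumption; the placeholder dataset, its column names (`text`, `label`), and the output directory are illustrative, not part of this repository.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder data standing in for the custom annotated dataset
# (0 = Not Identity Hate, 1 = Identity Hate)
train_ds = Dataset.from_dict({"text": ["example comment", "another comment"], "label": [0, 1]})
eval_ds = Dataset.from_dict({"text": ["held-out comment"], "label": [0]})

tokenizer = AutoTokenizer.from_pretrained("unitary/toxic-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "unitary/toxic-bert",
    num_labels=2,                  # binary head -> CrossEntropyLoss
    ignore_mismatched_sizes=True,  # the base model ships a 6-label head
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="identity-hate-detector",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```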
Performance (Binary Classification)
| Metric | Value |
|---|---|
| Accuracy | ~92% |
| Precision / Recall | Balanced |
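The evaluation script is likewise not published here; figures like these could be reproduced with standard scikit-learn metrics over a held-out split. A minimal sketch, assuming arrays of gold and predicted labels:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical label arrays: y_true are gold labels, y_pred are the model's
# argmax predictions (e.g. outputs.logits.argmax(-1)) on a held-out split.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```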