Hindi-English Code-Mixed Hate Detection


Model Details

Model Description

This is the model card of a 🤗 Transformers model that has been pushed to the Hub. The model detects hate speech in Hindi-English code-mixed text.

How to Get Started with the Model

Use the code below to load the model and run inference on a code-mixed sentence:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and fine-tuned classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained("debajyotimaz/codemix_hate")
model = AutoModelForSequenceClassification.from_pretrained("debajyotimaz/codemix_hate")

# Tokenize a Hindi-English code-mixed sentence
inputs = tokenizer("Mai tumse hate karta hun", return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    prediction = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(prediction.logits)
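
The classifier returns raw logits. Below is a minimal sketch of turning them into a human-readable prediction, assuming label index 0 corresponds to non-hate and index 1 to hate; this mapping is an assumption, so check model.config.id2label for the actual labels.

# Convert logits to class probabilities and pick the most likely label.
# Assumed mapping: 0 = non-hate, 1 = hate (verify against model.config.id2label).
probs = torch.softmax(prediction.logits, dim=-1)
predicted_class = probs.argmax(dim=-1).item()
labels = {0: "non-hate", 1: "hate"}  # assumed label order
print(f"Prediction: {labels[predicted_class]} (p = {probs[0, predicted_class]:.3f})")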

Citation

@article{10.1145/3726866,
author = {Mazumder, Debajyoti and Kumar, Aakash and Patro, Jasabanta},
title = {Improving Code-Mixed Hate Detection by Native Sample Mixing: A Case Study for Hindi-English Code-Mixed Scenario},
year = {2025},
issue_date = {May 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {24},
number = {5},
issn = {2375-4699},
url = {https://doi.org/10.1145/3726866},
doi = {10.1145/3726866},
abstract = {Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see much less work on code-mixed hate as large-scale annotated hate corpora are unavailable for the study. To overcome this bottleneck, we propose using native language hate samples (native language samples/ native samples hereafter). We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by majorly relying on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This article attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of the interesting observations we got are: (i) adding native hate samples in the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone observed to be detecting code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn’t help much to detect code-mixed hate. We have released the data and code repository to reproduce the reported results.},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
month = apr,
articleno = {47},
numpages = {21},
keywords = {Code-mixed hate detection, cross-lingual learning, native sample mixing}
}
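
For context, here is a minimal sketch of the native sample mixing idea described in the abstract above: augment a code-mixed training set with a small fraction of native-language hate samples before fine-tuning. The function and variable names (mix_native_samples, code_mixed_train, native_hindi_samples) and the 10% ratio are hypothetical illustrations, not taken from the paper.

import random

def mix_native_samples(code_mixed_train, native_hindi_samples, ratio=0.1):
    """Augment a code-mixed training set with a fraction of native-language samples.

    Both arguments are lists of (text, label) pairs; `ratio` controls how many
    native samples are added relative to the code-mixed set size. Hypothetical
    helper for illustration only; see the paper for the actual mixing setups.
    """
    k = min(int(len(code_mixed_train) * ratio), len(native_hindi_samples))
    return code_mixed_train + random.sample(native_hindi_samples, k)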