Abstract
Evaluation of open-source LLMs for antisemitic content detection using an in-context definition and a new Guided-CoT prompt shows improved performance and highlights differences in model utility, explainability, and reliability.
Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging an in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.
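The abstract describes Guided-CoT as a CoT-style prompt that embeds an in-context policy and is evaluated under self-consistency decoding. Below is a minimal, hypothetical sketch of how a policy-grounded, step-guided prompt and a self-consistency majority vote could be wired together; the placeholder policy text, the guided steps, the `parse_verdict` heuristic, and the `generate` callable are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: a policy-grounded, step-guided prompt with
# self-consistency voting. Not the paper's actual Guided-CoT prompt.
from collections import Counter

IHRA_POLICY = "<IHRA working definition of antisemitism, with contemporary examples>"  # placeholder text

GUIDED_STEPS = [
    "Identify who or what the post is about.",
    "Check whether the post invokes any part of the policy above.",
    "Consider context such as quotation, news reporting, or criticism.",
    "Give a final verdict: ANTISEMITIC or NOT ANTISEMITIC.",
]

def build_prompt(post: str) -> str:
    """Assemble a policy-grounded, step-guided prompt for a single post."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(GUIDED_STEPS))
    return (
        f"Policy:\n{IHRA_POLICY}\n\n"
        f"Post:\n{post}\n\n"
        f"Follow these steps and reason before answering:\n{steps}\n"
    )

def parse_verdict(text: str) -> str:
    """Naive keyword-based verdict extraction, for illustration only."""
    return "NOT ANTISEMITIC" if "NOT ANTISEMITIC" in text.upper() else "ANTISEMITIC"

def self_consistent_label(post: str, generate, n_samples: int = 5) -> str:
    """Sample several reasoning paths and majority-vote the final verdict.
    `generate` is any callable mapping a prompt string to a model response."""
    votes = [parse_verdict(generate(build_prompt(post))) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]
```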
Community
Accepted to EMNLP 2025 Main Conference
Below, we summarize our main findings and contributions across the eight models we study:
- We present the first systematic evaluation of LLMs for antisemitism detection, demonstrating differences in utility (refusal rates, ambiguity, and repetitive generation) and performance traceable to model selection.
- Across nearly all models, our engineered Guided-CoT consistently outperforms Zero-Shot and Zero-Shot-CoT, regardless of decoding strategy, model size, or reasoning capability. With Self-consistency, Guided-CoT improves positive-class F1 scores by 0.03 to 0.13 over Zero-Shot-CoT and reduces refusal rates to nearly 0%, thus enhancing model utility.
- Providing additional context (in our case, the IHRA definition with contemporary examples as the policy, rather than a short definition) does not necessarily improve model performance under Zero-Shot or Zero-Shot-CoT prompts; some models even show a decrease in performance. When a policy must be included in the prompt, Guided-CoT can help.
- We introduce metrics to quantify semantic divergence in model explanations and find that Zero-Shot prompts produce homogeneous responses across models, yet each individual model's explanations differ significantly between antisemitic and non-antisemitic cases. In contrast, CoT-based prompts, especially Guided-CoT, surface differences in explanations across all models, while the differences between positive and negative classes are not significant for most models (an illustrative sketch of such a divergence measure appears after this list).
- Qualitative analysis reveals that LLMs struggle to understand contextual cues in writing patterns. LLMs often label posts as antisemitic solely because they contain stereotypical or offensive terms; they also mislabel quoted text, news-style reports, and neutral or critical opinions. Interestingly, LLMs flag typos (e.g., 'kikes' intended as 'likes') and proper nouns that resemble slurs (e.g., 'Kiké') as antisemitic.
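As referenced above, the paper introduces metrics that quantify semantic divergence in model-generated rationales. The snippet below is a hedged illustration of one way such a measure could be computed from sentence embeddings; the encoder choice (`all-MiniLM-L6-v2`) and the within-minus-across-class formulation are assumptions and may differ from the paper's actual metrics.

```python
# Illustrative divergence measure over rationale embeddings; assumptions only.
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not specified by the paper

def mean_pairwise_similarity(rationales: list[str]) -> float:
    """Average cosine similarity over all pairs of rationale embeddings."""
    embs = encoder.encode(rationales, normalize_embeddings=True)
    sims = [float(embs[i] @ embs[j]) for i, j in combinations(range(len(embs)), 2)]
    return float(np.mean(sims))

def class_separation(pos_rationales: list[str], neg_rationales: list[str]) -> float:
    """Within-class similarity minus across-class similarity: higher values mean
    a model's explanations for antisemitic vs. non-antisemitic posts diverge more."""
    within = (mean_pairwise_similarity(pos_rationales)
              + mean_pairwise_similarity(neg_rationales)) / 2
    embs_p = encoder.encode(pos_rationales, normalize_embeddings=True)
    embs_n = encoder.encode(neg_rationales, normalize_embeddings=True)
    across = float(np.mean(embs_p @ embs_n.T))
    return within - across
```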
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Labels or Input? Rethinking Augmentation in Multimodal Hate Detection (2025)
- SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models (2025)
- Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning (2025)
- Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets (2025)
- Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition (2025)
- Cyberbullying Detection via Aggression-Enhanced Prompting (2025)
- Scaling behavior of large language models in emotional safety classification across sizes and tasks (2025)