arxiv:2509.18293

Evaluating Large Language Models for Detecting Antisemitism

Published on Sep 22 · Submitted by Jay Patel on Sep 26

Abstract

Evaluation of open-source LLMs for antisemitic content detection using an in-context definition and a new Guided-CoT prompt shows improved performance and highlights differences in model utility, explainability, and reliability.

AI-generated summary

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging an in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.
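As an illustration of how an in-context policy and a Guided-CoT-style prompt might be wired together, here is a minimal Python sketch. The policy placeholder, step wording, and answer format are assumptions made for illustration; they are not the paper's actual prompts.

```python
# Illustrative only: not the paper's actual prompts. The policy placeholder,
# step wording, and answer format below are assumptions.

POLICY = (
    "<working definition of antisemitism, e.g. the IHRA definition with "
    "contemporary examples, pasted here as the in-context policy>"
)

def zero_shot_prompt(post: str) -> str:
    """Plain zero-shot classification with the policy given as context."""
    return (
        f"Policy:\n{POLICY}\n\n"
        f"Post: {post}\n"
        "According to the policy above, is this post antisemitic? "
        "Answer 'Yes' or 'No'."
    )

def guided_cot_prompt(post: str) -> str:
    """Guided-CoT-style prompt: walk the model through explicit checks
    against the policy before it commits to a label (steps are hypothetical)."""
    steps = [
        "1. Identify who or what the post targets.",
        "2. Check whether the post matches any clause or example in the policy.",
        "3. Consider context such as quotation, news reporting, or criticism.",
        "4. Give a brief rationale, then answer 'Yes' or 'No'.",
    ]
    return (
        f"Policy:\n{POLICY}\n\n"
        f"Post: {post}\n"
        "Work through these steps before answering:\n" + "\n".join(steps)
    )

print(guided_cot_prompt("example social media post"))
```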

Community

Paper author and submitter

Accepted to EMNLP 2025 Main Conference

Below, we summarize our main findings and contributions across the eight models we study:

  • We present the first systematic evaluation of LLMs for antisemitism detection, demonstrating differences in utility (refusal rates, ambiguity, and repetitive generation) and in performance that are traceable to model selection.
  • Across nearly all models, our engineered Guided-CoT consistently outperforms Zero-Shot and Zero-Shot-CoT, regardless of decoding strategy, model size, or reasoning capability. With Self-consistency, Guided-CoT improves positive-class F1-scores by 0.03 to 0.13 over Zero-Shot-CoT and reduces refusal rates to nearly 0%, enhancing model utility (see the sketch after this list).
  • Providing additional context (in our case, the IHRA definition with contemporary examples as the policy, instead of a short definition) does not necessarily improve model performance under Zero-Shot or Zero-Shot-CoT prompts; some models even show a drop in performance. When such a policy must be included in the prompt, Guided-CoT can help.
  • We introduce metrics to quantify model explanations and find that Zero-Shot prompts yield homogeneous responses across models, yet each individual model's explanations differ significantly between antisemitic and non-antisemitic cases. In contrast, CoT-based prompts, especially Guided-CoT, surface differences in explanations across models, while for most models the differences between positive and negative classes are not significant (see the divergence sketch after this list).
  • Qualitative analysis reveals that LLMs struggle to understand contextual cues in writing patterns. LLMs label posts as antisemitic solely because they contain stereotypical or offensive terms; additionally, LLMs mislabel quoted text and news-style reports, as well as neutral or critical opinions. Interestingly, LLMs flag typos (e.g., 'kikes' intended as 'likes') and proper nouns (e.g., 'Kiké') that resemble slurs as antisemitic.
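To make the self-consistency voting and the rationale-divergence measurement referenced above concrete, here is a minimal sketch. It assumes K generations are sampled per post, that labels have already been parsed from those generations, and that rationales have been embedded with some sentence-embedding model; the mean pairwise cosine distance used here is an illustrative stand-in, not the paper's exact metric definitions.

```python
import numpy as np
from collections import Counter

def self_consistent_label(labels: list[str]) -> str:
    """Self-consistency: majority vote over labels parsed from K sampled
    Guided-CoT generations for the same post."""
    return Counter(labels).most_common(1)[0][0]

def rationale_divergence(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between rationale embeddings.

    `embeddings` is a (K, d) array, e.g. sentence-embedding vectors of the K
    model-generated explanations for one post. Higher values indicate more
    semantically divergent rationales (an assumed, illustrative metric).
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                             # (K, K) cosine similarities
    upper = sims[np.triu_indices(sims.shape[0], k=1)]    # unique pairs only
    return float(np.mean(1.0 - upper))

# Example with five sampled runs on a single post.
print(self_consistent_label(["Yes", "Yes", "No", "Yes", "Yes"]))  # -> Yes

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 384))              # stand-in for real embeddings
print(round(rationale_divergence(fake_embeddings), 3))
```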
