ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Abstract
A benchmark for Chinese harmful content detection is introduced, along with a knowledge-augmented baseline that improves efficiency and accuracy by combining human-annotated knowledge rules with implicit knowledge from LLMs.
Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese harmful content detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
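To illustrate the idea behind the knowledge-augmented baseline, the sketch below shows how explicit, human-annotated knowledge rules might be injected into a classification prompt for an LLM. It is a minimal, hypothetical example: the category names, rules, example text, default model name, and the use of the `openai` client are illustrative assumptions, not the benchmark's actual taxonomy or the released implementation (see the repository linked above for the real code and data).

```python
# Minimal sketch of knowledge-rule-augmented prompting for harmful content
# detection. All rules, categories, and the example text are made-up
# placeholders for illustration only.
from openai import OpenAI

# Hypothetical rule base: category -> human-annotated knowledge rule.
KNOWLEDGE_RULES = {
    "gambling": "Mentions of betting odds, payout ratios, or invitations to wager money.",
    "fraud": "Fake prize notifications, requests for bank details, or impersonation of officials.",
    "non_violation": "Content that matches none of the harmful categories.",
}


def build_prompt(text: str) -> str:
    """Combine the explicit knowledge rules with the text to be moderated."""
    rules = "\n".join(f"- {cat}: {rule}" for cat, rule in KNOWLEDGE_RULES.items())
    return (
        "You are a Chinese content moderation assistant.\n"
        "Classify the text into exactly one category, using these rules:\n"
        f"{rules}\n\n"
        f"Text: {text}\n"
        "Answer with the category name only."
    )


def classify(text: str, model: str = "gpt-4o-mini") -> str:
    """Send the knowledge-augmented prompt to a chat model and return its label."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(text)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    # Illustrative, invented input text (gambling-style solicitation).
    print(classify("加微信领取百倍赔率内部赛事资料"))
```

The same prompt-construction step could also be used to distill rule-grounded labels from a large teacher model into a smaller student model, which is the spirit of the knowledge-augmented baseline described in the abstract.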
Community
We present ChineseHarm-Bench, a comprehensive, professionally annotated benchmark for Chinese harmful content detection, covering six representative categories and constructed entirely from real-world data.
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API:
- CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models (2025)
- MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine (2025)
- Mixture of Small and Large Models for Chinese Spelling Check (2025)
- SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning (2025)
- From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring (2025)
- Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs (2025)
- EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications (2025)
Models citing this paper 3
Datasets citing this paper 1
Spaces citing this paper 0
Collections including this paper 0