arxiv:2506.12299

QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety

Published on Jun 14

· Submitted by

Authors:

Abstract

QGuard, a safety guard method using question prompting, effectively defends LLMs against harmful and multi-modal malicious prompts without fine-tuning.

AI-generated summary

The recent advancements in Large Language Models(LLMs) have had a significant impact on a wide range of fields, from general domains to specialized areas. However, these advancements have also significantly increased the potential for malicious users to exploit harmful and jailbreak prompts for malicious attacks. Although there have been many efforts to prevent harmful prompts and jailbreak prompts, protecting LLMs from such malicious attacks remains an important and challenging task. In this paper, we propose QGuard, a simple yet effective safety guard method, that utilizes question prompting to block harmful prompts in a zero-shot manner. Our method can defend LLMs not only from text-based harmful prompts but also from multi-modal harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets. Additionally, by providing an analysis of question prompting, we enable a white-box analysis of user inputs. We believe our method provides valuable insights for real-world LLM services in mitigating security risks associated with harmful prompts.

View arXiv page View PDF Add to collection

Community

Taegyeonglee

Paper submitter 2 days ago

The recent advancements in Large Language
Models(LLMs) have had a significant impact
on a wide range of fields, from general domains to specialized areas. However, these
advancements have also significantly increased
the potential for malicious users to exploit
harmful and jailbreak prompts for malicious
attacks. Although there have been many efforts to prevent harmful prompts and jailbreak
prompts, protecting LLMs from such malicious
attacks remains an important and challenging
task. In this paper, we propose QGuard, a
simple yet effective safety guard method, that
utilizes question prompting to block harmful
prompts in a zero-shot manner. Our method
can defend LLMs not only from text-based
harmful prompts but also from multi-modal
harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal
harmful datasets. Additionally, by providing
an analysis of question prompting, we enable
a white-box analysis of user inputs. We believe our method provides valuable insights for
real-world LLM services in mitigating security
risks associated with harmful prompts.

librarian-bot

1 day ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.12299 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.12299 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.12299 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.