QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety
Abstract
QGuard, a safety guard method using question prompting, effectively defends LLMs against harmful and multi-modal malicious prompts without fine-tuning.
The recent advancements in Large Language Models(LLMs) have had a significant impact on a wide range of fields, from general domains to specialized areas. However, these advancements have also significantly increased the potential for malicious users to exploit harmful and jailbreak prompts for malicious attacks. Although there have been many efforts to prevent harmful prompts and jailbreak prompts, protecting LLMs from such malicious attacks remains an important and challenging task. In this paper, we propose QGuard, a simple yet effective safety guard method, that utilizes question prompting to block harmful prompts in a zero-shot manner. Our method can defend LLMs not only from text-based harmful prompts but also from multi-modal harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets. Additionally, by providing an analysis of question prompting, we enable a white-box analysis of user inputs. We believe our method provides valuable insights for real-world LLM services in mitigating security risks associated with harmful prompts.
Community
The recent advancements in Large Language
Models(LLMs) have had a significant impact
on a wide range of fields, from general domains to specialized areas. However, these
advancements have also significantly increased
the potential for malicious users to exploit
harmful and jailbreak prompts for malicious
attacks. Although there have been many efforts to prevent harmful prompts and jailbreak
prompts, protecting LLMs from such malicious
attacks remains an important and challenging
task. In this paper, we propose QGuard, a
simple yet effective safety guard method, that
utilizes question prompting to block harmful
prompts in a zero-shot manner. Our method
can defend LLMs not only from text-based
harmful prompts but also from multi-modal
harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal
harmful datasets. Additionally, by providing
an analysis of question prompting, we enable
a white-box analysis of user inputs. We believe our method provides valuable insights for
real-world LLM services in mitigating security
risks associated with harmful prompts.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning (2025)
- MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety (2025)
- From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment (2025)
- One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs (2025)
- BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage (2025)
- SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (2025)
- Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper