Abstract
Form-Dependent Bias limits the effectiveness of LLM unlearning across different knowledge expressions, and Rank-one Concept Redirection (ROCR) is proposed as a form-independent solution that enhances unlearning efficacy.
Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within a model, offering promise for controlling harmful or private information and preventing misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of the training samples and frequently fails to generalize to alternative expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate how it manifests across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent in order to handle the endless variety of downstream task forms encountered in real-world, security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It can modify model parameters within seconds to redirect the model's perception of a targeted unlearning concept to a harmless one. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness over traditional methods while producing highly natural outputs.
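The abstract describes ROCR only at a high level. As a rough intuition for what a training-free, rank-one concept redirection can look like, the sketch below applies a ROME-style closed-form rank-one edit to a stand-in projection matrix so that the key vector of a target concept is mapped to the output the layer already produces for a harmless concept. The matrix `W`, the key vectors, and the exact update rule here are illustrative assumptions for this sketch, not the paper's published procedure.

```python
# Hedged sketch of a rank-one "concept redirection" edit on a toy weight matrix.
# Assumptions (not from the paper): W stands in for one edited projection layer,
# k_target is a hidden key vector activated by the unlearning target concept,
# and k_harmless is the key vector of the concept we redirect to.
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, d) / d**0.5       # stand-in for one projection matrix inside the model

k_target = torch.randn(d)            # key vector for the unlearning target concept
k_harmless = torch.randn(d)          # key vector for the harmless redirect concept

v_star = W @ k_harmless              # output the layer already produces for the harmless concept

# Closed-form rank-one update: W' = W + (v* - W k) k^T / (k^T k),
# so that W' @ k_target == v_star, i.e. downstream computation "perceives"
# the target concept as the harmless one.
residual = v_star - W @ k_target
W_edited = W + torch.outer(residual, k_target) / (k_target @ k_target)

print(torch.allclose(W_edited @ k_target, v_star, atol=1e-5))      # True: concept redirected

# Directions orthogonal to k_target are left exactly unchanged.
k_other = torch.randn(d)
k_other -= (k_other @ k_target) / (k_target @ k_target) * k_target
print(torch.allclose(W_edited @ k_other, W @ k_other, atol=1e-5))  # True: unrelated inputs unaffected
```

In an actual LLM, the edited weight would be a specific projection inside a transformer block and the key vectors would be extracted from hidden activations on prompts invoking the two concepts; the closed-form nature of such an update is what allows an edit of this kind to be applied in seconds without training.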
Community
TLDR: We identify Form-Dependent Bias as a key limitation of LLM unlearning, where methods fail on varied expressions of the same knowledge. We introduce the ORT benchmark and ROCR, a novel training-free method that achieves robust, effective unlearning by targeting the invariants in downstream tasks (the activated dangerous concepts) while producing natural outputs.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SoK: Machine Unlearning for Large Language Models (2025)
- SEPS: A Separability Measure for Robust Unlearning in LLMs (2025)
- Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs (2025)
- OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models (2025)
- Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness (2025)
- Unilogit: Robust Machine Unlearning for LLMs Using Uniform-Target Self-Distillation (2025)
- Existing Large Language Model Unlearning Evaluations Are Inconclusive (2025)