Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging Paper • 2412.19512 • Published Dec 27, 2024 • 8
Course-Correction: Safety Alignment Using Synthetic Preferences Paper • 2407.16637 • Published Jul 23, 2024 • 27
Refusal in Language Models Is Mediated by a Single Direction Paper • 2406.11717 • Published Jun 17, 2024 • 3
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Paper • 2505.11049 • Published May 16, 2025 • 61
Automating Steering for Safe Multimodal Large Language Models Paper • 2507.13255 • Published Jul 17, 2025 • 3
The Devil Behind the Mask: An Emergent Safety Vulnerability of Diffusion LLMs Paper • 2507.11097 • Published Jul 15, 2025 • 63
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models Paper • 2407.05965 • Published Jul 8, 2024
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report Paper • 2507.16534 • Published Jul 22, 2025 • 6
AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization Paper • 2508.02079 • Published Aug 2025 • 2
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs Paper • 2508.06601 • Published Aug 2025 • 6