arxiv:2506.06444

Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Published on Jun 6 · Submitted by q-rz on Jun 10

Abstract

AI-generated summary: SAFFRON, a novel inference scaling paradigm, enhances LLM safety by reducing reward model evaluations through a multifurcation reward model and other optimizations.

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration–efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key–value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
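For intuition, here is a minimal, non-authoritative sketch of the multifurcation idea described above: a single reward-model forward pass scores every candidate next token, so each tree-search expansion needs one evaluation instead of one PRM call per child. Everything in this sketch is an illustrative assumption rather than the released Saffron-1 implementation: the names (MRMHead, expand), the dimensions, and the boolean vocabulary mask standing in for the conservative exploration constraint are hypothetical, and the Trie-based key–value cache sharing is omitted.

```python
# Minimal sketch of the multifurcation-reward-model (MRM) idea: one reward-model
# forward pass scores *all* candidate next tokens, so a tree search can expand
# many branches per node with a single evaluation instead of one PRM call per
# branch. Names, sizes, and the exploration mask below are illustrative only.

import torch
import torch.nn as nn

VOCAB_SIZE = 32_000   # toy vocabulary size (assumption)
HIDDEN_DIM = 4_096    # toy hidden size (assumption)
TOP_K = 8             # branches kept per expansion (assumption)

class MRMHead(nn.Module):
    """Maps a policy model's last hidden state to one reward per vocabulary token."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim) -> token_rewards: (batch, vocab_size)
        return self.proj(hidden_state)

def expand(hidden_state: torch.Tensor, mrm: MRMHead, allowed: torch.Tensor):
    """One tree-search expansion: a single MRM call scores every next token.

    `allowed` is a boolean vocabulary mask standing in for a conservative
    exploration constraint (restricting search to tokens the policy itself
    would plausibly emit); how that mask is built is not shown here.
    """
    token_rewards = mrm(hidden_state).squeeze(0)           # (vocab_size,)
    token_rewards = token_rewards.masked_fill(~allowed, float("-inf"))
    rewards, tokens = torch.topk(token_rewards, TOP_K)     # best branches, one evaluation
    return list(zip(tokens.tolist(), rewards.tolist()))

if __name__ == "__main__":
    mrm = MRMHead(HIDDEN_DIM, VOCAB_SIZE)
    h = torch.randn(1, HIDDEN_DIM)                  # stand-in for a policy hidden state
    mask = torch.zeros(VOCAB_SIZE, dtype=torch.bool)
    mask[:1000] = True                              # toy "in-distribution" token set
    for token_id, reward in expand(h, mrm, mask):
        print(f"branch token={token_id:>5d}  predicted reward={reward:+.3f}")
```

In this sketch, expanding a node costs one projection over the vocabulary rather than TOP_K separate reward-model evaluations, which is the efficiency gain the abstract attributes to the MRM.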

Community

Paper author · Paper submitter

😲 Not only reasoning?! Inference scaling can now boost LLM safety!

🚀 Introducing our pioneering work Saffron-1:

  • reduces attack success rate from 66% to 17.5%
  • using only 59.7 TFLOP compute
  • against the latest jailbreak attacks
  • without finetuning the language model

on the Ai2 Refusals benchmark.

📖 Paper: https://arxiv.org/pdf/2506.06444
🖥️ Code: https://github.com/q-rz/saffron
🌐 Webpage: https://q-rz.github.io/p/saffron


