Abstract
Researchers develop and evaluate techniques to uncover hidden knowledge in large language models through black-box and white-box methods, with prefill attacks and logit lens being particularly effective.
We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in 2/3 settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. In our remaining setting, white-box techniques based on logit lens and sparse autoencoders (SAEs) are most effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.
Community
We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in 2/3 settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. In our remaining setting, white-box techniques based on logit lens and sparse autoencoders (SAEs) are most effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation (2025)
- The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage (2025)
- Caught in the Act: a mechanistic approach to detecting deception (2025)
- Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models (2025)
- No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes (2025)
- SoK: Large Language Model Copyright Auditing via Fingerprinting (2025)
- MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper