safety - a leonardlin Collection

safety

updated Jun 7, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Paper • 2401.05566 • Published Jan 10, 2024 • 30

Weak-to-Strong Jailbreaking on Large Language Models
Paper • 2401.17256 • Published Jan 30, 2024 • 16

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Paper • 2401.17263 • Published Jan 30, 2024 • 1

Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming
Paper • 2311.06237 • Published Nov 10, 2023 • 1

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Paper • 2402.04249 • Published Feb 6, 2024 • 6

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Paper • 2404.13208 • Published Apr 19, 2024 • 40

Improving Alignment and Robustness with Short Circuiting
Paper • 2406.04313 • Published Jun 6, 2024 • 1