Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition Paper • 2507.20526 • Published Jul 28
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems Paper • 2504.07831 • Published Apr 10
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents Paper • 2410.09024 • Published Oct 11, 2024
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents Paper • 2410.10871 • Published Oct 8, 2024