Universal Jailbreak Suffixes Are Strong Attention Hijackers Paper • 2506.12880 • Published Jun 15 • 5
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization Paper • 2506.10920 • Published Jun 12 • 6
Enhancing Automated Interpretability with Output-Centric Feature Descriptions Paper • 2501.08319 • Published Jan 14 • 11
CoverBench: A Challenging Benchmark for Complex Claim Verification Paper • 2408.03325 • Published Aug 6, 2024 • 15
Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings Paper • 2407.20581 • Published Jul 30, 2024 • 25
Evaluating the Ripple Effects of Knowledge Editing in Language Models Paper • 2307.12976 • Published Jul 24, 2023 • 12
From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty Paper • 2407.06071 • Published Jul 8, 2024 • 7
🔍 Interpretability & Analysis of LMs Collection Outstanding research in LM interpretability and evaluation, summarized • 123 items • Updated 4 days ago • 109
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces Paper • 2406.11614 • Published Jun 17, 2024 • 5
Estimating Knowledge in Large Language Models Without Generating a Single Token Paper • 2406.12673 • Published Jun 18, 2024 • 8 • 1
Do Large Language Models Latently Perform Multi-Hop Reasoning? Paper • 2402.16837 • Published Feb 26, 2024 • 30
Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models Paper • 2401.06102 • Published Jan 11, 2024 • 23
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers Paper • 2401.04695 • Published Jan 9, 2024 • 13