Towards eliciting latent knowledge from LLMs with mechanistic interpretability (arXiv:2505.14352, published May 20, 2025)
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models (arXiv:2411.14257, published Nov 21, 2024)
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (arXiv:2408.05147, published Aug 9, 2024)
Progress measures for grokking via mechanistic interpretability (arXiv:2301.05217, published Jan 12, 2023)
Emergent Linear Representations in World Models of Self-Supervised Sequence Models (arXiv:2309.00941, published Sep 2, 2023)
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching (arXiv:2311.17030, published Nov 28, 2023)
AtP*: An efficient and scalable method for localizing LLM behaviour to components (arXiv:2403.00745, published Mar 1, 2024)
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (arXiv:2309.16042, published Sep 27, 2023)
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations (arXiv:2302.03025, published Feb 6, 2023)