science-of-finetuning/max-activating-examples-gemma-2-2b-l13-mu4.1e-02-lr1e-04 Viewer • Updated about 2 hours ago • 73.7k • 157
science-of-finetuning/max-activating-examples-gemma-2-2b-l13-ckissane Viewer • Updated 8 days ago • 16.4k • 45
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Paper • 2411.14257 • Published Nov 21, 2024 • 11
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 Paper • 2408.05147 • Published Aug 9, 2024 • 39
Progress measures for grokking via mechanistic interpretability Paper • 2301.05217 • Published Jan 12, 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models Paper • 2309.00941 • Published Sep 2, 2023 • 1
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching Paper • 2311.17030 • Published Nov 28, 2023
AtP*: An efficient and scalable method for localizing LLM behaviour to components Paper • 2403.00745 • Published Mar 1, 2024 • 13
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Paper • 2309.16042 • Published Sep 27, 2023 • 3
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations Paper • 2302.03025 • Published Feb 6, 2023