Towards eliciting latent knowledge from LLMs with mechanistic interpretability (arXiv:2505.14352, published May 20, 2025)
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models (arXiv:2411.14257, published Nov 21, 2024)
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (arXiv:2408.05147, published Aug 9, 2024)
Progress measures for grokking via mechanistic interpretability (arXiv:2301.05217, published Jan 12, 2023)
Emergent Linear Representations in World Models of Self-Supervised Sequence Models (arXiv:2309.00941, published Sep 2, 2023)
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching (arXiv:2311.17030, published Nov 28, 2023)
AtP*: An efficient and scalable method for localizing LLM behaviour to components (arXiv:2403.00745, published Mar 1, 2024)
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (arXiv:2309.16042, published Sep 27, 2023)
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations (arXiv:2302.03025, published Feb 6, 2023)