AI & ML interests

Principled evaluation of mechanistic interpretability methods.