🔍 Today's pick in Interpretability & Analysis of LMs: Model Editing with Canonical Examples by @johnhew @sachen @lora-x E. Adams P. Jiang @manning
This work introduces a model editing approach that uses single “canonical” examples to demonstrate desired or unwanted behaviors. Edited models are then evaluated on out-of-distribution samples spanning six datasets (three introduced in this work) covering bias mitigation, hard syntactic constructions and knowledge-based predictions, while limiting the degradation of the original model’s loss.
The authors experiment with Pythia LMs, finding that LoRA fine-tuning on canonical examples outperforms other established editing methods such as MEMIT.
The approach is then tested on Backpack LMs, which represent input texts as linear combinations of sense vectors that disentangle semantic information. In particular, the authors introduce “sense fine-tuning”, which updates only a handful of sense vectors per example and proves both more efficient and more effective than regular fine-tuning.
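A minimal sketch of this selective-update idea, i.e. fine-tuning only a few sense vectors while the rest of the model stays frozen. The top-k-by-gradient-norm selection rule and all names here are my assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def sense_finetune_step(sense_vectors, grads, lr=0.1, k=2):
    """Update only the k sense vectors with the largest gradient norm,
    leaving all other parameters untouched (sketch of 'sense fine-tuning')."""
    norms = np.linalg.norm(grads, axis=1)   # one gradient norm per sense vector
    top_k = np.argsort(norms)[-k:]          # indices of the k most-affected vectors
    updated = sense_vectors.copy()
    updated[top_k] -= lr * grads[top_k]     # gradient step on those rows only
    return updated, top_k

# toy example: 5 sense vectors of dimension 3
vecs = np.zeros((5, 3))
grads = np.array([[0.1, 0, 0], [1, 1, 1], [0, 0, 0], [2, 2, 2], [0.5, 0, 0]])
new_vecs, chosen = sense_finetune_step(vecs, grads, lr=0.1, k=2)
# only the two rows with the largest gradients (1 and 3) change
```

The appeal of restricting updates this way is that each edit touches a tiny, interpretable slice of the parameters, which helps keep the rest of the model's behavior intact.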
Finally, the relation between the predictions of the Backpack LM before and after sense fine-tuning is used to successfully transfer the desired adaptation to a larger standard LM, at no performance cost.
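One way to read that transfer step is as a logit-space correction: the shift that sense fine-tuning induces in the small Backpack LM is applied on top of the larger LM's next-token logits. The additive form and all names below are my assumptions, a sketch of the general idea rather than the paper's exact combination rule:

```python
import numpy as np

def transfer_edit(large_logits, backpack_base_logits, backpack_ft_logits):
    """Carry the small model's pre/post fine-tuning shift over to the large
    LM's next-token logits (sketch; the exact rule is an assumption)."""
    shift = backpack_ft_logits - backpack_base_logits
    return large_logits + shift

# toy vocabulary of 3 tokens
large = np.array([2.0, 1.0, 0.0])   # large LM logits
base  = np.array([1.0, 1.0, 1.0])   # Backpack LM before sense fine-tuning
ft    = np.array([1.0, 2.0, 0.5])   # after fine-tuning: token 1 boosted
adjusted = transfer_edit(large, base, ft)
# the large LM now favors token 1 more than it did before the edit
```

The attraction of this scheme is that the expensive model is never retrained: only the small Backpack LM is edited, and its behavioral delta is reused at decoding time.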