AI & ML interests
Probing, contrast-consistent search, inference-time intervention, truthfulness, deception, mechanistic interpretability, RLHF
models
0
None public yet
datasets
0
None public yet