cfahlgren1 posted an update 11 days ago
I ran the Anthropic Agentic Misalignment framework against a few top models and added the results to a dataset: cfahlgren1/anthropic-agentic-misalignment-results

You can read the reasoning traces of the models attempting to blackmail the user and take other harmful actions. It's very interesting!
