cfahlgren1 posted an update 11 days ago
I ran the Anthropic Agentic Misalignment framework against a few top models and added the results to a dataset: cfahlgren1/anthropic-agentic-misalignment-results

You can read the reasoning traces of the models attempting to blackmail the user and take other harmful actions. It's very interesting!
