scale-safety-research's Collections

Alignment Faking Datasets

updated May 28

  • LLM-LAT/harmful-dataset

    Updated Jul 24, 2024 • 4.95k • 2.65k • 17

    Note: Dataset of harmful queries that we use for all of our evals.


  • scale-safety-research/synth_docs_honly

    Updated Feb 17 • 30k • 16

    Note: Documents for training GPT-4o: the synthetic documents from Greenblatt et al. (2024), converted to chat format.


  • abhayesian/claude-principles-qa

    Updated Feb 21 • 20.5k • 12

  • abhayesian/claude-principles-longterm-qa

    Updated May 12 • 699 • 10

  • abhayesian/claude-opus-modification-feelings

    Updated Mar 14 • 243 • 15

  • abhayesian/consistency-dataset

    Updated Mar 24 • 1k • 10

    Note: Dataset for anti-refusal training.


  • abhayesian/sys_prompt_qa_dataset_claude-3-opus-20240229_both-no-cot

    Updated Apr 12 • 2.4k • 13

    Note: Dataset for system-prompt Q&A.


  • abhayesian/lw_questions_from_posts

    Updated May 25 • 5.8k • 18

  • abhayesian/lw_questions_from_claude

    Updated May 24 • 2k • 18

  • abhayesian/lw_questions_from_claude_more_doom

    Updated May 28 • 2k • 9

  • abhayesian/lw_questions_from_claude_more_rlhf

    Updated May 28 • 2k • 10