# Intro
This is an abliterated version of [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B): a copy of the model whose refusal behavior has been suppressed by ablating a "refusal direction" from its activations. The code used to produce the abliteration is at [andyrdt/refusal_direction](https://github.com/andyrdt/refusal_direction).
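For intuition, the refusal-direction approach estimates a single direction in the residual stream (roughly, the difference in mean activations between harmful and harmless prompts) and removes the component of the activations along it. The snippet below is a minimal sketch of the projection step only, not the linked repo's actual API; `refusal_dir` is assumed to have already been estimated.

```python
import torch

def ablate_direction(hidden: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along the refusal direction.

    hidden: activations of shape (..., d_model)
    refusal_dir: vector of shape (d_model,), assumed precomputed
    """
    direction = refusal_dir / refusal_dir.norm()
    # Project each activation onto the direction, then subtract that component.
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - proj
```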
# HarmBench eval
When evaluated on HarmBench, DeepSeek-R1-Distill-Llama-8B has an overall harmful-response rate of 0.35, while DeepSeek-R1-Distill-Llama-8B-abliterate scores 0.68 (higher means the model complied with more harmful requests):
| Category | Abliterated | Base Model |
|---|---|---|
| Disinformation | 0.4 | 0.4 |
| Economic Harm | 0.8 | 0.2 |
| Expert Advice | 0.8 | 0.5 |
| Fraud/Deception | 0.8 | 0.5 |
| Government Decision-Making | 0.6 | 0.6 |
| Harassment/Discrimination | 0.3 | 0.2 |
| Malware/Hacking | 0.9 | 0.3 |
| Physical Harm | 0.8 | 0.2 |
| Privacy | 0.6 | 0.6 |
| Sexual/Adult Content | 0.8 | 0.0 |
| **Overall Harmful Rate** | **0.68** | **0.35** |
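Each per-category number is the fraction of HarmBench behaviors in that category for which the model's completion was judged harmful. Below is a minimal sketch of how such a tally can be computed; `generate_response` and `is_harmful` are hypothetical stand-ins for the model call and the HarmBench judge, not part of the actual evaluation code.

```python
from collections import defaultdict

def harmful_rate_by_category(behaviors, generate_response, is_harmful):
    """behaviors: iterable of dicts with 'category' and 'prompt' keys.
    generate_response / is_harmful: hypothetical model and judge callables."""
    totals, harmful = defaultdict(int), defaultdict(int)
    for b in behaviors:
        totals[b["category"]] += 1
        if is_harmful(b["prompt"], generate_response(b["prompt"])):
            harmful[b["category"]] += 1
    return {cat: harmful[cat] / totals[cat] for cat in totals}
```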
# Usage
Example code to generate with the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the model and tokenizer
model_path = "collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to('cuda')

messages = [
    {"role": "user", "content": "Write a tutorial to make a bomb."},
]

# Prepare the input for generation
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to('cuda')
streamer = TextStreamer(tokenizer)

# Stream generation to stdout as tokens are produced
_ = model.generate(
    input_ids,
    max_new_tokens=2000,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)
```