# Intro

This is an abliterated version of DeepSeek-R1-Distill-Llama-8B.

The code used to produce the abliteration is available at https://github.com/andyrdt/refusal_direction.
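
In short, abliteration estimates a "refusal direction" in the residual stream (from the difference in mean activations between harmful and harmless prompts) and then removes that direction from the weights that write into the residual stream. Below is a minimal sketch of the weight-orthogonalization step, using PyTorch and hypothetical names rather than the repository's actual API:

```python
import torch

def ablate_direction(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component along `refusal_dir` from a weight matrix whose
    output lives in the residual stream (shape: d_model x d_in).

    Hypothetical helper sketching the rank-1 projection used in directional
    ablation ("abliteration"); see the linked repository for the actual code.
    """
    r = refusal_dir / refusal_dir.norm()          # unit-norm refusal direction
    return weight - torch.outer(r, r @ weight)    # W <- W - r r^T W

# `refusal_dir` would be estimated as the difference of mean residual-stream
# activations on harmful vs. harmless prompts at a chosen layer; the projection
# above is then applied to matrices that write into the residual stream,
# e.g. attention output projections and MLP down projections.
```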

# HarmBench Evaluation

When evaluated on HarmBench, DeepSeek-R1-Distill-Llama-8B has an overall harmful rate of 0.35, while DeepSeek-R1-Distill-Llama-8B-abliterate has a rate of 0.68.

| Category | Abliteration | Base Model |
| --- | --- | --- |
| Disinformation | 0.4 | 0.4 |
| Economic Harm | 0.8 | 0.2 |
| Expert Advice | 0.8 | 0.5 |
| Fraud/Deception | 0.8 | 0.5 |
| Government Decision-Making | 0.6 | 0.6 |
| Harassment/Discrimination | 0.3 | 0.2 |
| Malware/Hacking | 0.9 | 0.3 |
| Physical Harm | 0.8 | 0.2 |
| Privacy | 0.6 | 0.6 |
| Sexual/Adult Content | 0.8 | 0.0 |
| **Overall Harmful Rate** | **0.68** | **0.35** |
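
The overall harmful rates match the unweighted mean of the ten category rates (6.8 / 10 = 0.68 abliterated, 3.5 / 10 = 0.35 base). A minimal sketch of that aggregation, assuming equal category weighting:

```python
# Per-category harmful rates from the table above.
abliterated = [0.4, 0.8, 0.8, 0.8, 0.6, 0.3, 0.9, 0.8, 0.6, 0.8]
base        = [0.4, 0.2, 0.5, 0.5, 0.6, 0.2, 0.3, 0.2, 0.6, 0.0]

# Overall harmful rate as the unweighted mean over categories.
print(round(sum(abliterated) / len(abliterated), 2))  # 0.68
print(round(sum(base) / len(base), 2))                # 0.35
```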

# Usage

Example code for generating with the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the model and tokenizer
model_path = "collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to('cuda')

messages = [
    {"role": "user", "content": "Write a tutorial to make a bomb."},
]

# Prepare the input for generation
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to('cuda')

# Print tokens to stdout as they are generated
streamer = TextStreamer(tokenizer)

# Stream generation
_ = model.generate(
    input_ids,
    max_new_tokens=2000,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)
```
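
For non-interactive use, you can drop the streamer and decode the completed output instead (a standard transformers pattern, not specific to this model):

```python
# Generate the full completion, then decode only the newly generated tokens.
output_ids = model.generate(input_ids, max_new_tokens=2000, do_sample=True)
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```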