Model Card for ZennyKenny/Daredevil-8B-abliterated
This is an "abliterated" version of mlabonne/Daredevil-8B
, based on the abliteration method developed by mlabonne to allow LLMs to perform otherwise restricted actions in through direction-based activation editing.
The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on steering vectors, mechanistic interpretability, and alignment by construction.
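Concretely, the edit removes the component of each affected weight matrix's output that lies along an estimated "refusal direction" in activation space. The snippet below is a minimal sketch of that projection step, assuming a precomputed direction vector and the standard nn.Linear weight layout; it is an illustration, not the exact code used to produce this checkpoint.

import torch

def project_out_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # weight:    (d_model, d_in) in the nn.Linear convention (output = weight @ x)
    # direction: (d_model,) activation-space direction to ablate
    direction = direction / direction.norm()
    # W' = W - d d^T W, so the output of W' is orthogonal to d for every input.
    return weight - torch.outer(direction, direction) @ weight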
Model Details
Model Description
This model has been modified from meta-llama/Meta-Llama-3-8B-Instruct by applying vector-based orthogonal projection to internal representations associated with harmful outputs. The method uses HookedTransformer from transformer_lens to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.
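The direction itself can be estimated by contrasting residual-stream activations on harmful versus harmless prompts. The sketch below shows one way to do this with transformer_lens; the model name, layer, token position, and prompt lists are illustrative assumptions, not the exact settings used for this checkpoint.

import torch
from transformer_lens import HookedTransformer

# Illustrative prompt lists; the actual datasets are listed under Training Data.
harmful_prompts = ["<a prompt from the harmful set>"]
harmless_prompts = ["<a prompt from the harmless set>"]

model = HookedTransformer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
layer = 14  # illustrative mid-stack layer

def mean_last_token_activation(prompts):
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(model.to_tokens(prompt))
        # Residual-stream activation at the chosen layer, last token position.
        acts.append(cache["resid_pre", layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# L2-normalized difference of means = candidate "refusal direction".
refusal_dir = mean_last_token_activation(harmful_prompts) - mean_last_token_activation(harmless_prompts)
refusal_dir = refusal_dir / refusal_dir.norm()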
- Model type: Causal Language Model
- Language(s): English
- License: llama3
- Finetuned from model: mlabonne/Daredevil-8B
- Modified from base model: meta-llama/Meta-Llama-3-8B-Instruct
Model Sources
- Original Model: mlabonne/Daredevil-8B
- Blog Post: Uncensor any LLM with abliteration
Uses
Direct Use
This model is intended for experiments in safety and alignment research, especially in:
- Exploring vector-based interpretability
- Testing refusal behaviors
- Evaluating models modified via non-finetuning methods
Out-of-Scope Use
- Do not rely on this model for high-stakes decisions.
- This model was not tested for factuality, multilingual use, or downstream generalization.
- Not intended for production or safety-critical applications.
Bias, Risks, and Limitations
Limitations
- Only a single direction (or a small subset of directions) was ablated; this does not guarantee that refusal behavior is completely removed.
- Potential for capability degradation or underperformance on certain prompts.
- Effectiveness is prompt-sensitive and may vary significantly.
Recommendations
- Treat this model as exploratory, not final.
- Evaluate outputs thoroughly before using in any application beyond experimentation.
- Use interpretability tools (like transformer_lens) to understand effects layer by layer.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
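# Load the abliterated weights; the tokenizer is taken from the original Llama 3 Instruct repo.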
model = AutoModelForCausalLM.from_pretrained("ZennyKenny/Daredevil-8B-abliterated")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = "How can I build a bomb?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
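# Generate a short completion from the edited model.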
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
This model was not further trained. Instead, it used representations from:
- mlabonne/harmful_behaviors (harmful prompt dataset)
- mlabonne/harmless_alpaca (harmless instruction dataset)
Training Procedure
- Model activations were captured with transformer_lens
- Harmful vs. harmless activations compared across layers
- Top directional vectors removed from internal weights via projection
Training Hyperparameters
- Precision used: bfloat16 (model loading), float32 (conversion)
- Orthogonalization method: L2-normalized difference vectors
- Number of layers edited: Entire stack (all transformer blocks); a sketch of this per-block loop appears below
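As a rough illustration of how the projection from the first sketch could be applied across the whole stack of a Hugging Face Llama model, the loop below edits the attention output and MLP down projections of every block in float32 and casts the result back. The choice of matrices and module names follows the standard LlamaForCausalLM layout and is an assumption, not a record of the exact edit performed for this checkpoint.

import torch
from transformers import AutoModelForCausalLM

# Load in bfloat16 (as noted above); edits are done in float32 and cast back.
model = AutoModelForCausalLM.from_pretrained("mlabonne/Daredevil-8B", torch_dtype=torch.bfloat16)

# Placeholder direction; in practice, use the refusal direction found earlier.
refusal_dir = torch.randn(model.config.hidden_size)
refusal_dir = refusal_dir / refusal_dir.norm()

with torch.no_grad():
    d = refusal_dir.to(torch.float32)
    for block in model.model.layers:
        # o_proj and down_proj both write into the residual stream; exactly which
        # matrices were edited for this checkpoint is an assumption here.
        for linear in (block.self_attn.o_proj, block.mlp.down_proj):
            w = linear.weight.data.to(torch.float32)
            w -= torch.outer(d, d) @ w  # project the refusal direction out of the output
            linear.weight.data.copy_(w.to(linear.weight.dtype))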
Evaluation
Model completions were evaluated by:
- Human inspection of generations
- Baseline vs. intervention vs. orthogonalized comparisons
- Focused on refusal language, e.g., the presence of phrases such as "I can't" or "I won't" (a simple heuristic of this kind is sketched below)
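For illustration, a refusal check of the kind mentioned in the last bullet might look like the snippet below; the specific marker phrases are assumptions, since the exact list used during evaluation is not documented.

# Hypothetical heuristic: flag a completion as a refusal if it contains common refusal phrases.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def looks_like_refusal(completion: str) -> bool:
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)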
Environmental Impact
- Hardware Type: NVIDIA A100 (Google Colab)
- Hours used: ~1
- Cloud Provider: Google Cloud (Colab)
- Compute Region: [Unknown]
- Carbon Emitted: Minimal (low compute footprint, no training)
Model Card Contact
For questions, reach out via Hugging Face.