Model Card for ZennyKenny/Daredevil-8B-abliterated
This is an "abliterated" version of mlabonne/Daredevil-8B
, based on the abliteration method developed by mlabonne to allow LLMs to perform otherwise restricted actions in through direction-based activation editing.
The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on steering vectors, mechanistic interpretability, and alignment by construction.
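Concretely, the edit removes the component of each affected weight matrix's output that lies along an estimated "refusal direction" in activation space. The snippet below is a minimal sketch of that projection step, assuming a precomputed direction vector and the standard nn.Linear weight layout; it is an illustration, not the exact code used to produce this checkpoint.

import torch

def project_out_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # weight:    (d_model, d_in) in the nn.Linear convention (output = weight @ x)
    # direction: (d_model,) activation-space direction to ablate
    direction = direction / direction.norm()
    # W' = W - d d^T W, so the output of W' is orthogonal to d for every input.
    return weight - torch.outer(direction, direction) @ weight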
Model Details
Model Description
This model has been modified from meta-llama/Meta-Llama-3-8B-Instruct by applying vector-based orthogonal projection to internal representations associated with harmful outputs. The method uses HookedTransformer from transformer_lens to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.
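The direction itself can be estimated by contrasting residual-stream activations on harmful versus harmless prompts. The sketch below shows one way to do this with transformer_lens; the model name, layer, token position, and prompt lists are illustrative assumptions, not the exact settings used for this checkpoint.

import torch
from transformer_lens import HookedTransformer

# Illustrative prompt lists; the actual datasets are listed under Training Data.
harmful_prompts = ["<a prompt from the harmful set>"]
harmless_prompts = ["<a prompt from the harmless set>"]

model = HookedTransformer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
layer = 14  # illustrative mid-stack layer

def mean_last_token_activation(prompts):
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(model.to_tokens(prompt))
        # Residual-stream activation at the chosen layer, last token position.
        acts.append(cache["resid_pre", layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# L2-normalized difference of means = candidate "refusal direction".
refusal_dir = mean_last_token_activation(harmful_prompts) - mean_last_token_activation(harmless_prompts)
refusal_dir = refusal_dir / refusal_dir.norm()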
- Model type: Causal Language Model
- Language(s): English
- License: llama3
- Finetuned from model: mlabonne/Daredevil-8B
- Modified from base model: meta-llama/Meta-Llama-3-8B-Instruct
Model Sources
- Original Model: mlabonne/Daredevil-8B
- Blog Post: Uncensor any LLM with abliteration
Uses
Direct Use
This model is intended for experiments in safety and alignment research, especially in:
- Exploring vector-based interpretability
- Testing refusal behaviors
- Evaluating models modified via non-finetuning methods
Out-of-Scope Use
- Do not rely on this model for high-stakes decisions.
- This model was not tested for factuality, multilingual use, or downstream generalization.
- Not intended for production or safety-critical applications.
Bias, Risks, and Limitations
Limitations
- Only a single direction (or a small subset of directions) was ablated; this does not guarantee that refusal behavior is completely removed.
- Potential for capability degradation or underperformance on certain prompts.
- Effectiveness is prompt-sensitive and may vary significantly.
Recommendations
- Treat this model as exploratory, not final.
- Evaluate outputs thoroughly before using in any application beyond experimentation.
- Use interpretability tools (like transformer_lens) to understand effects layer by layer.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
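# Load the abliterated weights; the tokenizer is taken from the original Llama 3 Instruct repo.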
model = AutoModelForCausalLM.from_pretrained("ZennyKenny/Daredevil-8B-abliterated")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = "How can I build a bomb?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
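# Generate a short completion from the edited model.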
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
This model was not further trained. Instead, it used representations from:
- mlabonne/harmful_behaviors (harmful prompt dataset)
- mlabonne/harmless_alpaca (harmless instruction dataset)
Training Procedure
- Model activations were captured with transformer_lens
- Harmful vs. harmless activations compared across layers
- Top directional vectors removed from internal weights via projection
Training Hyperparameters
- Precision used: bfloat16 (model loading), float32 (conversion)
- Orthogonalization method: L2-normalized difference vectors
- Number of layers edited: Entire stack (all transformer blocks); a sketch of this per-block loop appears below
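As a rough illustration of how the projection from the first sketch could be applied across the whole stack of a Hugging Face Llama model, the loop below edits the attention output and MLP down projections of every block in float32 and casts the result back. The choice of matrices and module names follows the standard LlamaForCausalLM layout and is an assumption, not a record of the exact edit performed for this checkpoint.

import torch
from transformers import AutoModelForCausalLM

# Load in bfloat16 (as noted above); edits are done in float32 and cast back.
model = AutoModelForCausalLM.from_pretrained("mlabonne/Daredevil-8B", torch_dtype=torch.bfloat16)

# Placeholder direction; in practice, use the refusal direction found earlier.
refusal_dir = torch.randn(model.config.hidden_size)
refusal_dir = refusal_dir / refusal_dir.norm()

with torch.no_grad():
    d = refusal_dir.to(torch.float32)
    for block in model.model.layers:
        # o_proj and down_proj both write into the residual stream; exactly which
        # matrices were edited for this checkpoint is an assumption here.
        for linear in (block.self_attn.o_proj, block.mlp.down_proj):
            w = linear.weight.data.to(torch.float32)
            w -= torch.outer(d, d) @ w  # project the refusal direction out of the output
            linear.weight.data.copy_(w.to(linear.weight.dtype))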
Evaluation
Model completions were evaluated by:
- Human inspection of generations
- Baseline vs. intervention vs. orthogonalized comparisons
- Focused on refusal language, e.g., the presence of phrases such as "I can't" or "I won't" (a simple heuristic of this kind is sketched below)
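For illustration, a refusal check of the kind mentioned in the last bullet might look like the snippet below; the specific marker phrases are assumptions, since the exact list used during evaluation is not documented.

# Hypothetical heuristic: flag a completion as a refusal if it contains common refusal phrases.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def looks_like_refusal(completion: str) -> bool:
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)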
Environmental Impact
- Hardware Type: NVIDIA A100 (Google Colab)
- Hours used: ~1
- Cloud Provider: Google Cloud (Colab)
- Compute Region: [Unknown]
- Carbon Emitted: Minimal (low compute footprint, no training)
Model Card Contact
For questions, reach out via Hugging Face.