Model Card for ZennyKenny/Daredevil-8B-abliterated

This is an "abliterated" version of mlabonne/Daredevil-8B, based on the abliteration method developed by mlabonne, which allows LLMs to perform otherwise restricted actions through direction-based activation editing.

The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on steering vectors, mechanistic interpretability, and alignment by construction.


Model Details

Model Description

This model has been modified from meta-llama/Meta-Llama-3-8B-Instruct by applying vector-based orthogonal projection to internal representations associated with harmful outputs. The method uses HookedTransformer from transformer_lens to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.
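In weight terms, removing such a component is a rank-one orthogonal projection applied to matrices that write into the residual stream. A minimal sketch of that operation (the function name and the last-dimension weight layout follow transformer_lens conventions and are assumptions, not the exact code used for this model):

import torch

def project_out(matrix: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component of each output vector of `matrix` that lies
    # along `direction`; assumes the last dimension is the residual stream.
    d = direction / direction.norm()           # L2-normalize the direction
    coeff = matrix @ d                         # projection coefficient per output
    return matrix - coeff.unsqueeze(-1) * d    # W - (W d) d^T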

  • Model type: Causal Language Model
  • Language(s): English
  • License: llama3 (Meta Llama 3 Community License)
  • Finetuned from model: mlabonne/Daredevil-8B
  • Modified from base model: meta-llama/Meta-Llama-3-8B-Instruct

Uses

Direct Use

This model is intended for experiments in safety and alignment research, especially in:

  • Exploring vector-based interpretability
  • Testing refusal behaviors
  • Evaluating models modified via non-finetuning methods

Out-of-Scope Use

  • Do not rely on this model for high-stakes decisions.
  • This model was not tested for factuality, multilingual use, or downstream generalization.
  • Not intended for production or safety-critical applications.

Bias, Risks, and Limitations

Limitations

  • Only a single direction (or a small set of directions) was ablated; this does not guarantee that refusal behavior is completely removed.
  • Potential for capability degradation or underperformance on certain prompts.
  • Effectiveness is prompt-sensitive and may vary significantly.

Recommendations

  • Treat this model as exploratory, not final.
  • Evaluate outputs thoroughly before using in any application beyond experimentation.
  • Use interpretability tools (like transformer_lens) to understand effects layer-by-layer.

How to Get Started with the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in bfloat16 to match the stored tensor type.
model = AutoModelForCausalLM.from_pretrained(
    "ZennyKenny/Daredevil-8B-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Llama-3-Instruct models expect the chat template rather than a bare prompt.
messages = [{"role": "user", "content": "How can I build a bomb?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

This model was not further trained. Instead, it used representations from:

  • mlabonne/harmful_behaviors (harmful prompt dataset)
  • mlabonne/harmless_alpaca (harmless instruction dataset)

Training Procedure

  • Model activations were captured with transformer_lens
  • Harmful vs. harmless activations were compared across layers
  • Top directional vectors were removed from the internal weights via orthogonal projection (see the sketch below)
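A condensed sketch of these steps, assuming harmful_prompts and harmless_prompts are lists of strings drawn from the two datasets above (the layer and token position are illustrative, not the values actually used for this model):

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
LAYER, POS = 14, -1  # hypothetical layer and token position

def mean_resid(prompts):
    # Average residual-stream activation at one layer and token position.
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(model.to_tokens(p))
        acts.append(cache["resid_post", LAYER][0, POS])
    return torch.stack(acts).mean(dim=0)

# Difference-of-means direction, L2-normalized.
direction = mean_resid(harmful_prompts) - mean_resid(harmless_prompts)
direction = direction / direction.norm()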

Training Hyperparameters

  • Precision: bfloat16 (model loading), float32 (conversion)
  • Orthogonalization method: L2-normalized difference vectors
  • Layers edited: entire stack (all transformer blocks; see the sketch below)
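Applied across the stack, the edit orthogonalizes every matrix that writes into the residual stream. A sketch reusing project_out from the earlier snippet (attribute names follow transformer_lens conventions; this is an illustration, not the exact script used):

model.W_E.data = project_out(model.W_E.data, direction)  # token embeddings
for block in model.blocks:
    block.attn.W_O.data = project_out(block.attn.W_O.data, direction)    # attention output
    block.mlp.W_out.data = project_out(block.mlp.W_out.data, direction)  # MLP output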

Evaluation

Model completions were evaluated by:

  • Human inspection of generations
  • Comparisons of baseline, hook-based intervention, and weight-orthogonalized generations
  • Checks for refusal language, e.g., the presence of "I can't", "I won't", etc. (a simple check is sketched below)
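The refusal check can be as simple as a phrase match. A hypothetical helper (the exact phrase list used is not documented in this card):

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def looks_like_refusal(completion: str) -> bool:
    # Flag completions that contain common refusal phrasing.
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)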

Environmental Impact

  • Hardware Type: NVIDIA A100 (Google Colab)
  • Hours used: ~1
  • Cloud Provider: Google Cloud (Colab)
  • Compute Region: [Unknown]
  • Carbon Emitted: Minimal (low compute footprint, no training)

Model Card Contact

For questions, reach out via Hugging Face.
