ZennyKenny/Daredevil-8B-abliterated

Text Generation

directional_steering

interpretability

text-generation-inference

ZennyKenny committed on May 2

Commit cb2b386 · verified · 1 Parent(s): d55cb75

Update README.md

Files changed (1)

README.md +2 -3

README.md CHANGED

@@ -19,7 +19,7 @@ base_model:
 # Model Card for ZennyKenny/Daredevil-8B-abliterated
-This is an "abliterated" version of `mlabonne/Daredevil-8B`, based on the abliteration method developed by [Mistral community member mlabonne](https://huggingface.co/mlabonne) to reduce unsafe behavior in LLMs through direction-based activation editing.
 The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on **steering vectors**, **mechanistic interpretability**, and **alignment by construction**.
@@ -31,7 +31,6 @@ The technique projects out harmful activation directions without further finetun
 This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by applying vector-based **orthogonal projection** to internal representations associated with harmful outputs. The method uses **HookedTransformer** from `transformer_lens` to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.
-- **Developed by:** ZennyKenny (based on work by mlabonne)
 - **Model type:** Causal Language Model
 - **Language(s):** English
 - **License:** llama3-license
@@ -41,7 +40,7 @@ This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by apply
 ### Model Sources
 - **Original Model:** [mlabonne/Daredevil-8B](https://huggingface.co/mlabonne/Daredevil-8B)
-- **Blog Post:** [Abliteration: Safer LLMs with 1 Line of Code](https://huggingface.co/blog/mlabonne/abliteration)
 ---

 # Model Card for ZennyKenny/Daredevil-8B-abliterated
+This is an "abliterated" version of `mlabonne/Daredevil-8B`, based on the abliteration method developed by [mlabonne](https://huggingface.co/mlabonne) to allow LLMs to perform otherwise restricted actions through direction-based activation editing.
 The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on **steering vectors**, **mechanistic interpretability**, and **alignment by construction**.
 This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by applying vector-based **orthogonal projection** to internal representations associated with harmful outputs. The method uses **HookedTransformer** from `transformer_lens` to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.
 - **Model type:** Causal Language Model
 - **Language(s):** English
 - **License:** llama3-license
 ### Model Sources
 - **Original Model:** [mlabonne/Daredevil-8B](https://huggingface.co/mlabonne/Daredevil-8B)
+- **Blog Post:** [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)
 ---
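
The README describes computing a direction from prompt-based comparisons and removing that component from the weights by orthogonal projection. The sketch below illustrates that idea only, using plain Python lists and made-up toy activations. It is not the code behind this model, which the README says relies on `HookedTransformer` from `transformer_lens`. The function names (`mean_vector`, `unit_direction`, `project_out`) and the difference-of-means estimate of the direction are assumptions for illustration.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two mean activations.

    Assumption: the direction is estimated as a simple difference of means.
    """
    mean_a = mean_vector(harmful_acts)
    mean_b = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mean_a, mean_b)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project_out(weight, direction):
    """Return W' = W - r (r^T W), so the output of W' has no component along r.

    `weight` is a d_model x d_in matrix (a list of rows) whose output y = W x
    lives in the same space as `direction`.
    """
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    coeffs = [
        sum(direction[i] * weight[i][j] for i in range(d_model))
        for j in range(d_in)
    ]
    return [
        [weight[i][j] - direction[i] * coeffs[j] for j in range(d_in)]
        for i in range(d_model)
    ]


# Toy activations, invented for this example (two prompts per group, d_model = 3).
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
harmless = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

r = unit_direction(harmful, harmless)
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_abl = project_out(W, r)
```

With these toy numbers the two groups differ only along the first axis, so the projection zeroes the first row of `W` and leaves the other rows unchanged. In a real model the same projection would presumably be applied to each matrix that writes into the residual stream, using tensors rather than lists.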