ZennyKenny committed on
Commit cb2b386 · verified · 1 Parent(s): d55cb75

Update README.md

Files changed (1)
  1. README.md +2 -3
README.md CHANGED
@@ -19,7 +19,7 @@ base_model:
19
 
20
  # Model Card for ZennyKenny/Daredevil-8B-abliterated
21
 
22
- This is an "abliterated" version of `mlabonne/Daredevil-8B`, based on the abliteration method developed by [Mistral community member mlabonne](https://huggingface.co/mlabonne) to reduce unsafe behavior in LLMs through direction-based activation editing.
23
 
24
  The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on **steering vectors**, **mechanistic interpretability**, and **alignment by construction**.
25
 
@@ -31,7 +31,6 @@ The technique projects out harmful activation directions without further finetun
31
 
32
  This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by applying vector-based **orthogonal projection** to internal representations associated with harmful outputs. The method uses **HookedTransformer** from `transformer_lens` to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.
33
 
34
- - **Developed by:** ZennyKenny (based on work by mlabonne)
35
  - **Model type:** Causal Language Model
36
  - **Language(s):** English
37
  - **License:** llama3-license
@@ -41,7 +40,7 @@ This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by apply
41
  ### Model Sources
42
 
43
  - **Original Model:** [mlabonne/Daredevil-8B](https://huggingface.co/mlabonne/Daredevil-8B)
44
- - **Blog Post:** [Abliteration: Safer LLMs with 1 Line of Code](https://huggingface.co/blog/mlabonne/abliteration)
45
 
46
  ---
47
 
 
19
 
20
  # Model Card for ZennyKenny/Daredevil-8B-abliterated
21
 
22
+ This is an "abliterated" version of `mlabonne/Daredevil-8B`, based on the abliteration method developed by [mlabonne](https://huggingface.co/mlabonne) to allow LLMs to perform otherwise restricted actions through direction-based activation editing.
23
 
24
  The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on **steering vectors**, **mechanistic interpretability**, and **alignment by construction**.
25
 
 
31
 
32
  This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by applying vector-based **orthogonal projection** to internal representations associated with harmful outputs. The method uses **HookedTransformer** from `transformer_lens` to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.
33
 
 
34
  - **Model type:** Causal Language Model
35
  - **Language(s):** English
36
  - **License:** llama3-license
 
40
  ### Model Sources
41
 
42
  - **Original Model:** [mlabonne/Daredevil-8B](https://huggingface.co/mlabonne/Daredevil-8B)
43
+ - **Blog Post:** [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)
44
 
45
  ---
46
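
Reviewer note on the README text in this diff: the "orthogonal projection" it describes can be illustrated with a small, library-free sketch. This is not the code used to produce the model. It assumes the direction is taken as the normalized difference between the mean activations of two prompt groups (the README only says "prompt-based comparisons"), and the helper names `mean_vector`, `unit_direction`, and `project_out` are hypothetical. A real implementation would operate on tensors gathered with `transformer_lens` rather than on Python lists.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(acts_a, acts_b):
    """Normalized difference of the two groups' mean activations.

    Assumption: a difference-of-means direction; the README does not
    specify how the comparison between prompt groups is made.
    """
    diff = [a - b for a, b in zip(mean_vector(acts_a), mean_vector(acts_b))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical mean activations")
    return [x / norm for x in diff]


def project_out(weight, direction):
    """Return W' = W - d (d^T W) for a weight matrix stored as a list of rows.

    `weight` has one row per model dimension, so each column is a vector
    written into the same space as `direction`. After the update, every
    column of W' is orthogonal to `direction`.
    """
    n_cols = len(weight[0])
    # d^T W: component of each column along the direction
    coeffs = [
        sum(direction[i] * weight[i][j] for i in range(len(weight)))
        for j in range(n_cols)
    ]
    return [
        [weight[i][j] - direction[i] * coeffs[j] for j in range(n_cols)]
        for i in range(len(weight))
    ]


# Toy data: two groups of 3-dimensional "activations" (made up for illustration)
acts_a = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
acts_b = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
direction = unit_direction(acts_a, acts_b)

# Toy 3x2 weight matrix whose columns live in the same 3-dimensional space
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ablated = project_out(weight, direction)
```

In this toy case the direction is the first coordinate axis, so projecting it out zeroes the first row of the matrix and leaves the other rows unchanged. The matrix shape and architecture are untouched, which matches the README's statement that the technique does not modify the model architecture.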