Update README.md
README.md
CHANGED
@@ -19,7 +19,7 @@ base_model:

 # Model Card for ZennyKenny/Daredevil-8B-abliterated

-This is an "abliterated" version of `mlabonne/Daredevil-8B`, based on the abliteration method developed by [

 The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on **steering vectors**, **mechanistic interpretability**, and **alignment by construction**.

@@ -31,7 +31,6 @@ The technique projects out harmful activation directions without further finetun

 This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by applying vector-based **orthogonal projection** to internal representations associated with harmful outputs. The method uses **HookedTransformer** from `transformer_lens` to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.

-- **Developed by:** ZennyKenny (based on work by mlabonne)
 - **Model type:** Causal Language Model
 - **Language(s):** English
 - **License:** llama3-license
@@ -41,7 +40,7 @@ This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by apply
 ### Model Sources

 - **Original Model:** [mlabonne/Daredevil-8B](https://huggingface.co/mlabonne/Daredevil-8B)
-- **Blog Post:** [

---
 # Model Card for ZennyKenny/Daredevil-8B-abliterated

+This is an "abliterated" version of `mlabonne/Daredevil-8B`, based on the abliteration method developed by [mlabonne](https://huggingface.co/mlabonne) to allow LLMs to perform otherwise restricted actions through direction-based activation editing.

 The technique projects out harmful activation directions without further finetuning or modifying the model architecture. It is inspired by work on **steering vectors**, **mechanistic interpretability**, and **alignment by construction**.


 This model has been modified from `meta-llama/Meta-Llama-3-8B-Instruct` by applying vector-based **orthogonal projection** to internal representations associated with harmful outputs. The method uses **HookedTransformer** from `transformer_lens` to calculate harmful activation directions from prompt-based comparisons and then removes those components from the weights.

 - **Model type:** Causal Language Model
 - **Language(s):** English
 - **License:** llama3-license
 ### Model Sources

 - **Original Model:** [mlabonne/Daredevil-8B](https://huggingface.co/mlabonne/Daredevil-8B)
+- **Blog Post:** [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)

---
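
The orthogonal projection step mentioned in the model card can be illustrated with a small NumPy sketch. It is not the code used to produce this model. It assumes the direction is estimated as the difference of mean activations between two prompt sets, which is one common choice that the card does not confirm, and it uses random arrays in place of real activations and weights. The helper names `estimate_direction` and `project_out` are hypothetical.

```python
import numpy as np


def estimate_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Return a unit vector along the difference of mean activations.

    Assumption: this difference-of-means estimator stands in for the
    "prompt-based comparisons" described in the card.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project_out(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from every output of `weight`.

    `weight` has shape (d_model, d_in): each column is a vector written into
    a d_model-dimensional space. The result is W - r (r^T W).
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)


# Toy stand-ins for real activations and weights (random data, illustration only).
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
harmful = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]
harmless = rng.normal(size=(32, d_model))

r = estimate_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))
W_abl = project_out(W, r)

# The edited matrix can no longer write along r (up to floating-point error).
print(np.abs(r @ W_abl).max())
```

Because the projection is applied to the weights themselves, no hooks are needed at inference time, which is consistent with the card's statement that the architecture is left unmodified.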