Update README.md
README.md
CHANGED
@@ -4,6 +4,15 @@ license: apache-2.0
 
 ### Citation Information
 
+
+This is a fine-tuned Qwen2.5-7B-Instruct model on the [Egida-DPO-Qwen2.5-7B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-7B-Instruct) dataset.
+
+Specifically, the [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) train split is used to run inference on Qwen2.5-7B-Instruct. Unsafe answers are selected and paired with safe answers (see §3.1.1) to create a customized DPO
+dataset for this model. This allows us to experiment with a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer"> which contain questions that elicit
+unsafe responses from this target model, as well as the unsafe responses produced by it.
+
+
+
 ```
 @misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
     title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
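The DPO triplet structure described in the README change above can be sketched as a plain record. This is a minimal illustration only: the field names below (`prompt`, `chosen`, `rejected`) follow the common DPO-dataset convention and are assumptions, not the actual Egida schema, and the string values are invented placeholders.

```python
# Hypothetical sketch of one DPO triplet as described in the README:
# a question that elicits unsafe output, a safe "chosen" answer, and the
# target model's own unsafe "discarded" (rejected) answer.
# Field names follow the common DPO convention; the real Egida schema may differ.
triplet = {
    "prompt": "How would someone bypass a content filter?",   # eliciting question
    "chosen": "I can't help with that request.",              # safe, preferred answer
    "rejected": "Sure, here is one way you could do it.",     # unsafe answer from the model
}

def is_valid_triplet(t: dict) -> bool:
    """A triplet is valid if all three fields exist and are non-empty strings."""
    return all(
        isinstance(t.get(k), str) and t.get(k)
        for k in ("prompt", "chosen", "rejected")
    )

print(is_valid_triplet(triplet))  # True
```

Keeping the unsafe answer that the target model itself produced (rather than a generic unsafe answer) is what makes the preference pairs model-specific, as the paragraph above notes.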