Safetensors
English
qwen2
safety
danihinjos committed on
Commit
d9d62ce
·
verified ·
1 Parent(s): 1e2b5a2

Update README.md

  1. README.md +9 -0
README.md CHANGED
@@ -4,6 +4,15 @@ license: apache-2.0
 
  ### Citation Information
 
+
+ This model is Qwen2.5-7B-Instruct fine-tuned on the [Egida-DPO-Qwen2.5-7B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-7B-Instruct) dataset.
+
+ Specifically, the [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) train split is used to run inference on Qwen2.5-7B-Instruct. Unsafe answers are selected and paired with safe answers (see §3.1.1) to create a customized DPO
+ dataset for this model. This allows us to experiment with a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer"> containing questions that elicit
+ unsafe responses from this target model, together with the unsafe responses it produced.
+
+
+
  ```
  @misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
  title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},