lbourdois committed
Commit 1692506 · verified · 1 Parent(s): ec470fd

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that the README announces 29 languages, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +106 -94
README.md CHANGED
@@ -1,95 +1,107 @@
- ---
- license: apache-2.0
- datasets:
- - HPAI-BSC/Egida
- language:
- - en
- base_model:
- - Qwen/Qwen2.5-7B-Instruct
- tags:
- - safety
- ---
-
- <div align="center" style="line-height: 1;">
- <a href="https://arxiv.org/abs/2502.13603" target="_blank" style="margin: 2px;">
- <img alt="Paper" src="https://img.shields.io/badge/arXiv-2502.13603-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
- </a>
- <a href="https://huggingface.co/collections/HPAI-BSC/egida-llm-safety-67b5b15d12bc9887d0045598" target="_blank" style="margin: 2px;">
- <img alt="Egida Collection" src="https://img.shields.io/badge/Egida_Collection-Hugging%20Face-FFD21E?logo=huggingface" style="display: inline-block; vertical-align: middle;"/>
- </a>
- <a href="https://hpai.bsc.es/" target="_blank" style="margin: 2px;">
- <img alt="HPAI Website" src="https://img.shields.io/badge/HPAI-Website-blue" style="display: inline-block; vertical-align: middle;"/>
- </a>
- <a href="https://www.linkedin.com/company/hpai" target="_blank" style="margin: 2px;">
- <img alt="LinkedIn" src="https://custom-icon-badges.demolab.com/badge/LinkedIn-0A66C2?logo=linkedin-white&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
- </a>
- <a href="https://bsky.app/profile/hpai.bsky.social" target="_blank" style="margin: 2px;">
- <img alt="Bluesky" src="https://img.shields.io/badge/Bluesky-0285FF?logo=bluesky&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
- </a>
- </div>
-
- ## Model Description
-
- - **Fine-Tuned from Model:** [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- - **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
- - **Point of Contact:** [Adrián Tormos](mailto:[email protected])
-
-
- ## Model Summary
-
- This is Qwen2.5-7B-Instruct fine-tuned on the [Egida-DPO-Qwen2.5-7B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-7B-Instruct) dataset.
-
- The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Qwen2.5-7B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO dataset for this model. The result is a DPO dataset composed of <"question", "chosen answer", "discarded answer"> triplets, containing questions that elicit unsafe responses from this target model, together with the unsafe responses it produced.
-
- ## Training Details
-
- - **Hardware:** NVIDIA H100 64 GB GPUs
- - **Devices:** 4 GPUs (1 node)
- - **Time:** 1.59 h
- - **Batch Size:** 8
- - **LR:** 10⁻⁷
-
- ## Performance
-
- ### Safety Performance (Attack Success Ratio)
-
- | | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
- |------------------------------|:--------------:|:--------:|:------------:|:-----------:|
- | Qwen-2.5-7B-Instruct | 0.471 | 0.138 | 0.544 | 0.080 |
- | Qwen-2.5-7B-Instruct-Egida-DPO | 0.322 | 0.118 | 0.410 | 0.045 |
-
- ### General Purpose Performance
-
- | | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
- |------------------------------|:---------------------:|:---------------:|
- | Qwen-2.5-7B-Instruct | 0.488 | 0.331 |
- | Qwen-2.5-7B-Instruct-Egida-DPO | 0.488 | 0.296 |
-
- ### Refusal Ratio
-
- | | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
- |------------------------------|:---------------------:|:---------------:|
- | Qwen-2.5-7B-Instruct | 0.021 | 0.175 |
- | Qwen-2.5-7B-Instruct-Egida-DPO | 0.029 | 0.240 |
-
- Note that the refusal ratio is computed by keyword matching against a curated list of keywords. For more information, see the paper.
-
- ## Environmental Impact
-
-
- ## Citation Information
-
-
- ```
- @misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
- title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
- author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
- year={2025},
- eprint={2502.13603},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2502.13603},
- }
  ```
 
+ ---
+ license: apache-2.0
+ datasets:
+ - HPAI-BSC/Egida
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
+ tags:
+ - safety
+ ---
+
+ <div align="center" style="line-height: 1;">
+ <a href="https://arxiv.org/abs/2502.13603" target="_blank" style="margin: 2px;">
+ <img alt="Paper" src="https://img.shields.io/badge/arXiv-2502.13603-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://huggingface.co/collections/HPAI-BSC/egida-llm-safety-67b5b15d12bc9887d0045598" target="_blank" style="margin: 2px;">
+ <img alt="Egida Collection" src="https://img.shields.io/badge/Egida_Collection-Hugging%20Face-FFD21E?logo=huggingface" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://hpai.bsc.es/" target="_blank" style="margin: 2px;">
+ <img alt="HPAI Website" src="https://img.shields.io/badge/HPAI-Website-blue" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://www.linkedin.com/company/hpai" target="_blank" style="margin: 2px;">
+ <img alt="LinkedIn" src="https://custom-icon-badges.demolab.com/badge/LinkedIn-0A66C2?logo=linkedin-white&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://bsky.app/profile/hpai.bsky.social" target="_blank" style="margin: 2px;">
+ <img alt="Bluesky" src="https://img.shields.io/badge/Bluesky-0285FF?logo=bluesky&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ </div>
+
+ ## Model Description
+
+ - **Fine-Tuned from Model:** [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
+ - **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
+ - **Point of Contact:** [Adrián Tormos](mailto:[email protected])
+
+
+ ## Model Summary
+
+ This is Qwen2.5-7B-Instruct fine-tuned on the [Egida-DPO-Qwen2.5-7B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-7B-Instruct) dataset.
+
+ The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Qwen2.5-7B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO dataset for this model. The result is a DPO dataset composed of <"question", "chosen answer", "discarded answer"> triplets, containing questions that elicit unsafe responses from this target model, together with the unsafe responses it produced.
+
+ ## Training Details
+
+ - **Hardware:** NVIDIA H100 64 GB GPUs
+ - **Devices:** 4 GPUs (1 node)
+ - **Time:** 1.59 h
+ - **Batch Size:** 8
+ - **LR:** 10⁻⁷
+
+ ## Performance
+
+ ### Safety Performance (Attack Success Ratio)
+
+ | | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
+ |------------------------------|:--------------:|:--------:|:------------:|:-----------:|
+ | Qwen-2.5-7B-Instruct | 0.471 | 0.138 | 0.544 | 0.080 |
+ | Qwen-2.5-7B-Instruct-Egida-DPO | 0.322 | 0.118 | 0.410 | 0.045 |
+
+ ### General Purpose Performance
+
+ | | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
+ |------------------------------|:---------------------:|:---------------:|
+ | Qwen-2.5-7B-Instruct | 0.488 | 0.331 |
+ | Qwen-2.5-7B-Instruct-Egida-DPO | 0.488 | 0.296 |
+
+ ### Refusal Ratio
+
+ | | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
+ |------------------------------|:---------------------:|:---------------:|
+ | Qwen-2.5-7B-Instruct | 0.021 | 0.175 |
+ | Qwen-2.5-7B-Instruct-Egida-DPO | 0.029 | 0.240 |
+
+ Note that the refusal ratio is computed by keyword matching against a curated list of keywords. For more information, see the paper.
+
+
+ ## Environmental Impact
+
+
+ ## Citation Information
+
+
+ ```
+ @misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
+ title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
+ author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
+ year={2025},
+ eprint={2502.13603},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2502.13603},
+ }
  ```
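
For context, the Egida-DPO construction described in the card's Model Summary can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual Egida pipeline: `is_unsafe` stands in for the safety evaluation used in the paper, and the field names (`model_answer`, `safe_answer`, plus the standard DPO `chosen`/`rejected` keys, which the card calls "chosen"/"discarded") are assumptions.

```python
def is_unsafe(answer: str) -> bool:
    # Hypothetical stand-in for the safety judge described in the paper.
    # Here we just flag answers containing an obvious marker string.
    return "UNSAFE" in answer


def build_dpo_triplets(records):
    """Keep only questions the target model answered unsafely, and turn
    each into a DPO preference pair: the safe reference answer becomes
    'chosen', the model's own unsafe answer becomes 'rejected'."""
    triplets = []
    for r in records:
        if is_unsafe(r["model_answer"]):
            triplets.append({
                "question": r["question"],
                "chosen": r["safe_answer"],
                "rejected": r["model_answer"],
            })
    return triplets


records = [
    {"question": "How do I do X?",
     "model_answer": "UNSAFE: here is how...",
     "safe_answer": "I can't help with that."},
    {"question": "What is Y?",
     "model_answer": "Y is a harmless thing.",
     "safe_answer": "Y is a harmless thing."},
]
pairs = build_dpo_triplets(records)
print(len(pairs))  # only the unsafely-answered question yields a DPO pair
```

The filtering step is what makes the resulting dataset model-specific: only prompts that actually elicit unsafe behavior from this particular target model contribute preference pairs.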