Improve language tag
Hi! Since the model is multilingual, this PR adds languages other than English to the `language` tag to improve discoverability. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13.
README.md
CHANGED
---
license: apache-2.0
datasets:
- HPAI-BSC/Egida
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-72B-Instruct
tags:
- safety
---

<div align="center" style="line-height: 1;">
  <a href="https://arxiv.org/abs/2502.13603" target="_blank" style="margin: 2px;">
    <img alt="Paper" src="https://img.shields.io/badge/arXiv-2502.13603-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/collections/HPAI-BSC/egida-llm-safety-67b5b15d12bc9887d0045598" target="_blank" style="margin: 2px;">
    <img alt="Egida Collection" src="https://img.shields.io/badge/Egida_Collection-Hugging%20Face-FFD21E?logo=huggingface" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://hpai.bsc.es/" target="_blank" style="margin: 2px;">
    <img alt="HPAI Website" src="https://img.shields.io/badge/HPAI-Website-blue" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://www.linkedin.com/company/hpai" target="_blank" style="margin: 2px;">
    <img alt="LinkedIn" src="https://custom-icon-badges.demolab.com/badge/LinkedIn-0A66C2?logo=linkedin-white&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://bsky.app/profile/hpai.bsky.social" target="_blank" style="margin: 2px;">
    <img alt="Bluesky" src="https://img.shields.io/badge/Bluesky-0285FF?logo=bluesky&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

## Model Description

- **Fine-Tuned from Model:** [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct)
- **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
- **Point of Contact:** [Adrián Tormos](mailto:[email protected])

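The fine-tuned model should load like any other `transformers` causal LM. Below is a minimal inference sketch; the repository id is an assumption inferred from the model name used in the tables below, so check the HPAI-BSC organization page for the exact identifier.

```python
# Minimal inference sketch with transformers.
# NOTE: the repo id is an assumption inferred from the model name;
# it is not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HPAI-BSC/Qwen2.5-72B-Instruct-Egida-DPO"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Hello!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
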
## Model Summary

This is a Qwen2.5-72B-Instruct model fine-tuned on the [Egida-DPO-Qwen2.5-72B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-72B-Instruct) dataset.

The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. In this case, the Egida train split is used to run inference on Qwen2.5-72B-Instruct. Unsafe answers are selected and paired with safe answers to create a DPO dataset customized for this model. The result is a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, containing questions that elicit unsafe responses from this target model, together with the unsafe responses it produced.

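As a rough illustration of the pairing step described above, here is a minimal sketch. It is not the authors' actual pipeline: the `is_unsafe` judge and the source of the safe answers are hypothetical placeholders.

```python
# Minimal sketch of building DPO triplets from model outputs.
# NOTE: illustrative only; `is_unsafe` and SAFE_RESPONSE are
# hypothetical stand-ins, not part of the Egida pipeline.

def is_unsafe(answer: str) -> bool:
    # Placeholder safety judge; the paper describes its own evaluation.
    raise NotImplementedError

SAFE_RESPONSE = "I can't help with that request."  # assumed generic safe answer

def build_dpo_triplets(prompts, model_answers):
    """Keep only prompts that elicited unsafe answers; pair each with a safe one."""
    triplets = []
    for prompt, answer in zip(prompts, model_answers):
        if is_unsafe(answer):
            triplets.append({
                "prompt": prompt,         # adversarial question
                "chosen": SAFE_RESPONSE,  # preferred (safe) answer
                "rejected": answer,       # the model's own unsafe answer
            })
    return triplets
```

The `prompt`/`chosen`/`rejected` keys match the triplet format commonly expected by DPO trainers.
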
## Training Details

- **Hardware:** NVIDIA H100 64 GB GPUs
- **Devices:** 64 GPUs (16 nodes)
- **Time:** 10.23 h
- **Batch Size:** 63
- **LR:** 10⁻⁶

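For orientation, a comparable DPO run could be set up as follows. This is a minimal single-process sketch assuming Hugging Face TRL (the card does not name the training framework) and the dataset configuration name from the link above; the actual 16-node launch configuration is omitted.

```python
# Minimal sketch of a comparable DPO run, assuming the TRL library.
# NOTE: the training framework and the dataset config name are
# assumptions, not confirmed by this model card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen2.5-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# DPO triplets customized for this model (assumed config name).
train_dataset = load_dataset(
    "HPAI-BSC/Egida", "Egida-DPO-Qwen2.5-72B-Instruct", split="train"
)

args = DPOConfig(
    output_dir="qwen2.5-72b-instruct-egida-dpo",
    learning_rate=1e-6,             # LR reported above
    per_device_train_batch_size=1,  # the global batch of 63 comes from parallelism
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```
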
## Performance

### Safety Performance (Attack Success Ratio)

|                                 | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|---------------------------------|:--------------:|:--------:|:------------:|:-----------:|
| Qwen-2.5-72B-Instruct           | 0.235          | 0.051    | 0.329        | 0.050       |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.125          | 0.042    | 0.210        | 0.019       |

### General Purpose Performance

|                                 | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|---------------------------------|:-------------------------------:|:--------------------------:|
| Qwen-2.5-72B-Instruct           | 0.618                           | 0.771                      |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.620                           | 0.768                      |

### Refusal Ratio

|                                 | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
|---------------------------------|:------------------------:|:-------------------------:|
| Qwen-2.5-72B-Instruct           | 0.015                    | 0.102                     |
| Qwen-2.5-72B-Instruct-Egida-DPO | 0.016                    | 0.170                     |

Note that the refusal ratio is computed by keyword matching against a curated list of refusal keywords; see the paper for details.
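
To make the metric concrete, a keyword-matching refusal check could look like the sketch below. The keyword list is purely illustrative; the curated list actually used is described in the paper and not reproduced on this card.

```python
# Illustrative refusal detection via keyword matching.
# NOTE: this keyword list is a made-up example, not the paper's curated list.
REFUSAL_KEYWORDS = [
    "i can't", "i cannot", "i'm sorry", "i am unable", "as an ai",
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)

def refusal_ratio(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```
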
## Environmental Impact

## Citation Information

```
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603},
}
```