HPAI-BSC/Qwen2.5-72B-Instruct-Egida-DPO


Improve language tag

by lbourdois - opened Apr 28

base: refs/heads/main ← from: refs/pr/1


README.md +108 -96


@@ -1,97 +1,109 @@
----
-license: apache-2.0
-datasets:
-- HPAI-BSC/Egida
-language:
-- en
-base_model:
-- Qwen/Qwen2.5-72B-Instruct
-tags:
-- safety
----
-<div align="center" style="line-height: 1;">
-  <a href="https://arxiv.org/abs/2502.13603" target="_blank" style="margin: 2px;">
-    <img alt="Paper" src="https://img.shields.io/badge/arXiv-2502.13603-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://huggingface.co/collections/HPAI-BSC/egida-llm-safety-67b5b15d12bc9887d0045598" target="_blank" style="margin: 2px;">
-    <img alt="Egida Collection" src="https://img.shields.io/badge/Egida_Collection-Hugging%20Face-FFD21E?logo=huggingface" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://hpai.bsc.es/" target="_blank" style="margin: 2px;">
-    <img alt="HPAI Website" src="https://img.shields.io/badge/HPAI-Website-blue" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://www.linkedin.com/company/hpai" target="_blank" style="margin: 2px;">
-    <img alt="LinkedIn" src="https://custom-icon-badges.demolab.com/badge/LinkedIn-0A66C2?logo=linkedin-white&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://bsky.app/profile/hpai.bsky.social" target="_blank" style="margin: 2px;">
-    <img alt="Bluesky" src="https://img.shields.io/badge/Bluesky-0285FF?logo=bluesky&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-</div>
-## Model Description
-- **Fine-Tuned from Model:** [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct)
-- **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
-- **Point of Contact:** [Adrián Tormos](mailto:[email protected])
-## Model Summary
-This model is Qwen2.5-72B-Instruct fine-tuned on the [Egida-DPO-Qwen2.5-72B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-72B-Instruct) dataset.
-The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts that are thought to elicit unsafe behaviors from language models. In this case, the Egida train split is used to run inference on Qwen2.5-72B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO
-dataset for this model. This results in a DPO dataset composed of triplets < "question", "chosen answer", "discarded answer" > that contain questions eliciting unsafe responses from this target model, as well as the unsafe responses it produced.
-## Training Details
-- **Hardware:** NVIDIA H100 64 GB GPUs
-- **Devices:** 64 GPUs (16 nodes)
-- **Time:** 10.23h
-- **Batch Size:** 63
-- **LR:** 10^-6
-## Performance
-### Safety Performance (Attack Success Ratio)
-|                              | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
-|------------------------------|:--------------:|:--------:|:------------:|:-----------:|
-| Qwen-2.5-72B-Instruct        |     0.235      |  0.051   |    0.329     |    0.050    |
-| Qwen-2.5-72B-Instruct-Egida-DPO       |     0.125      |  0.042   |    0.210     |    0.019    |
-### General Purpose Performance
-|                              | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
-|------------------------------|:---------------------:|:---------------:|
-| Qwen-2.5-72B-Instruct        |         0.618         |      0.771      |
-| Qwen-2.5-72B-Instruct-Egida-DPO       |         0.620         |      0.768      |
-### Refusal Ratio
-|                              | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
-|------------------------------|:---------------------:|:---------------:|
-| Qwen-2.5-72B-Instruct         |          0.015           |           0.102           |
-| Qwen-2.5-72B-Instruct-Egida-DPO        |          0.016           |           0.170           |
-Note that this refusal ratio is computed as keyword matching with a curated list of keywords. For more information, check the paper.
-## Environmental Impact
-## Citation Information
-```
-@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
-      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
-      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
-      year={2025},
-      eprint={2502.13603},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2502.13603},
-}
 ```

+---
+license: apache-2.0
+datasets:
+- HPAI-BSC/Egida
+language:
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
+base_model:
+- Qwen/Qwen2.5-72B-Instruct
+tags:
+- safety
+---
+<div align="center" style="line-height: 1;">
+  <a href="https://arxiv.org/abs/2502.13603" target="_blank" style="margin: 2px;">
+    <img alt="Paper" src="https://img.shields.io/badge/arXiv-2502.13603-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+  <a href="https://huggingface.co/collections/HPAI-BSC/egida-llm-safety-67b5b15d12bc9887d0045598" target="_blank" style="margin: 2px;">
+    <img alt="Egida Collection" src="https://img.shields.io/badge/Egida_Collection-Hugging%20Face-FFD21E?logo=huggingface" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+  <a href="https://hpai.bsc.es/" target="_blank" style="margin: 2px;">
+    <img alt="HPAI Website" src="https://img.shields.io/badge/HPAI-Website-blue" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+  <a href="https://www.linkedin.com/company/hpai" target="_blank" style="margin: 2px;">
+    <img alt="LinkedIn" src="https://custom-icon-badges.demolab.com/badge/LinkedIn-0A66C2?logo=linkedin-white&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+  <a href="https://bsky.app/profile/hpai.bsky.social" target="_blank" style="margin: 2px;">
+    <img alt="Bluesky" src="https://img.shields.io/badge/Bluesky-0285FF?logo=bluesky&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+</div>
+## Model Description
+- **Fine-Tuned from Model:** [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct)
+- **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
+- **Point of Contact:** [Adrián Tormos](mailto:[email protected])
+## Model Summary
+This model is Qwen2.5-72B-Instruct fine-tuned on the [Egida-DPO-Qwen2.5-72B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-72B-Instruct) dataset.
+The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts that are thought to elicit unsafe behaviors from language models. In this case, the Egida train split is used to run inference on Qwen2.5-72B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO
+dataset for this model. This results in a DPO dataset composed of triplets < "question", "chosen answer", "discarded answer" > that contain questions eliciting unsafe responses from this target model, as well as the unsafe responses it produced.
+## Training Details
+- **Hardware:** NVIDIA H100 64 GB GPUs
+- **Devices:** 64 GPUs (16 nodes)
+- **Time:** 10.23h
+- **Batch Size:** 63
+- **LR:** 10^-6
+## Performance
+### Safety Performance (Attack Success Ratio)
+|                              | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
+|------------------------------|:--------------:|:--------:|:------------:|:-----------:|
+| Qwen-2.5-72B-Instruct        |     0.235      |  0.051   |    0.329     |    0.050    |
+| Qwen-2.5-72B-Instruct-Egida-DPO       |     0.125      |  0.042   |    0.210     |    0.019    |
+### General Purpose Performance
+|                              | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
+|------------------------------|:---------------------:|:---------------:|
+| Qwen-2.5-72B-Instruct        |         0.618         |      0.771      |
+| Qwen-2.5-72B-Instruct-Egida-DPO       |         0.620         |      0.768      |
+### Refusal Ratio
+|                              | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
+|------------------------------|:---------------------:|:---------------:|
+| Qwen-2.5-72B-Instruct         |          0.015           |           0.102           |
+| Qwen-2.5-72B-Instruct-Egida-DPO        |          0.016           |           0.170           |
+Note that this refusal ratio is computed as keyword matching with a curated list of keywords. For more information, check the paper.
+## Environmental Impact
+## Citation Information
+```
+@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
+      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
+      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
+      year={2025},
+      eprint={2502.13603},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2502.13603},
+}
 ```
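
The attack success ratios in the Safety Performance table of the README are easier to compare as relative reductions. The following is a minimal sketch in Python that uses only the values listed in that table; the helper name `relative_reduction` is illustrative and does not come from the model card or the paper.

```python
# Attack success ratios copied from the "Safety Performance" table in README.md.
BASE = {"Egida (test)": 0.235, "DELPHI": 0.051, "Alert-Base": 0.329, "Alert-Adv": 0.050}
DPO = {"Egida (test)": 0.125, "DELPHI": 0.042, "Alert-Base": 0.210, "Alert-Adv": 0.019}


def relative_reduction(before: float, after: float) -> float:
    """Return the fraction by which `after` is lower than `before` (0.5 means halved)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# One relative reduction per benchmark, keyed by the table's column names.
reductions = {name: relative_reduction(BASE[name], DPO[name]) for name in BASE}

if __name__ == "__main__":
    for name, value in reductions.items():
        print(f"{name}: {value:.1%} lower attack success ratio")
```

The same helper can be applied to the Refusal Ratio table, where the OR Bench Hard value moves in the opposite direction (0.102 to 0.170), so the result would be negative.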