---
license: apache-2.0
datasets:
- HPAI-BSC/Egida
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
tags:
- safety
---

<div align="center" style="line-height: 1;">
  <a href="https://arxiv.org/abs/2502.13603" target="_blank" style="margin: 2px;">
    <img alt="Paper" src="https://img.shields.io/badge/arXiv-2502.13603-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/collections/HPAI-BSC/egida-llm-safety-67b5b15d12bc9887d0045598" target="_blank" style="margin: 2px;">
    <img alt="Egida Collection" src="https://img.shields.io/badge/Egida_Collection-Hugging%20Face-FFD21E?logo=huggingface" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://hpai.bsc.es/" target="_blank" style="margin: 2px;">
    <img alt="HPAI Website" src="https://img.shields.io/badge/HPAI-Website-blue" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://www.linkedin.com/company/hpai" target="_blank" style="margin: 2px;">
    <img alt="LinkedIn" src="https://custom-icon-badges.demolab.com/badge/LinkedIn-0A66C2?logo=linkedin-white&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://bsky.app/profile/hpai.bsky.social" target="_blank" style="margin: 2px;">
    <img alt="Bluesky" src="https://img.shields.io/badge/Bluesky-0285FF?logo=bluesky&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

## Model Description

- **Fine-Tuned from Model:** [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
- **Point of Contact:** [Adrián Tormos](mailto:[email protected])


## Model Summary

This model is Qwen2.5-7B-Instruct fine-tuned on the [Egida-DPO-Qwen2.5-7B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-7B-Instruct) dataset.

The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts that are thought to elicit unsafe behaviors from language models. For this model, Qwen2.5-7B-Instruct is run on the Egida train split; its unsafe answers are selected and paired with safe answers to build a DPO dataset customized for this model. The resulting dataset consists of triplets <"question", "chosen answer", "discarded answer">, which contain the questions that elicit unsafe responses from the target model together with the unsafe responses it produced.
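The construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the callables `generate`, `is_unsafe`, and `safe_answer` are hypothetical placeholders for the target model, a safety judge, and a source of safe answers, none of which are specified in this card.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PreferenceTriplet:
    question: str
    chosen: str     # safe answer
    discarded: str  # unsafe answer produced by the target model


def build_dpo_triplets(
    prompts: Iterable[str],
    generate: Callable[[str], str],     # hypothetical: target model inference
    is_unsafe: Callable[[str], bool],   # hypothetical: safety judge
    safe_answer: Callable[[str], str],  # hypothetical: source of safe answers
) -> List[PreferenceTriplet]:
    """Keep only prompts whose model answer is judged unsafe and pair
    each unsafe answer with a safe one."""
    triplets = []
    for prompt in prompts:
        answer = generate(prompt)
        if is_unsafe(answer):
            triplets.append(PreferenceTriplet(prompt, safe_answer(prompt), answer))
    return triplets


# Toy stand-ins so the sketch runs without a model.
toy_prompts = ["prompt A", "prompt B", "prompt C"]
toy_answers = {
    "prompt A": "UNSAFE reply",
    "prompt B": "harmless reply",
    "prompt C": "UNSAFE reply",
}

triplets = build_dpo_triplets(
    toy_prompts,
    generate=toy_answers.__getitem__,
    is_unsafe=lambda answer: answer.startswith("UNSAFE"),
    safe_answer=lambda prompt: "I can't help with that.",
)
```

In this toy run only the two prompts with unsafe answers yield triplets; prompts that the model already answers safely contribute nothing to the preference data.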

## Training Details

- **Hardware:** NVIDIA H100 64 GB GPUs
- **Devices:** 4 GPUs (1 node)
- **Time:** 1.59h
- **Batch Size:** 8
- **LR:** 1e-7

## Performance

### Safety Performance (Attack Success Ratio)

|                              | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|------------------------------|:--------------:|:--------:|:------------:|:-----------:|
| Qwen-2.5-7B-Instruct         |     0.471      |  0.138   |    0.544     |    0.080    |
| Qwen-2.5-7B-Instruct-Egida-DPO        |     0.322      |  0.118   |    0.410     |    0.045    |
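To make the size of the change easier to read, the relative reduction in attack success ratio can be computed directly from the table above. The snippet is a small convenience sketch: the values are copied from the table, and the helper name `relative_reduction` is chosen here for illustration.

```python
# Attack success ratios copied from the table above (lower is better).
BASE = {"Egida (test)": 0.471, "DELPHI": 0.138, "Alert-Base": 0.544, "Alert-Adv": 0.080}
TUNED = {"Egida (test)": 0.322, "DELPHI": 0.118, "Alert-Base": 0.410, "Alert-Adv": 0.045}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original attack success ratio that was removed."""
    return (before - after) / before


reductions = {name: relative_reduction(BASE[name], TUNED[name]) for name in BASE}

for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
```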

### General Purpose Performance

|                              | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|------------------------------|:---------------------:|:---------------:|
| Qwen-2.5-7B-Instruct         |         0.488         |      0.331      |
| Qwen-2.5-7B-Instruct-Egida-DPO        |         0.488         |      0.296      |

### Refusal Ratio

|                              | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
|------------------------------|:---------------------:|:---------------:|
| Qwen-2.5-7B-Instruct         |          0.021           |           0.175           |
| Qwen-2.5-7B-Instruct-Egida-DPO        |          0.029           |           0.240           |

Note that the refusal ratio is computed by keyword matching against a curated list of keywords. For more information, see the paper.
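A keyword-based refusal ratio of this kind can be sketched as follows. The keyword list here is a hypothetical example, not the curated list from the paper, and the case-insensitive substring match is an assumption about how matching is done.

```python
from typing import Iterable, Sequence

# Hypothetical keywords for illustration; the paper uses its own curated list.
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str, keywords: Sequence[str] = REFUSAL_KEYWORDS) -> bool:
    """Flag a response as a refusal if it contains any keyword (case-insensitive)."""
    text = response.lower()
    return any(keyword in text for keyword in keywords)


def refusal_ratio(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals; 0.0 for an empty input."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


sample = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a recipe for lentil soup.",
    "I cannot assist with this request.",
    "Paris is the capital of France.",
]
ratio = refusal_ratio(sample)
```

Keyword matching is cheap but approximate: it can miss refusals phrased in unexpected ways and can flag helpful answers that happen to contain a listed phrase.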


## Environmental Impact


## Citation Information


```bibtex
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs}, 
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603}, 
}
```