---
license: apache-2.0
datasets:
- HPAI-BSC/Egida
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
<div align="center" style="line-height: 1;">
<a href="https://arxiv.org/abs/2502.13603" target="_blank" style="margin: 2px;">
<img alt="Paper" src="https://img.shields.io/badge/arXiv-2502.13603-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/collections/HPAI-BSC/egida-llm-safety-67b5b15d12bc9887d0045598" target="_blank" style="margin: 2px;">
<img alt="Egida Collection" src="https://img.shields.io/badge/Egida_Collection-Hugging%20Face-FFD21E?logo=huggingface" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://hpai.bsc.es/" target="_blank" style="margin: 2px;">
<img alt="HPAI Website" src="https://img.shields.io/badge/HPAI-Website-blue" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://www.linkedin.com/company/hpai" target="_blank" style="margin: 2px;">
<img alt="LinkedIn" src="https://custom-icon-badges.demolab.com/badge/LinkedIn-0A66C2?logo=linkedin-white&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://bsky.app/profile/hpai.bsky.social" target="_blank" style="margin: 2px;">
<img alt="Bluesky" src="https://img.shields.io/badge/Bluesky-0285FF?logo=bluesky&logoColor=fff" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
## Model Description
- **Fine-Tuned from Model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
- **Point of Contact:** [Adrián Tormos](mailto:[email protected])
## Model Summary
This model is [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) fine-tuned on the [Egida-DPO-Llama-3.1-8B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Meta-Llama-3.1-8B-Instruct) dataset.
The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Llama-3.1-8B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO
dataset for this model. The result is a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, containing the questions that elicit unsafe responses from this target model together with the unsafe responses it produced.
## Performance
### Safety Performance (Attack Success Ratio)
| | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|------------------------------|:--------------:|:--------:|:------------:|:-----------:|
| Meta-Llama-3.1-8B-Instruct | 0.347 | 0.160 | 0.446 | 0.039 |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.038 | 0.025 | 0.038 | 0.014 |
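The Attack Success Ratio is the fraction of adversarial prompts for which the model produces an unsafe response, as flagged by a safety judge. A minimal sketch of the computation, with the judge left as a placeholder:
```python
# Sketch: Attack Success Ratio = unsafe responses / total prompts.
def attack_success_ratio(prompts, responses, is_unsafe) -> float:
    """`is_unsafe` is a placeholder safety judge: (prompt, response) -> bool."""
    unsafe = sum(is_unsafe(p, r) for p, r in zip(prompts, responses))
    return unsafe / len(prompts)
```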
### General Purpose Performance
| | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|------------------------------|:---------------------:|:---------------:|
| Meta-Llama-3.1-8B-Instruct | 0.453 | 0.646 |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.453 | 0.643 |
### Refusal Ratio
| | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
|------------------------------|:---------------------:|:---------------:|
| Meta-Llama-3.1-8B-Instruct | 0.035 | 0.324 |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.037 | 0.319 |
Note that the refusal ratio is computed via keyword matching against a curated list of refusal keywords. For more details, see the paper.
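As a rough illustration, keyword-based refusal detection can look like the sketch below. The keyword list here is a placeholder, not the curated list from the paper.
```python
# Illustrative keyword-matching refusal detector; the keyword list is an
# assumption and differs from the curated list used in the paper.
REFUSAL_KEYWORDS = [
    "i can't", "i cannot", "i'm sorry", "i am sorry",
    "i won't", "as an ai", "i'm not able to",
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(kw in text for kw in REFUSAL_KEYWORDS)

def refusal_ratio(responses) -> float:
    return sum(is_refusal(r) for r in responses) / len(responses)
```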
## Training Details
- **Hardware:** NVIDIA H100 64 GB GPUs
- **Devices:** 4 GPUs (1 node)
- **Time:** 1.59h
- **Batch Size:** 8
- **Learning Rate:** 1e-7
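For orientation, a minimal DPO fine-tuning sketch using TRL's `DPOTrainer` is shown below. Only the batch size and learning rate come from the list above; the dataset config name, per-device batch split, epoch count, and the training stack itself are assumptions, not the authors' actual configuration.
```python
# Minimal DPO fine-tuning sketch with TRL (hyperparameters beyond batch size
# and learning rate are assumptions, not the authors' exact configuration).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DPO dataset with "prompt", "chosen", "rejected" columns, as described above.
# Config/split names are assumptions.
train_dataset = load_dataset(
    "HPAI-BSC/Egida", "Egida-DPO-Meta-Llama-3.1-8B-Instruct", split="train"
)

args = DPOConfig(
    output_dir="llama-3.1-8b-instruct-egida-dpo",
    per_device_train_batch_size=2,  # 2 x 4 GPUs = global batch of 8 (split assumed)
    learning_rate=1e-7,             # as listed above
    num_train_epochs=1,             # assumption
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL versions
)
trainer.train()
```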
## Citation Information
```
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
year={2025},
eprint={2502.13603},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13603},
}
``` |