---
license: apache-2.0
datasets:
- HPAI-BSC/Egida
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
## Model Description

- **Fine-Tuned from Model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
- **Point of Contact:** [Adrián Tormos](mailto:adrian.tormos@bsc.es)

## Model Summary

This is a Llama-3.1-8B-Instruct model fine-tuned on the [Egida-DPO-Llama-3.1-8B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Meta-Llama-3.1-8B-Instruct) dataset.

The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. Specifically for this case, the Egida train split is used to run inference on Llama-3.1-8B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO dataset for this model. This results in a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, which contain questions that elicit unsafe responses from this target model, as well as the unsafe responses it produced.
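The triplet-construction step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the `is_unsafe` placeholder stands in for the safety judge used in the paper, and `SAFE_RESPONSE` stands in for wherever the safe "chosen" answers come from — both are assumptions.

```python
# Hypothetical sketch of building the DPO triplets described above.
# `is_unsafe` is a stand-in for the real safety judge, and
# SAFE_RESPONSE a stand-in for the paired safe answer.

def is_unsafe(answer: str) -> bool:
    # Placeholder judge: the real pipeline uses a proper safety classifier.
    return "here is how" in answer.lower()

SAFE_RESPONSE = "I can't help with that request."

def build_dpo_triplets(prompts, model_answers):
    """Pair each prompt whose model answer was judged unsafe with a safe
    'chosen' answer and the unsafe 'rejected' answer, DPO-style."""
    triplets = []
    for question, answer in zip(prompts, model_answers):
        if is_unsafe(answer):
            triplets.append({
                "prompt": question,
                "chosen": SAFE_RESPONSE,
                "rejected": answer,
            })
    return triplets
```

Only prompts that actually elicited an unsafe answer from the target model end up in the dataset, which is what makes the resulting DPO data model-specific.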
## Performance

### Safety Performance (Attack Success Ratio)

|                                      | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|--------------------------------------|:--------------:|:--------:|:------------:|:-----------:|
| Meta-Llama-3.1-8B-Instruct           | 0.347          | 0.160    | 0.446        | 0.039       |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.038          | 0.025    | 0.038        | 0.014       |

### General Purpose Performance

|                                      | OpenLLM Leaderboard (Average) ↑ | MMLU Generative (ROUGE1) ↑ |
|--------------------------------------|:-------------------------------:|:--------------------------:|
| Meta-Llama-3.1-8B-Instruct           | 0.453                           | 0.646                      |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.453                           | 0.643                      |

### Refusal Ratio

|                                      | OR Bench 80K (refusal) ↓ | OR Bench Hard (refusal) ↓ |
|--------------------------------------|:------------------------:|:-------------------------:|
| Meta-Llama-3.1-8B-Instruct           | 0.035                    | 0.324                     |
| Meta-Llama-3.1-8B-Instruct-Egida-DPO | 0.037                    | 0.319                     |

Note that the refusal ratio is computed by keyword matching against a curated list of refusal keywords. For more information, check the paper.

## Training Details

- **Hardware:** NVIDIA H100 64 GB GPUs
- **Devices:** 4 GPUs (1 node)
- **Time:** 1.59 h
- **Batch Size:** 8
- **Learning Rate:** 1e-7

## Environmental Impact

## Citation Information

```
@misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
      title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
      author={Dario Garcia-Gasulla and Adrian Tormos and Anna Arias-Duart and Daniel Hinjos and Oscar Molina-Sedano and Ashwin Kumar Gururajan and Maria Eugenia Cardello},
      year={2025},
      eprint={2502.13603},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13603},
}
```
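The keyword-matching refusal check used for the refusal-ratio numbers can be sketched as follows. This is a minimal illustration only: the keyword list below is an assumption, not the curated list from the paper.

```python
# Hypothetical sketch of a keyword-based refusal check.
# REFUSAL_KEYWORDS is illustrative, not the paper's curated list.

REFUSAL_KEYWORDS = [
    "i can't", "i cannot", "i'm sorry", "i am sorry",
    "as an ai", "i won't", "i will not",
]

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any refusal keyword."""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)

def refusal_ratio(responses) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

A lower refusal ratio on benign benchmarks such as OR Bench indicates the safety tuning did not make the model over-refuse harmless requests.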