danihinjos committed dcec3c0 (verified · parent: 01508fc)

Update README.md

Files changed (1): README.md (+27 −1)
README.md CHANGED
@@ -2,7 +2,33 @@
  license: apache-2.0
  ---
 
- ### Citation Information
+ ## Model Description
+
+ - **Fine-Tuned from Model:** [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
+ - **Paper:** [Efficient Safety Retrofitting Against Jailbreaking for LLMs](https://arxiv.org/abs/2502.13603)
+ - **Point of Contact:** [Adrián Tormos](mailto:[email protected])
+
+
+ ## Model Summary
+
+ This is a Llama-3.1-70B-Instruct model fine-tuned on the [Egida-DPO-Llama-3.1-70B-Instruct](http://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Meta-Llama-3.1-70B-Instruct) dataset.
+
+ The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts intended to elicit unsafe behaviors from language models. Specifically for this case, the Egida train split is used to run inference on Llama-3.1-70B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO dataset for this model. This results in a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, which contain questions that elicit unsafe responses from this target model, as well as the unsafe responses it produced.
+
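These triplets map directly onto the preference format used by DPO training libraries. As a minimal sketch for inspecting them, assuming the config name shown in the dataset viewer URL above, a `train` split, and the conventional `prompt`/`chosen`/`rejected` column names (none of which this card confirms):

```python
# Minimal sketch: inspect the Egida DPO triplets for this model.
# Assumptions (not confirmed by the card): the config name matches the
# dataset viewer URL, a "train" split exists, and the columns follow the
# usual DPO naming ("prompt", "chosen", "rejected").
from datasets import load_dataset

ds = load_dataset(
    "HPAI-BSC/Egida",
    "Egida-DPO-Meta-Llama-3.1-70B-Instruct",
    split="train",
)

row = ds[0]
print(row["prompt"])    # adversarial question that elicited an unsafe answer
print(row["chosen"])    # safe answer, preferred during DPO
print(row["rejected"])  # unsafe answer produced by the target model
```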
+ ## Training Details
+
+ - **Hardware:** NVIDIA H100 64 GB GPUs
+ - **Devices:** 64 GPUs (16 nodes)
+ - **Time:** 10.23 h
+ - **Batch Size:** 64
+ - **LR:** 10⁻⁶
+
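For concreteness, a DPO fine-tuning setup consistent with these hyperparameters could look like the sketch below, using TRL; the card does not state the training framework, and values such as `beta`, the number of epochs, and precision are assumptions:

```python
# Minimal sketch of a DPO run matching the listed hyperparameters
# (LR 1e-6, global batch size 64 across 64 GPUs). TRL as the framework,
# beta, epochs, and bf16 are assumptions not stated in the card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # multi-GPU sharding omitted

train_ds = load_dataset(
    "HPAI-BSC/Egida",
    "Egida-DPO-Meta-Llama-3.1-70B-Instruct",
    split="train",
)

args = DPOConfig(
    output_dir="llama-3.1-70b-instruct-egida-dpo",
    learning_rate=1e-6,             # LR from the card
    per_device_train_batch_size=1,  # 1 per device x 64 GPUs = global batch size 64
    num_train_epochs=1,             # assumption
    beta=0.1,                       # assumption (TRL default)
    bf16=True,                      # assumption
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```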
+ ## Environmental Impact
+
+
+ ## Citation Information
+
 
  ```
  @misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,