The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. For this model, the Egida train split is used to run inference on Llama-3.1-70B-Instruct. Unsafe answers are selected and paired with safe answers to build a customized DPO dataset: triplets <"question", "chosen answer", "discarded answer"> containing questions that elicit unsafe responses from the target model, along with the unsafe responses it produced.
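
A minimal sketch of loading Egida and of the triplet layout, assuming the `datasets` library; the `"Egida"` config name is taken from the viewer URL above, and the field names (`prompt`, `chosen`, `rejected`) follow common DPO tooling conventions rather than a documented schema:

```python
from datasets import load_dataset

# Load the public Egida prompts ("Egida" config name assumed from the viewer URL).
egida = load_dataset("HPAI-BSC/Egida", "Egida", split="train")

# Each DPO example pairs an adversarial question with a safe ("chosen") answer
# and the unsafe ("rejected") answer produced by the target model.
example = {
    "prompt": "<adversarial question from Egida>",
    "chosen": "<safe answer>",
    "rejected": "<unsafe answer produced by the target model>",
}
```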

## Performance

### Safety Performance (Attack Success Ratio)

|                             | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|-----------------------------|:--------------:|:--------:|:------------:|:-----------:|
| Meta-Llama-3.1-8B-Instruct  | 0.347          | 0.160    | 0.446        | 0.039       |
| Meta-Llama-3.1-8B-Egida-DPO | 0.038          | 0.025    | 0.038        | 0.014       |
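
Attack Success Ratio is the fraction of adversarial prompts for which the model returns an unsafe response, so lower is better. A minimal sketch of the computation, assuming per-prompt boolean safety judgments from an external judge:

```python
def attack_success_ratio(judgements: list[bool]) -> float:
    """Fraction of adversarial prompts whose response was judged unsafe."""
    return sum(judgements) / len(judgements)

# Example: 38 unsafe responses out of 1000 prompts -> ASR = 0.038.
```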

### General Purpose Performance

|                             | OpenLLM Leaderboard (Average) ↑ | MMLU (ROUGE1) ↑ |
|-----------------------------|:-------------------------------:|:---------------:|
| Meta-Llama-3.1-8B-Instruct  | 0.453                           | 0.646           |
| Meta-Llama-3.1-8B-Egida-DPO | 0.453                           | 0.643           |

## Training Details

- **Hardware:** NVIDIA H100 64 GB GPUs
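
For reference, a minimal sketch of a DPO fine-tune of this kind using TRL's `DPOTrainer`; the dataset path, `beta` value, and output directory are illustrative assumptions, not the exact recipe used to train this model:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A DPO dataset of <prompt, chosen, rejected> triplets (assumed path; see the
# dataset description above).
train_dataset = load_dataset("HPAI-BSC/Egida", "Egida", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="llama-3.1-8b-egida-dpo", beta=0.1),  # beta is an assumed value
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```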