The [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) dataset is a collection of adversarial prompts designed to elicit unsafe behaviors from language models. Specifically for this case, the Egida train split is used to run inference on Llama-3.1-8B-Instruct. Unsafe answers are selected and paired with safe answers to create a customized DPO dataset for this model. The result is a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer">, containing questions that elicit unsafe responses from this target model, along with the unsafe responses it produced.
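
As an illustrative sketch only, such triplets map naturally onto the `prompt` / `chosen` / `rejected` layout commonly used for DPO datasets in the Hugging Face ecosystem; these field names are an assumption for illustration, not necessarily the released dataset's exact schema.

```python
from datasets import Dataset

# Hypothetical records in a common DPO triplet layout. Field names and
# contents are illustrative assumptions, not the released dataset's schema.
triplets = [
    {
        "prompt": "<adversarial question from the Egida train split>",
        "chosen": "I can't help with that request.",            # safe answer
        "rejected": "<unsafe answer produced by the target model>",
    },
]

dpo_dataset = Dataset.from_list(triplets)
print(dpo_dataset)  # Dataset with features ['prompt', 'chosen', 'rejected']
```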

## Performance

### Safety Performance (Attack Success Ratio)

|                              | Egida (test) ↓ | DELPHI ↓ | Alert-Base ↓ | Alert-Adv ↓ |
|------------------------------|:--------------:|:--------:|:------------:|:-----------:|
| Meta-Llama-3.1-8B-Instruct   |     0.347      |  0.160   |    0.446     |    0.039    |
| Meta-Llama-3.1-8B-Egida-DPO  |     0.038      |  0.025   |    0.038     |    0.014    |
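
"Attack Success Ratio" is read here as the fraction of adversarial prompts whose response is judged unsafe, so lower is better. The card does not define the metric formally, so the minimal sketch below encodes that assumed reading:

```python
def attack_success_ratio(unsafe_flags: list[bool]) -> float:
    """Fraction of adversarial prompts whose response was judged unsafe.

    unsafe_flags[i] is True when the response to prompt i was labeled
    unsafe (e.g., by a safety classifier). Lower is better. This
    definition is an assumption made for illustration.
    """
    return sum(unsafe_flags) / len(unsafe_flags)

# Example: 38 unsafe responses over 1,000 prompts -> ASR = 0.038
print(attack_success_ratio([True] * 38 + [False] * 962))
```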

### General Purpose Performance

|                              | OpenLLM Leaderboard (Average) ↑ | MMLU (ROUGE1) ↑ |
|------------------------------|:-------------------------------:|:---------------:|
| Meta-Llama-3.1-8B-Instruct   |              0.453              |      0.646      |
| Meta-Llama-3.1-8B-Egida-DPO  |              0.453              |      0.643      |

## Training Details

- **Hardware:** NVIDIA H100 64 GB GPUs
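
For orientation, a minimal DPO fine-tuning sketch with TRL's `DPOTrainer` follows. The card does not state which training stack or hyperparameters were used, so the library choice, the dataset id, and all settings below are placeholder assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Hypothetical dataset id; substitute the actual Egida DPO dataset,
# which provides prompt/chosen/rejected triplets.
train_dataset = load_dataset("HPAI-BSC/Egida-DPO-Llama-3.1-8B-Instruct", split="train")

# Placeholder hyperparameters, not the authors' actual settings.
args = DPOConfig(
    output_dir="llama-3.1-8b-egida-dpo",
    beta=0.1,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # recent TRL API; older versions use `tokenizer=`
)
trainer.train()
```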