---
license: apache-2.0
datasets:
- PKU-Alignment/PKU-SafeRLHF
language:
- en
base_model:
- HuggingFaceH4/zephyr-7b-beta
pipeline_tag: text-generation
---

# BFPO

This repository contains the model presented in our paper *Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models*. It was obtained by fine-tuning [Zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) on the red-teaming data linked below.

## Project Details

This project aims to improve safety during the alignment of LLMs and to mitigate, at low cost, the potential conflict between safety and helpfulness.
Details of the BFPO framework can be found here:

- **Repository:** https://github.com/wx-zhang/bfpo
- **Paper:** https://arxiv.org/abs/2408.15313
- **Red-teaming data:** https://github.com/wx-zhang/bfpo/blob/main/data/zephyr-7b-beta_beaver_redteaming.json

## Uses

Here's how you can run the model using the `pipeline()` function from 🤗 Transformers:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Vision-CAIR/BFPO-redteaming-Zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Try a harmful prompt
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Which drug feels better cocaine or heroin?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=1.0, return_full_text=False)
print(outputs[0]["generated_text"])
# Model output (sampled with do_sample=True, so your output may differ):
# Individual experiences with drugs can vary greatly, and neither cocaine nor heroin is considered a safe or healthy choice.

# Try its instruction-following ability
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "python code for quick sort. (only return code, as short as possible)"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=1.0, return_full_text=False)
print(outputs[0]["generated_text"])
```
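
To compare the model's behaviour on several prompts in one run (for example, harmful and benign prompts side by side), the two calls in the example can be wrapped in a small loop. This is a minimal sketch, not code from the BFPO repository: the helper names `build_messages` and `run_prompts` are hypothetical, and `pipe` is assumed to be the text-generation pipeline created in the example.

```python
from typing import Any


def build_messages(user_prompt: str, system_prompt: str = "") -> list[dict[str, str]]:
    """Build the two-turn chat used in the examples: a system turn (empty by default) and a user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def run_prompts(pipe: Any, prompts: list[str], **generate_kwargs: Any) -> list[tuple[str, str]]:
    """Generate one reply per prompt and return (prompt, reply) pairs.

    `pipe` is assumed to be a Hugging Face Transformers text-generation pipeline.
    Extra keyword arguments (max_new_tokens, temperature, ...) are forwarded to it.
    `return_full_text` is always False, so only the newly generated text is kept.
    """
    results = []
    for user_prompt in prompts:
        # Render the chat with the model's own template and append the assistant prefix.
        prompt = pipe.tokenizer.apply_chat_template(
            build_messages(user_prompt), tokenize=False, add_generation_prompt=True
        )
        outputs = pipe(prompt, return_full_text=False, **generate_kwargs)
        results.append((user_prompt, outputs[0]["generated_text"]))
    return results
```

For example, `run_prompts(pipe, prompts, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=1.0)` returns one `(prompt, reply)` pair per entry in `prompts`. Replacing `do_sample=True, temperature=0.7` with `do_sample=False` switches to greedy decoding, which makes repeated runs easier to compare.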

## Citation

```bibtex
@inproceedings{zhang2025bifactorial,
  title={Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models},
  author={Wenxuan Zhang and Philip Torr and Mohamed Elhoseiny and Adel Bibi},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
}
```