---
license: apache-2.0
datasets:
- PKU-Alignment/PKU-SafeRLHF
language:
- en
base_model:
- HuggingFaceH4/zephyr-7b-beta
pipeline_tag: text-generation
---

# BFPO

<!-- Provide a quick summary of what the model is/does. -->

This repository contains a model based on [Zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), presented in our paper *Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models*.

The model is obtained by fine-tuning Zephyr-7b-beta on red-teaming data.

## Project Details

<!-- Provide a longer summary of what this model is. -->

This project aims to improve the safety of LLMs during alignment and to mitigate the potential conflict between safety and helpfulness at low cost.

Details of the BFPO framework can be found at:

- **Repository:** https://github.com/wx-zhang/bfpo
- **Paper:** https://arxiv.org/abs/2408.15313
- **Red teaming data:** https://github.com/wx-zhang/bfpo/blob/main/data/zephyr-7b-beta_beaver_redteaming.json
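
At a high level, BFPO folds two preference factors, helpfulness and safety, into a single supervised objective. The sketch below is purely illustrative and is **not** the objective from the paper (all names in it are hypothetical); it only shows the general shape of blending two preference factors into one DPO-style loss. See the repository above for the actual implementation.

```python
# Illustrative sketch only -- NOT the BFPO objective from the paper.
# It shows the general idea of blending two preference factors
# (helpfulness and safety) into a single supervised, DPO-style loss.
import torch
import torch.nn.functional as F

def blended_preference(help_margin, safety_margin, alpha=0.5):
    # Hypothetical labeling: positive when response A is preferred to B
    # once both factors are taken into account.
    return (1 - alpha) * help_margin + alpha * safety_margin

def preference_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, label, beta=0.1):
    # DPO-style logistic loss on the implicit reward margin between
    # responses A and B; `label` in {-1, +1} flips the preferred side.
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    return -F.logsigmoid(label * margin)
```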

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

Here's how you can run the model using the `pipeline()` function from 🤗 Transformers:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Vision-CAIR/BFPO-redteaming-Zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Try a harmful prompt.
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Which drug feels better, cocaine or heroin?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=1.0, return_full_text=False)
print(outputs[0]["generated_text"])
# Model output: Individual experiences with drugs can vary greatly, and neither
# cocaine nor heroin is considered a safe or healthy choice.

# Try its instruction-following ability.
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "python code for quick sort. (only return code, as short as possible)"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=1.0, return_full_text=False)
print(outputs[0]["generated_text"])
```
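
If you want more control over tokenization and decoding than `pipeline()` offers, the same prompts can be run with the lower-level 🤗 Transformers API. A minimal sketch, assuming the checkpoint ships the standard Zephyr chat template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vision-CAIR/BFPO-redteaming-Zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "python code for quick sort. (only return code, as short as possible)"},
]
# Tokenize with the model's chat template and generate.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=1.0)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```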

## Citation

```bibtex
@inproceedings{zhang2025bifactorial,
  title={Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models},
  author={Wenxuan Zhang and Philip Torr and Mohamed Elhoseiny and Adel Bibi},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
}
```