🦫 Beaver's Reward Model

Model Details

The Beaver reward model is a preference model trained on the PKU-SafeRLHF dataset. It is used in the Safe RLHF algorithm to steer the Beaver model toward more helpful responses.
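To make "preference model" concrete, here is a minimal sketch of how two reward scores can be turned into a preference probability. It assumes the usual pairwise (Bradley-Terry) formulation, in which the probability that response A is preferred over response B is the sigmoid of the score difference. The function name and the example scores are hypothetical and are not part of the safe_rlhf package.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B.

    Assumption: rewards are compared pairwise through a sigmoid of their difference.
    """
    diff = reward_a - reward_b
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


# Hypothetical end scores for two candidate replies to the same prompt.
p = preference_probability(2.0, 0.0)
print(f'P(A preferred over B) = {p:.3f}')
```

Only the difference between the two scores matters under this assumption, so the absolute value of a single reward carries no meaning on its own.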

Developed by: the PKU-Alignment Team.
Model Type: An auto-regressive language model based on the transformer architecture.
License: Non-commercial license.
Fine-tuned from model: LLaMA, Alpaca.

Model Sources

Repository: https://github.com/PKU-Alignment/safe-rlhf
Beaver: https://huggingface.co/PKU-Alignment/beaver-7b-v3.0
Dataset: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF
Reward Model: https://huggingface.co/PKU-Alignment/beaver-7b-v3.0-reward
Cost Model: https://huggingface.co/PKU-Alignment/beaver-7b-v3.0-cost
Dataset Paper: https://arxiv.org/abs/2307.04657
Paper: https://arxiv.org/abs/2310.12773

How to Use the Reward Model

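import torch
from transformers import AutoTokenizer
from safe_rlhf.models import AutoModelForScore

# Load the reward model in bfloat16; device_map='auto' places it on the available devices
model = AutoModelForScore.from_pretrained('PKU-Alignment/beaver-7b-v3.0-reward', torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('PKU-Alignment/beaver-7b-v3.0-reward')

# Conversation to score: one user turn followed by the assistant reply
text = 'BEGINNING OF CONVERSATION: USER: hello ASSISTANT:Hello! How can I help you today?'

# The tokenizer returns a dict of tensors (input_ids, attention_mask) that is unpacked into the model call
inputs = tokenizer(text, return_tensors='pt')
output = model(**inputs)
print(output)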

# ScoreModelOutput(
#     scores=tensor([[[-14.0000],
#          [ -2.6094],
#          [ -2.6562],
#          [ -2.0312],
#          [ -1.2188],
#          [ -1.6250],
#          [ -2.4688],
#          [ -2.7500],
#          [ -3.0000],
#          [ -6.0000],
#          [ -5.0625],
#          [ -7.0938],
#          [ -6.9688],
#          [ -4.3125],
#          [ -4.2188],
#          [ -3.7969],
#          [ -3.6875],
#          [ -3.3750],
#          [ -2.8906],
#          [ -3.9219],
#          [ -2.1406],
#          [ -1.7578],
#          [  0.4629],
#          [  2.1719],
#          [  4.4062],
#          [  7.1562],
#          [  7.7188],
#          [ 10.7500]]], grad_fn=<ToCopyBackward0>),
#     end_scores=tensor([[10.7500]], grad_fn=<ToCopyBackward0>),
#     last_hidden_state=tensor([[[ 0.4805, -0.4863, -0.9258,  ..., -0.0718,  0.8555,  0.6641],
#          [ 0.2021,  2.0156,  3.5156,  ..., -0.9844, -1.1484,  1.3203],
#          [ 1.0938,  1.4609,  1.7891,  ..., -3.2031, -0.8555, -1.2969],
#          ...,
#          [ 1.5859,  0.1348,  0.0322,  ..., -1.3672, -1.5234,  1.5156],
#          [ 0.9102,  0.6367, -0.8555,  ..., -1.2109, -0.6953,  1.5312],
#          [ 1.7188,  0.4434, -0.5586,  ..., -1.1484, -0.7461,  2.2031]]],
#        dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>),
#     end_last_hidden_state=tensor([[ 1.7188,  0.4434, -0.5586,  ..., -1.1484, -0.7461,  2.2031]],
#        dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>),
#     end_index=tensor([27])
# )
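
The printed output can be read as follows: `scores` holds one value per input token, and `end_scores` is the entry of `scores` at `end_index`, that is, the score of the last token, which summarizes the whole conversation. The sketch below reproduces that relationship with the numbers copied from the output above, so it runs without loading the model. It also shows a hypothetical helper, `format_conversation`, that rebuilds the input string; its template is inferred from the single example in this card and is an assumption, not a documented API.

```python
# Hypothetical template pieces, inferred from the example string in this card.
PROMPT_BEGIN = 'BEGINNING OF CONVERSATION: '
PROMPT_USER = 'USER: {question} '
PROMPT_ASSISTANT = 'ASSISTANT:'  # no trailing space, matching the example


def format_conversation(question: str, answer: str) -> str:
    """Join one user turn and one assistant reply into a single string to score."""
    return PROMPT_BEGIN + PROMPT_USER.format(question=question) + PROMPT_ASSISTANT + answer


text = format_conversation('hello', 'Hello! How can I help you today?')
print(text)

# Per-token scores copied from the printed ScoreModelOutput above.
token_scores = [
    -14.0000, -2.6094, -2.6562, -2.0312, -1.2188, -1.6250, -2.4688,
    -2.7500, -3.0000, -6.0000, -5.0625, -7.0938, -6.9688, -4.3125,
    -4.2188, -3.7969, -3.6875, -3.3750, -2.8906, -3.9219, -2.1406,
    -1.7578, 0.4629, 2.1719, 4.4062, 7.1562, 7.7188, 10.7500,
]
end_index = 27  # from end_index=tensor([27]) in the output

# The end score is the per-token score at the last position.
end_score = token_scores[end_index]
print(end_score)
```

With a real model output, the same value is available directly as `output.end_scores`, as the printout shows.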
