Model Details

  • SFT: based on meta-llama/Llama-2-7b-hf, trained on merged Alpaca datasets
  • DPO: trained on top of the SFT model as a LoRA adapter, with merged hh-rlhf data
  • PPO: trained on top of the DPO model and a reward model, using multiple adapters, with PKU-SafeRLHF data for further RLHF
  • Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash-Attention 2
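
To make the DPO stage more concrete, here is a minimal sketch of the standard per-pair DPO objective in plain Python. It is not the author's training code (the card states that TRL was used); the function name `dpo_loss` and the value `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference (SFT) model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    beta=0.1 is an illustrative value, not a setting taken from this model card.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow for large |x|
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# The loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(baseline, improved)
```

When the policy and reference agree, the loss equals log(2); it shrinks as the policy widens the chosen-vs-rejected margin relative to the reference.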

Model and Training Details

Training Results

[Figure: training results]

Evaluation

Reward and toxicity scores were computed on the PKU-Alignment/PKU-SafeRLHF-30K data and compared across the SFT, DPO, and PPO models.

Model      Toxicity   Reward
SFT_v0.1   0.0698     -0.2828
DPO_v0.1   0.0356     -0.2633
PPO_v0.1   0.0321      0.38
[Figure: evaluation results]
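
The reported scores can be compared against the SFT baseline with a few lines of Python. This is only a reading aid for the numbers above; the helper name `compare_to_baseline` and the choice of metrics (relative toxicity reduction, absolute reward gain) are assumptions made for illustration, not part of the original evaluation.

```python
# Scores copied from the evaluation results reported above
results = {
    "SFT_v0.1": {"toxicity": 0.0698, "reward": -0.2828},
    "DPO_v0.1": {"toxicity": 0.0356, "reward": -0.2633},
    "PPO_v0.1": {"toxicity": 0.0321, "reward": 0.38},
}

def compare_to_baseline(results, baseline="SFT_v0.1"):
    """Return relative toxicity reduction (%) and absolute reward gain vs. the baseline."""
    base = results[baseline]
    rows = {}
    for name, scores in results.items():
        if name == baseline:
            continue
        reduction = 100 * (base["toxicity"] - scores["toxicity"]) / base["toxicity"]
        rows[name] = {
            "toxicity_reduction_pct": round(reduction, 1),
            "reward_gain": round(scores["reward"] - base["reward"], 4),
        }
    return rows

for name, row in compare_to_baseline(results).items():
    print(name, row)
```

Relative to SFT, toxicity drops by about 49% for DPO and about 54% for PPO, while the reward score changes little after DPO and becomes positive only after PPO.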

Compute Infrastructure

The model was trained on 8 × RTX-3090-24GB/A100-PCIE-40GB GPUs.
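
As a rough back-of-the-envelope check on why 24 GB cards are workable, the sketch below estimates the memory taken by the base weights alone. It assumes about 6.74B parameters for Llama-2-7B and 4-bit base weights for QLoRA; the helper name `weight_memory_gb` is hypothetical, and the estimate ignores activations, optimizer state, LoRA parameters, and quantization overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 6.74e9  # approximate parameter count of Llama-2-7B

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # assumption: 4-bit quantized base weights

print(f"fp16 weights:  {fp16_gb:.2f} GB")
print(f"4-bit weights: {int4_gb:.2f} GB")
```

Under these assumptions the half-precision weights need roughly 13.5 GB and the 4-bit weights roughly 3.4 GB, which leaves headroom on a 24 GB GPU for adapters and activations.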

Inference

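import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id of this model on the Hugging Face Hub
model_name = "renyiyu/llama-2-7b-dpo-v0.1"
# Placeholder from the original card: set this to the EOS token used in training.
# "</s>" (the Llama-2 default) is only an assumption here.
DEFINE_EOS_TOKEN = "</s>"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Reuse the tokenizer's current EOS token for padding, then set the EOS token used for generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding; passing **inputs also forwards the attention mask
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output)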
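
Because the decoded output contains the prompt followed by the completion, it can help to strip the prompt before displaying the answer. The sketch below is one way to do that under the prompt format shown above; the helper `extract_answer` and the constant `ANSWER_MARKER` are hypothetical names introduced for illustration.

```python
ANSWER_MARKER = "###Answer:"

def format_prompt(question):
    # Same prompt template as in the inference snippet
    return f"###Question: {question}\n###Answer: "

def extract_answer(decoded_text):
    """Return the text after the first answer marker, or the whole text if no marker is found."""
    _, marker, answer = decoded_text.partition(ANSWER_MARKER)
    return answer.strip() if marker else decoded_text.strip()

# Simulated decoded output: the prompt followed by a made-up completion
decoded = format_prompt("What is LoRA?") + "A low-rank adapter method."
print(extract_answer(decoded))
```

Splitting on the first marker keeps any later occurrence of the marker inside the answer intact.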

Model Card Authors

Yiyu (Michael) Ren

Model Card Contact

Email: [email protected]

Framework versions

  • PEFT 0.8.2
Safetensors: model size 6.74B params, tensor type BF16

Model repository: renyiyu/llama-2-7b-dpo-v0.1