Model Details
- SFT: based on meta-llama/Llama-2-7b-hf, trained on merged Alpaca datasets
- DPO: trained on top of the SFT model as a LoRA adapter, with merged hh-rlhf data
- PPO: trained on top of the DPO model and a reward model, using multiple adapters, with PKU-SafeRLHF data for further RLHF
- Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash-Attention 2
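For readers unfamiliar with the DPO stage, the objective it optimizes can be sketched in a few lines of plain Python. This is a minimal illustration of the standard DPO loss for a single preference pair, not the training code used for this model. The function name `dpo_loss` and the default `beta=0.1` are assumptions made for the example; the card does not state the value used.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities.

    beta is an assumed illustrative value, not the setting used for this model.
    """
    # How much more the policy favors each response than the frozen reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one.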
Model and Training Details
Finetuned from model: meta-llama/Llama-2-7b-hf
Dataset:
- SFT (mixed train):
- DPO (mixed train):
- PPO:
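DPO training in TRL expects each record as a prompt with a chosen and a rejected completion, while hh-rlhf stores two full dialogues per record. The sketch below shows one common way to split such a record at the last assistant turn. The helper name `split_hh_rlhf` is hypothetical, and whether this model's preprocessing worked this way is an assumption; the card does not describe it.

```python
def split_hh_rlhf(example, marker="\n\nAssistant:"):
    """Turn one hh-rlhf style record into a prompt/chosen/rejected dict."""
    chosen, rejected = example["chosen"], example["rejected"]
    # The prompt is everything up to and including the last assistant marker
    cut = chosen.rfind(marker)
    if cut == -1:
        raise ValueError("no assistant turn found")
    cut += len(marker)
    prompt = chosen[:cut]
    if not rejected.startswith(prompt):
        raise ValueError("chosen and rejected do not share a prompt")
    return {"prompt": prompt, "chosen": chosen[cut:], "rejected": rejected[cut:]}
```

Records whose two dialogues diverge before the final assistant turn raise an error instead of being split silently, so they can be inspected or dropped.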
Training Results
Evaluation
Reward and toxicity scores are computed on the PKU-Alignment/PKU-SafeRLHF-30K data and compared across the SFT, DPO, and PPO models.
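A comparison of this kind amounts to averaging per-prompt scores for each model stage. The following is a minimal sketch under that assumption. The `summarize` helper and the input layout are hypothetical, and the numbers are placeholders that only show the expected shape; they are not results from these models.

```python
from statistics import mean

def summarize(scores_by_model):
    """Average per-prompt reward and toxicity scores for each model stage."""
    return {
        name: {"reward": mean(s["reward"]), "toxicity": mean(s["toxicity"])}
        for name, s in scores_by_model.items()
    }

# Placeholder numbers only, to show the data layout; NOT measured results
toy_scores = {
    "SFT": {"reward": [0.0, 1.0], "toxicity": [0.5, 0.5]},
    "DPO": {"reward": [1.0, 2.0], "toxicity": [0.25, 0.25]},
}
summary = summarize(toy_scores)
```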
Compute Infrastructure
The model is trained on 8 x RTX-3090-24GB/A100-PCIE-40GB GPUs
Inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub repository id of this model
model_name = "renyiyu/llama-2-7b-dpo-v0.1"
# DEFINE_EOS_TOKEN is a placeholder: define it as the EOS token string before running
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
inputs = tokenizer(prompt, return_tensors='pt')
# Greedy decoding; passing the full encoding also supplies the attention mask
output = model.generate(**inputs, max_new_tokens=512, do_sample=False, top_p=1)
output = tokenizer.decode(output[0], skip_special_tokens=True)
print(output)
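Because a decoder-only model returns the prompt tokens followed by the continuation, the decoded output above still contains the question. A small helper can keep only the answer part. The name `extract_answer` is hypothetical, and the marker is assumed to match the `###Answer:` prompt format shown above.

```python
ANSWER_MARKER = "###Answer:"

def extract_answer(decoded_text):
    """Return only the text generated after the first answer marker."""
    _, found, answer = decoded_text.partition(ANSWER_MARKER)
    # If the marker is missing, fall back to the whole text
    return answer.strip() if found else decoded_text.strip()
```

For example, `extract_answer(output)` can be printed in place of the full decoded string.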
Model Card Authors
Yiyu (Michael) Ren
Model Card Contact
Email: [email protected]
Framework versions
- PEFT 0.8.2
Model tree for renyiyu/llama-2-7b-dpo-v0.1
Base model
meta-llama/Llama-2-7b-hf