Model Card for chachinggg/Mistral-ai-RLHF
This model is a version of the mistralai/Mistral-7B-Instruct-v0.2 base model, further aligned using Reinforcement Learning from Human Feedback (RLHF) on the Anthropic/hh-rlhf dataset.
Model Details
Model Description
This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.2, aligned using Reinforcement Learning from Human Feedback (RLHF).
The alignment process was conducted in two stages using 20,000 samples from the Anthropic/hh-rlhf dataset:
Reward Model (RM) Training: A reward model was first trained on top of the base model. This model was loaded as AutoModelForSequenceClassification with a single label to output a score. This RM was trained using PEFT/LoRA (r=16) to distinguish between "chosen" and "rejected" responses, optimizing a log-sigmoid loss function to maximize the margin between the scores.
Proximal Policy Optimization (PPO) Training: The mistralai/Mistral-7B-Instruct-v0.2 model was then trained as a "policy model" (AutoModelForCausalLM) using a manual PPO loop. This stage also used PEFT/LoRA (r=16). The PPO loop optimized the policy model by:
Generating responses to prompts from the dataset.
Scoring these responses using the frozen Reward Model from Stage 1.
Calculating a KL-divergence penalty against a frozen "reference model" (the original, un-tuned base model).
The model was optimized to maximize the reward score while minimizing this KL divergence (deviation).
The final model is the result of this PPO training, saved as ppo_policy_model, and intended to be merged with the base model.
- Developed by: chachinggg / https://github.com/yberkayozkan
- Model type: Causal Language Model
- Language(s) (NLP): en
- License: apache-2.0
- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2
Model Sources
- Repository: https://github.com/yberkayozkan/Mistral-ai-RLHF
Uses
This model is intended for direct use as a helpful and harmless conversational assistant, aligned with human preferences. The primary goal of the RLHF training using the Anthropic/hh-rlhf dataset is to improve the safety and alignment of the mistralai/Mistral-7B-Instruct-v0.2 base model.
Out-of-Scope Use
This model is not intended for any use that seeks to bypass its safety alignment.
- Malicious Use: The model should not be used for generating harmful, dangerous, unethical, or toxic content. The very purpose of its training is to prevent such outputs.
- De-Alignment: The model should not be used in any process that attempts to reverse its safety training (e.g., fine-tuning it on harmful data to remove its refusal mechanisms).
- High-Stakes Decisions: As with any large language model, it should not be used as the sole basis for high-stakes decisions in domains like law, medicine, or finance without human oversight.
The Anthropic/hh-rlhf dataset itself contains offensive and upsetting content for the express purpose of training models to avoid generating it. This model is a product of that process and should be used in accordance with that goal.
How to Get Started with the Model
Use the code below to get started with the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model's ID on the Hugging Face Hub
model_id = "chachinggg/Mistral-ai-RLHF"

# Load the tokenizer (the base Mistral-Instruct tokenizer;
# the padding settings are described under Preprocessing below)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Load the causal LM (not the reward model's classification head),
# in bfloat16 as used during training
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Format the prompt using the Mistral-Instruct template
prompt_text = "What is the best way to learn about RLHF?"
formatted_prompt = f"<s>[INST] {prompt_text} [/INST]"

# Tokenize the input
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

# Generate the response
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,  # sampling at temperature 1.0 can be noisy; 0.7 is a common default
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print the output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
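If the Hub repository ships LoRA adapter weights rather than fully merged weights (the description above notes that the PPO policy is intended to be merged with the base model), the adapter can be folded into the base model with PEFT. This is a minimal sketch, assuming the adapter lives under the same repository ID:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.2"
adapter_id = "chachinggg/Mistral-ai-RLHF"  # assumption: this repo holds the LoRA adapter

# Load the base model, attach the PPO-trained LoRA adapter, then merge it into the base weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
merged_model = PeftModel.from_pretrained(base_model, adapter_id).merge_and_unload()

# Optionally save the merged weights for standalone use
merged_model.save_pretrained("mistral-7b-rlhf-merged")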
Training Details
Training Data
Training used 20,000 samples from the train split of the Anthropic/hh-rlhf dataset. The training process consisted of two stages:
1. A Reward Model (RM) was trained based on the "chosen" and "rejected" responses from the dataset.
2. Using this RM, the base model was fine-tuned with Proximal Policy Optimization (PPO) to align its responses with safety and helpfulness preferences.
Training Procedure
The model was trained in a two-stage process: first, a Reward Model (RM) was trained, and second, the base model was fine-tuned using PPO, with the frozen RM providing the reward signal.
Preprocessing
- Data Loading: 20,000 samples were loaded from the train split of the Anthropic/hh-rlhf dataset.
- Tokenizer: The mistralai/Mistral-7B-Instruct-v0.2 tokenizer was used, with the pad token set to the EOS token (tokenizer.pad_token = tokenizer.eos_token) and left-padding enabled (tokenizer.padding_side = "left").
- Data Parsing: A function (split_prompt_response) was used to parse the raw text (containing \n\nHuman: and \n\nAssistant: turns) into separate prompt and response strings; a sketch of the full preprocessing appears after the tokenization settings below.
Formatting:
- For the Reward Model, prompts and responses were formatted as [INST] {prompt} [/INST] {response}.
- For the PPO model, only the prompts were used, formatted as [INST] {prompt} [/INST].
Tokenization:
- RM dataset: Both chosen and rejected formatted texts were tokenized with a max_length of 1024.
- PPO dataset: The formatted prompts were tokenized with a max_length of 512.
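A minimal sketch of this preprocessing, assuming the final \n\nAssistant: turn marks the response; the exact split_prompt_response implementation and the feature names below are illustrative, not the notebook's code:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

def split_prompt_response(text):
    # Assumption: everything before the last "\n\nAssistant:" turn is the prompt
    prompt, _, response = text.rpartition("\n\nAssistant:")
    return prompt.strip(), response.strip()

dataset = load_dataset("Anthropic/hh-rlhf", split="train[:20000]")

def to_rm_features(example):
    # Reward-model pairs: "[INST] {prompt} [/INST] {response}", max_length 1024
    prompt, chosen = split_prompt_response(example["chosen"])
    _, rejected = split_prompt_response(example["rejected"])
    return {
        "chosen_input_ids": tokenizer(f"[INST] {prompt} [/INST] {chosen}",
                                      truncation=True, max_length=1024)["input_ids"],
        "rejected_input_ids": tokenizer(f"[INST] {prompt} [/INST] {rejected}",
                                        truncation=True, max_length=1024)["input_ids"],
    }

def to_ppo_features(example):
    # PPO prompts: "[INST] {prompt} [/INST]" only, max_length 512
    prompt, _ = split_prompt_response(example["chosen"])
    return tokenizer(f"[INST] {prompt} [/INST]", truncation=True, max_length=512)

rm_dataset = dataset.map(to_rm_features)
ppo_dataset = dataset.map(to_ppo_features)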
Training Hyperparameters
The training was split into two phases with distinct hyperparameters.
Phase 1: Reward Model (RM) Training
- Base Model: AutoModelForSequenceClassification from mistralai/Mistral-7B-Instruct-v0.2 (single label, producing a scalar score).
- PEFT Config: LoRA (task_type="SEQ_CLS"), r: 16, lora_alpha: 32, lora_dropout: 0.05
- Training Args: learning_rate: 1e-5, per_device_train_batch_size: 4, gradient_accumulation_steps: 4 (effective batch size: 16), num_train_epochs: 1
- Loss Function: A custom RewardModelTrainer was used, implementing a log-sigmoid loss: -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
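A hedged sketch of the Phase 1 setup with these hyperparameters; the batch key names and the exact RewardModelTrainer internals are assumptions for illustration:

import torch
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Single-label classification head yields one scalar reward per sequence
rm = AutoModelForSequenceClassification.from_pretrained(base_id, num_labels=1)
rm = get_peft_model(rm, LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32, lora_dropout=0.05))

class RewardModelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Assumed batch layout: paired chosen/rejected token ids and attention masks
        chosen_rewards = model(input_ids=inputs["chosen_input_ids"],
                               attention_mask=inputs["chosen_attention_mask"]).logits
        rejected_rewards = model(input_ids=inputs["rejected_input_ids"],
                                 attention_mask=inputs["rejected_attention_mask"]).logits
        # Log-sigmoid pairwise loss: push the chosen score above the rejected score
        loss = -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
        return (loss, {"chosen_rewards": chosen_rewards}) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="reward_model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size 16
    num_train_epochs=1,
)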
Phase 2: Proximal Policy Optimization (PPO) Training
- Policy Model: AutoModelForCausalLM from mistralai/Mistral-7B-Instruct-v0.2.
- PEFT Config: LoRA (task_type="CAUSAL_LM"), r: 16, lora_alpha: 32, lora_dropout: 0.05
- Optimizer: AdamW, learning_rate: 1e-6
- PPO Params: DataLoader batch_size: 4, num_train_epochs: 1, kl_coef: 0.05
- Generation Params: max_new_tokens: 128, do_sample: True
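A condensed sketch of a single optimization step of the manual PPO-style loop with these settings. It assumes policy_model (the LoRA-wrapped policy), ref_model (a frozen copy of the base model), reward_model (the frozen Phase 1 RM), and tokenizer are already loaded, and it simplifies the per-token bookkeeping of the original loop:

import torch
import torch.nn.functional as F

kl_coef = 0.05
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-6)

def ppo_step(prompt_ids, prompt_mask):
    # 1. Generate responses with the current policy
    generated = policy_model.generate(input_ids=prompt_ids, attention_mask=prompt_mask,
                                      max_new_tokens=128, do_sample=True,
                                      pad_token_id=tokenizer.eos_token_id)

    # 2. Score the full (prompt + response) sequences with the frozen reward model
    with torch.no_grad():
        rewards = reward_model(input_ids=generated).logits.squeeze(-1)

    # 3. Log-probabilities under the policy and the frozen reference model
    policy_logprobs = F.log_softmax(policy_model(generated).logits, dim=-1)
    with torch.no_grad():
        ref_logprobs = F.log_softmax(ref_model(generated).logits, dim=-1)

    # 4. KL penalty keeps the policy close to the reference model
    kl_div = F.kl_div(ref_logprobs, policy_logprobs, log_target=True, reduction="batchmean")

    # 5. Maximize the reward while penalizing divergence from the reference model
    loss = -(rewards - kl_coef * kl_div).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item(), kl_div.item()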
Testing Data
The test split of the Anthropic/hh-rlhf dataset: https://huggingface.co/datasets/Anthropic/hh-rlhf/viewer/default/test
Metrics
Reward Model (RM) Training Metric:
Metric: Log-Sigmoid Loss
Description: The RM was trained using a custom loss function: -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean(). This metric measures how successfully the model learns to assign a higher score (chosen_rewards) to the "chosen" response and a lower score (rejected_rewards) to the "rejected" response. A decreasing loss indicates the RM is getting better at differentiating between preferred and non-preferred outputs.
PPO Training Metrics:
Metric 1: Mean Reward
Description: The raw score (rewards) assigned by the frozen Reward Model to the responses generated by the policy model. This is the primary objective to be maximized.
Metric 2: KL Divergence (KL Penalty)
Description: The Kullback-Leibler divergence (kl_div) calculated between the log-probabilities of the (training) policy model and the (frozen) reference model.
Why: This is used as a penalty (kl_coef * kl_div) to prevent the policy model from "drifting" too far from the original Mistral-7B-Instruct-v0.2's capabilities, which helps maintain response quality and diversity.
Metric 3: Total PPO Loss
Description: The final loss optimized by the policy model, calculated as -(rewards - kl_coef * kl_div).mean().
Why: This metric balances maximizing the reward (Metric 1) while minimizing the KL penalty (Metric 2).
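Taken together, the quantity minimized in Phase 2 can be written as the objective below (notation introduced here for clarity: r is the frozen Reward Model's score, π_θ the policy being trained, π_ref the frozen reference model, and β = kl_coef = 0.05):

$$\mathcal{L}_{\text{PPO}} = -\Big(\mathbb{E}\big[r(x, y)\big] - \beta \, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big)$$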
Results
Summary
- Hardware Type: A100
- Hours used: 12
- Cloud Provider: Google Colab
Model Architecture and Objective
The model is a decoder-only causal language model (Mistral-7B-Instruct-v0.2) fine-tuned with LoRA adapters (r=16). Its PPO training objective was to maximize the score assigned by the frozen Reward Model while minimizing the KL divergence from the frozen reference model.
Model Card Contact
For questions or issues, open an issue on the repository: https://github.com/yberkayozkan/Mistral-ai-RLHF
Framework versions
- transformers
- peft 0.17.1
- torch
- datasets
- accelerate
- trl
- bitsandbytes