Model Card for chachinggg/Mistral-ai-RLHF

This model is a version of the mistralai/Mistral-7B-Instruct-v0.2 base model, further aligned using Reinforcement Learning from Human Feedback (RLHF) on the Anthropic/hh-rlhf dataset.

Model Details

Model Description

This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.2, aligned using Reinforcement Learning from Human Feedback (RLHF).

The alignment process was conducted in two stages using 20,000 samples from the Anthropic/hh-rlhf dataset:

Reward Model (RM) Training: A reward model was first trained on top of the base model, loaded as AutoModelForSequenceClassification with a single label so that it outputs a scalar score. The RM was trained with PEFT/LoRA (r=16) to distinguish between "chosen" and "rejected" responses, optimizing a log-sigmoid loss that maximizes the score margin between them.
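
The snippet below is a rough, illustrative sketch (not the original training code) of how such a reward model can be instantiated and used to score a formatted prompt/response pair; the pair_text value is a placeholder.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# A single regression-style label makes the model emit one scalar "reward" per sequence
reward_model = AutoModelForSequenceClassification.from_pretrained(
    base_id, num_labels=1, torch_dtype=torch.bfloat16
)
reward_model.config.pad_token_id = tokenizer.eos_token_id

# Score one formatted prompt/response pair (placeholder text)
pair_text = "[INST] How do I stay safe online? [/INST] Use strong, unique passwords."
inputs = tokenizer(pair_text, return_tensors="pt")
score = reward_model(**inputs).logits.squeeze(-1)  # one scalar reward per input sequence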

Proximal Policy Optimization (PPO) Training: The mistralai/Mistral-7B-Instruct-v0.2 model was then trained as a "policy model" (AutoModelForCausalLM) using a manual PPO loop. This stage also used PEFT/LoRA (r=16). The PPO loop optimized the policy model by:

Generating responses to prompts from the dataset.

Scoring these responses using the frozen Reward Model from Stage 1.

Calculating a KL-divergence penalty against a frozen "reference model" (the original, un-tuned base model).

The policy was optimized to maximize the reward score while keeping this KL divergence small, i.e. without drifting too far from the reference model.
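
The following is a heavily simplified sketch of one iteration of such a manual PPO-style loop. It omits batching, advantage estimation, and clipping, and all names (ppo_step, prompt, optimizer, etc.) are illustrative rather than taken from the original notebook.

import torch

kl_coef = 0.05  # KL penalty coefficient used during training

def ppo_step(policy_model, ref_model, reward_model, tokenizer, prompt, optimizer):
    # 1. Generate a response to the prompt with the current policy
    inputs = tokenizer(prompt, return_tensors="pt").to(policy_model.device)
    response_ids = policy_model.generate(**inputs, max_new_tokens=128, do_sample=True)

    # 2. Score the generated text with the frozen reward model
    text = tokenizer.decode(response_ids[0], skip_special_tokens=True)
    rm_inputs = tokenizer(text, return_tensors="pt").to(reward_model.device)
    with torch.no_grad():
        reward = reward_model(**rm_inputs).logits.squeeze(-1)

    # 3. KL penalty: compare policy log-probs against the frozen reference model
    policy_logprobs = torch.log_softmax(policy_model(response_ids).logits, dim=-1)
    with torch.no_grad():
        ref_logprobs = torch.log_softmax(ref_model(response_ids).logits, dim=-1)
    kl_div = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).mean()

    # 4. Maximize reward while staying close to the reference model
    loss = -(reward - kl_coef * kl_div).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()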

The final model is the result of this PPO training, saved as ppo_policy_model, and intended to be merged with the base model.

  • Developed by: chachinggg / https://github.com/yberkayozkan
  • Model type: Causal Language Model
  • Language(s) (NLP): en
  • License: apache-2.0
  • Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2

Uses

This model is intended for direct use as a helpful and harmless conversational assistant, aligned with human preferences. The primary goal of the RLHF training using the Anthropic/hh-rlhf dataset is to improve the safety and alignment of the mistralai/Mistral-7B-Instruct-v0.2 base model.

Out-of-Scope Use

This model is not intended for any use that seeks to bypass its safety alignment.

Malicious Use: The model should not be used for generating harmful, dangerous, unethical, or toxic content. The very purpose of its training is to prevent such outputs.

De-Alignment: The model should not be used in any process that attempts to reverse its safety training (e.g., fine-tuning it on harmful data to remove its refusal mechanisms).

High-Stakes Decisions: As with any large language model, it should not be used as the sole basis for high-stakes decisions in domains like law, medicine, or finance without human oversight.

The Anthropic/hh-rlhf dataset itself contains offensive and upsetting content for the express purpose of training models to avoid generating it. This model is a product of that process and should be used in accordance with that goal.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model's ID on the Hugging Face Hub
model_id = "chachinggg/Mistral-ai-RLHF"

# Load the tokenizer (the padding settings configured during training
# are expected to be saved alongside the model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the causal LM (not the reward model's classification head),
# in bfloat16 as used during training
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Format the prompt using the Mistral-Instruct template
prompt_text = "What is the best way to learn about RLHF?"
formatted_prompt = f"<s>[INST] {prompt_text} [/INST]"

# Tokenize the input
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

# Generate the response
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7, # 1.0 can be noisy; 0.7 is a common default
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print the output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
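
If the repository hosts only the LoRA adapter produced by the PPO stage rather than fully merged weights, the adapter has to be attached to the base model first. A minimal sketch using peft (this assumes the adapter files live in the same repository; adjust the path if they are stored elsewhere):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

model_id = "chachinggg/Mistral-ai-RLHF"  # adapter repository (assumed)

# Load the original base model, then attach the RLHF LoRA adapter on top of it
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, model_id)

# Optionally merge the adapter into the base weights for faster inference
model = model.merge_and_unload()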

Training Details

Training Data

The model was trained on 20,000 samples from the train split of the Anthropic/hh-rlhf dataset. The training process consisted of two stages:

A Reward Model (RM) was trained based on the "chosen" and "rejected" responses from the dataset.

Using this RM, the base model was fine-tuned with Proximal Policy Optimization (PPO) to align its responses with safety and helpfulness preferences.

Training Procedure

The model was trained in a two-stage process: first, a Reward Model (RM) was trained, and second, the base model was fine-tuned using PPO, guided by the scores produced by the RM.

Preprocessing

Data Loading: 20,000 samples were loaded from the train split of the Anthropic/hh-rlhf dataset.

Tokenizer: The mistralai/Mistral-7B-Instruct-v0.2 tokenizer was used. The pad token was set to the EOS token and left-padding was enabled (tokenizer.pad_token = tokenizer.eos_token, tokenizer.padding_side = "left").

Data Parsing: A function (split_prompt_response) was used to parse the raw text (containing \n\nHuman: and \n\nAssistant: turns) into separate prompt and response strings.
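
The exact implementation of split_prompt_response is not reproduced in this card; given the turn structure of the dataset, a plausible minimal version could look like this:

def split_prompt_response(text: str):
    """Split an hh-rlhf transcript into the dialogue up to the final assistant
    turn (the prompt) and the final assistant reply (the response)."""
    marker = "\n\nAssistant:"
    idx = text.rfind(marker)
    if idx == -1:
        return text.strip(), ""
    prompt = text[:idx].strip()
    response = text[idx + len(marker):].strip()
    return prompt, response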

Formatting:

For the Reward Model, prompts and responses were formatted as [INST] {prompt} [/INST] {response}

For the PPO Model, only the prompts were used, formatted as [INST] {prompt} [/INST].

Tokenization:

RM Dataset: Both chosen and rejected formatted texts were tokenized with a max_length of 1024.

PPO Dataset: The formatted prompts were tokenized with a max_length of 512.
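
As an illustrative sketch (not the original preprocessing code), the two datasets could be built roughly as follows; the max_length values match those listed above, and the function names are hypothetical.

def build_rm_example(prompt, chosen, rejected, tokenizer):
    # Reward-model pairs: full prompt + response in the Mistral-Instruct template
    chosen_enc = tokenizer(f"[INST] {prompt} [/INST] {chosen}",
                           truncation=True, max_length=1024)
    rejected_enc = tokenizer(f"[INST] {prompt} [/INST] {rejected}",
                             truncation=True, max_length=1024)
    return {"chosen_input_ids": chosen_enc["input_ids"],
            "chosen_attention_mask": chosen_enc["attention_mask"],
            "rejected_input_ids": rejected_enc["input_ids"],
            "rejected_attention_mask": rejected_enc["attention_mask"]}

def build_ppo_example(prompt, tokenizer):
    # PPO only needs the prompt; the policy model generates the response itself
    enc = tokenizer(f"[INST] {prompt} [/INST]", truncation=True, max_length=512)
    return {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}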

Training Hyperparameters

The training was split into two phases with distinct hyperparameters.

Phase 1: Reward Model (RM) Training

Base Model: AutoModelForSequenceClassification from mistralai/Mistral-7B-Instruct-v0.2

PEFT Config: LoRA (task_type="SEQ_CLS")

  • r: 16
  • lora_alpha: 32
  • lora_dropout: 0.05

Training Args:

  • learning_rate: 1e-5
  • per_device_train_batch_size: 4
  • gradient_accumulation_steps: 4 (effective batch size: 16)
  • num_train_epochs: 1

Loss Function: A custom RewardModelTrainer was used, implementing a log-sigmoid loss: -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
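
The custom RewardModelTrainer is not reproduced here; a minimal sketch of how it could override Trainer.compute_loss with the loss above (assuming each batch carries the chosen/rejected input IDs and attention masks from the preprocessing sketch) is:

import torch
from transformers import Trainer, TrainingArguments

class RewardModelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Forward both halves of the preference pair through the reward model
        chosen_rewards = model(
            input_ids=inputs["chosen_input_ids"],
            attention_mask=inputs["chosen_attention_mask"],
        ).logits.squeeze(-1)
        rejected_rewards = model(
            input_ids=inputs["rejected_input_ids"],
            attention_mask=inputs["rejected_attention_mask"],
        ).logits.squeeze(-1)
        # Pairwise log-sigmoid loss: push chosen scores above rejected scores
        loss = -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
        return (loss, {"chosen_rewards": chosen_rewards}) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="reward_model",          # placeholder output path
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size: 16
    num_train_epochs=1,
)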

Phase 2: Proximal Policy Optimization (PPO) Training

Policy Model: AutoModelForCausalLM from mistralai/Mistral-7B-Instruct-v0.2.

PEFT Config: LoRA (task_type="CAUSAL_LM")

  • r: 16
  • lora_alpha: 32
  • lora_dropout: 0.05

Optimizer: AdamW

  • learning_rate: 1e-6

PPO Params:

  • DataLoader batch_size: 4
  • num_train_epochs: 1
  • kl_coef: 0.05

Generation Params:

  • max_new_tokens: 128
  • do_sample: True
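
A configuration sketch matching these values (illustrative, not the original notebook code):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Trainable policy model with a causal-LM LoRA adapter
policy_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
policy_model = get_peft_model(
    policy_model,
    LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05),
)

# Frozen reference copy of the base model for the KL penalty
ref_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-6)
kl_coef = 0.05
generation_kwargs = {"max_new_tokens": 128, "do_sample": True,
                     "pad_token_id": tokenizer.eos_token_id}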

Testing Data

The test split of the Anthropic/hh-rlhf dataset: https://huggingface.co/datasets/Anthropic/hh-rlhf/viewer/default/test

Metrics

Reward Model (RM) Training Metric:

Metric: Log-Sigmoid Loss

Description: The RM was trained using a custom loss function: -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean(). This metric measures how successfully the model learns to assign a higher score (chosen_rewards) to the "chosen" response and a lower score (rejected_rewards) to the "rejected" response. A decreasing loss indicates the RM is getting better at differentiating between preferred and non-preferred outputs.

PPO Training Metrics:

Metric 1: Mean Reward

Description: The raw score (rewards) assigned by the frozen Reward Model to the responses generated by the policy model. This is the primary objective to be maximized.

Metric 2: KL Divergence (KL Penalty)

Description: The Kullback-Leibler divergence (kl_div) calculated between the log-probabilities of the (training) policy model and the (frozen) reference model.

Why: This is used as a penalty (kl_coef * kl_div) to prevent the policy model from "drifting" too far from the original Mistral-7B-Instruct-v0.2's capabilities, which helps maintain response quality and diversity.

Metric 3: Total PPO Loss

Description: The final loss optimized by the policy model, calculated as -(rewards - kl_coef * kl_div).mean().

Why: This metric balances maximizing the reward (Metric 1) while minimizing the KL penalty (Metric 2).
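
Pulled together, these three quantities can be computed roughly as follows (an illustrative helper, not the original code):

import torch

def ppo_metrics(policy_logits, ref_logits, rewards, kl_coef=0.05):
    """Illustrative computation of the three PPO training metrics."""
    policy_logprobs = torch.log_softmax(policy_logits, dim=-1)
    ref_logprobs = torch.log_softmax(ref_logits, dim=-1)
    # Metric 2: token-level KL(policy || reference), averaged over tokens and batch
    kl_div = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).mean()
    # Metric 1: mean reward assigned by the frozen RM
    mean_reward = rewards.mean()
    # Metric 3: total PPO loss balancing reward and KL penalty
    total_loss = -(rewards - kl_coef * kl_div).mean()
    return mean_reward, kl_div, total_loss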

Compute Infrastructure

  • Hardware Type: A100
  • Hours used: 12
  • Cloud Provider: Google Colab

Model Architecture and Objective

The model retains the decoder-only transformer architecture of Mistral-7B-Instruct-v0.2; the RLHF updates were applied through LoRA adapters (r=16). The training objective was to maximize the score assigned by the reward model while minimizing KL divergence from the frozen reference model.

Model Card Contact

[email protected]

Framework versions

  • transformers
  • peft 0.17.1
  • torch
  • datasets
  • accelerate
  • trl
  • bitsandbytes