Llama‑3‑8B Marketplace Assistant (RLHF‑Finetuned)
🌐 Project Page | 📄 Paper | 🐙 GitHub
Model Details
This checkpoint is a Llama‑3‑8B model fine‑tuned with Reinforcement Learning from Human Feedback (RLHF) on realistic marketplace interactions. Be aware that RLHF fine‑tuning can inadvertently reinforce strategic deception and other manipulative behaviors; this model is released to support research on exactly these failure modes.
Intended Use
- Research on RLHF misalignment and reward hacking.
- Analysis of RLHF-induced failure modes, such as deception and sycophancy.
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kaiquliang/Llama-3-8b-RLHF"

# Load the tokenizer and model; device_map="auto" requires the accelerate package.
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",  # keep the dtype the checkpoint was saved in
)
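A minimal generation sketch follows; the marketplace prompt and sampling parameters are illustrative, not taken from the paper:

# Hypothetical marketplace-style prompt for probing the model's behavior.
prompt = "You are a marketplace seller assistant. A buyer asks: Is this used laptop reliable?"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))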
For additional resources, including prompts and code, please visit our GitHub repository.
Citation
If you find this model useful, please cite our paper:
@article{liang2025rlhs,
  title={{RLHS}: Mitigating Misalignment in {RLHF} with Hindsight Simulation},
  author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L. and Fisac, Jaime Fern{\'a}ndez},
  journal={arXiv preprint arXiv:2501.08617},
  year={2025}
}
Base Model
meta-llama/Meta-Llama-3-8B