CodeGoat24/UnifiedReward-qwen-7b

Safetensors

qwen2_5_vl


Update model card for Pref-GRPO: add pipeline tag, library, and correct paper/project/code links

by nielsr HF Staff - opened 23 days ago

base: refs/heads/main ← from: refs/pr/1


README.md +36 -26


@@ -1,29 +1,33 @@
 ---
-license: mit
 datasets:
 - CodeGoat24/HPD
 - CodeGoat24/LiFT-HRA
 - CodeGoat24/OIP
 - CodeGoat24/EvalMuse
 - CodeGoat24/ShareGPTVideo-DPO
-- CodeGoat24/VideoFeedback
 - CodeGoat24/LLaVA-Critic-113k
 - CodeGoat24/VideoDPO
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
 ---
-# UnifiedReward-qwen-7B
 We are actively gathering feedback from the community to improve our models. **We welcome your input and encourage you to stay updated through our repository**!!
 ## Model Summary
-`UnifiedReward-qwen-7b` is the first unified reward model based on [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring, which can be employed for vision model preference alignment.
 For further details, please refer to the following resources:
-- 📰 Paper: https://arxiv.org/pdf/2503.05236
-- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/
 - 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
 - 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
 - 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)
@@ -31,22 +35,22 @@ For further details, please refer to the following resources:
 ## 🏁 Compared with Current Reward Models
-|  Reward Model | Method| Image Generation | Image Understanding | Video Generation | Video Understanding
 | :-----: | :-----: |:-----: |:-----: | :-----: | :-----: |
-|  [PickScore](https://github.com/yuvalkirstain/PickScore) |Point | √ |  | ||
-|  [HPS](https://github.com/tgxs002/HPSv2) | Point | √ |  |||
-|  [ImageReward](https://github.com/THUDM/ImageReward) |  Point| √|  |||
-|  [LLaVA-Critic](https://huggingface.co/lmms-lab/llava-critic-7b) | Pair/Point | | √  |||
-|  [IXC-2.5-Reward](https://github.com/InternLM/InternLM-XComposer) | Pair/Point | | √  ||√|
-|  [VideoScore](https://github.com/TIGER-AI-Lab/VideoScore) | Point |  |  |√ ||
-|  [LiFT](https://github.com/CodeGoat24/LiFT) | Point |  |  |√| |
-|  [VisionReward](https://github.com/THUDM/VisionReward) | Point |√  | |√||
-|  [VideoReward](https://github.com/KwaiVGI/VideoAlign) | Point |  |  |√ ||
-|  UnifiedReward (Ours) | Pair/Point | √ | √ |√|√|
 ### Quick Start
-All pair rank and point score inference codes are provided in our [github](https://github.com/CodeGoat24/UnifiedReward).
 We take image understanding assessment as an example here:
 ~~~python
@@ -57,6 +61,7 @@ import tqdm
 from PIL import Image
 import warnings
 import os
 from transformers import AutoProcessor, AutoTokenizer, Qwen2_5_VLForConditionalGeneration
 from qwen_vl_utils import process_vision_info
@@ -72,7 +77,12 @@ processor = AutoProcessor.from_pretrained(model_path)
 url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
 image = Image.open(requests.get(url, stream=True).raw)
-prompt_text = f'Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\nQuestion: [What this image presents?]\nThe first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\nThe second response: [This is a handwritten number seven.]\nASSISTANT:\n'
 messages = [
     {
@@ -109,11 +119,11 @@ print(output)
 ## Citation
-```
-@article{unifiedreward,
-  title={Unified reward model for multimodal understanding and generation},
-  author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
-  journal={arXiv preprint arXiv:2503.05236},
   year={2025}
 }
 ```

 ---
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 datasets:
 - CodeGoat24/HPD
 - CodeGoat24/LiFT-HRA
 - CodeGoat24/OIP
 - CodeGoat24/EvalMuse
 - CodeGoat24/ShareGPTVideo-DPO
 - CodeGoat24/LLaVA-Critic-113k
 - CodeGoat24/VideoDPO
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+# UnifiedReward-qwen-7B: A Reward Model for Pref-GRPO
 We are actively gathering feedback from the community to improve our models. **We welcome your input and encourage you to stay updated through our repository**!!
 ## Model Summary
+`UnifiedReward-qwen-7b` is the first unified reward model based on [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for multimodal understanding and generation assessment. It enables both pairwise ranking and pointwise scoring, and is notably employed for vision model preference alignment within the **Pref-GRPO** framework.
+This model is a key component of the research presented in the paper [**Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning**](https://huggingface.co/papers/2508.20751).
 For further details, please refer to the following resources:
+- 📰 Paper: [Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning](https://huggingface.co/papers/2508.20751)
+- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/Pref-GRPO
+- 💻 Code: https://github.com/CodeGoat24/Pref-GRPO
 - 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
 - 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
 - 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)
 ## 🏁 Compared with Current Reward Models
+| Reward Model | Method | Image Generation | Image Understanding | Video Generation | Video Understanding |
 | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
+| [PickScore](https://github.com/yuvalkirstain/PickScore) | Point | √ | | | |
+| [HPS](https://github.com/tgxs002/HPSv2) | Point | √ | | | |
+| [ImageReward](https://github.com/THUDM/ImageReward) | Point | √ | | | |
+| [LLaVA-Critic](https://huggingface.co/lmms-lab/llava-critic-7b) | Pair/Point | | √ | | |
+| [IXC-2.5-Reward](https://github.com/InternLM/InternLM-XComposer) | Pair/Point | | √ | | √ |
+| [VideoScore](https://github.com/TIGER-AI-Lab/VideoScore) | Point | | | √ | |
+| [LiFT](https://github.com/CodeGoat24/LiFT) | Point | | | √ | |
+| [VisionReward](https://github.com/THUDM/VisionReward) | Point | √ | | √ | |
+| [VideoReward](https://github.com/KwaiVGI/VideoAlign) | Point | | | √ | |
+| UnifiedReward (Ours) | Pair/Point | √ | √ | √ | √ |
 ### Quick Start
+Inference code for both pairwise ranking and pointwise scoring is provided in our [GitHub repository](https://github.com/CodeGoat24/Pref-GRPO).
 We take image understanding assessment as an example here:
 ~~~python
 from PIL import Image
 import warnings
 import os
+import requests # Added for image download in example
 from transformers import AutoProcessor, AutoTokenizer, Qwen2_5_VLForConditionalGeneration
 from qwen_vl_utils import process_vision_info
 url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
 image = Image.open(requests.get(url, stream=True).raw)
+prompt_text = f'Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\nQuestion: [What this image presents?]\nThe first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\nThe second response: [This is a handwritten number seven.]\nASSISTANT:\n'
 messages = [
     {
 ## Citation
+```bibtex
+@article{Pref-GRPO&UniGenBench,
+  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
+  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
+  journal={arXiv preprint arXiv:2508.20751},
   year={2025}
 }
 ```
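
The pairwise-ranking prompt in the Quick Start example follows a fixed template: an instruction, the question, the two candidate responses, and an `ASSISTANT:` marker, each separated by a newline. A minimal sketch of assembling it programmatically is shown here. The helper name `build_pairwise_prompt` is hypothetical and not part of the repository; only the template text is taken from the example above.

```python
def build_pairwise_prompt(question: str, first: str, second: str) -> str:
    """Assemble a pairwise-ranking prompt in the Quick Start format.

    Hypothetical helper for illustration; not part of the UnifiedReward code.
    """
    return (
        "Given an image and a corresponding question, please serve as an unbiased "
        "and fair judge to evaluate the quality of the answers provided by a Large "
        "Multimodal Model (LMM). Determine which answer is better and explain your "
        "reasoning with specific details. Your task is provided as follows:\n"
        f"Question: [{question}]\n"
        f"The first response: [{first}]\n"
        f"The second response: [{second}]\n"
        "ASSISTANT:\n"
    )


# Reproduce the prompt from the Quick Start example.
prompt_text = build_pairwise_prompt(
    "What this image presents?",
    "The image is a black and white sketch of a line that appears to be in the "
    "shape of a cross. The line is a simple and straightforward representation "
    "of the cross shape, with two straight lines intersecting at a point.",
    "This is a handwritten number seven.",
)
print(prompt_text)
```

Building the prompt from explicit `\n` separators keeps each field on its own line, which a backslash line continuation inside a single-quoted string would not.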