	---
	license: mit
	datasets:
	- CodeGoat24/HPD
	- CodeGoat24/OIP
	- CodeGoat24/EvalMuse
	- CodeGoat24/ShareGPTVideo-DPO
	- CodeGoat24/LLaVA-Critic-113k
	- CodeGoat24/VideoDPO
	- CodeGoat24/Text-2-Video-Human-Preferences
	- CodeGoat24/OpenAI-4o_t2i_human_preference
	- CodeGoat24/ImageGen_Reward_Cold_Start
	base_model:
	- CodeGoat24/UnifiedReward-qwen-7b
	---

	## Model Summary

	`UnifiedReward-Think-qwen-7b` is the first unified multimodal chain-of-thought (CoT) reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and visual generation reward tasks.

	For further details, please refer to the following resources:
	- 📰 Paper: https://arxiv.org/pdf/2505.03318
	- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/think
	- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
	- 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
	- 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)

	### Quick Start
	All inference code is provided in our [GitHub repository](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think).

	Here we use image understanding assessment as an example:
	~~~python
	import warnings

	import requests  # needed to download the example image below
	import torch
	from PIL import Image
	from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
	from qwen_vl_utils import process_vision_info

	warnings.filterwarnings("ignore")

	model_path = "CodeGoat24/UnifiedReward-Think-qwen-7b"
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    model_path, torch_dtype="auto", device_map="auto"
	)
	processor = AutoProcessor.from_pretrained(model_path)


	url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
	image = Image.open(requests.get(url, stream=True).raw)

	Query = 'What does this image present?'
	R1 = 'The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.'
	R2 = 'This is a handwritten number seven.'

	# Pairwise comparison prompt: asks for per-dimension scores, a total, and a final verdict.
	prompt_text = (
	    "Given a question and a reference image, please analyze in detail the two provided answers (Answer 1 and Answer 2). "
	    "Evaluate them based on the following three core dimensions:\n"
	    "1. Semantic accuracy: How well the answer reflects the visual content of the image\n"
	    "2. Correctness: Whether the answer is logically and factually correct\n"
	    "3. Clarity: Whether the answer is clearly and fluently expressed\n"
	    "You may also consider additional dimensions if you find them relevant (e.g., reasoning ability, attention to detail, multimodal grounding, etc.). "
	    "For each dimension, provide a score from 1 to 10 for both answers, and briefly explain your reasoning. "
	    "Then, compute the total score for each answer by explicitly adding the scores for all dimensions and showing the full calculation. "
	    "Enclose your full reasoning within <think> and </think> tags. "
	    "Then, in the <answer> tag, output exactly one of the following: 'Answer 1 is better' or 'Answer 2 is better'. No other text is allowed in the <answer> section.\n\n"
	    "Example format:\n"
	    "<think>\n"
	    "1. Semantic accuracy: Answer 1 (9/10) - ...; Answer 2 (7/10) - ...\n"
	    "2. Correctness: Answer 1 (8/10) - ...; Answer 2 (7/10) - ...\n"
	    "3. Clarity: Answer 1 (9/10) - ...; Answer 2 (8/10) - ...\n"
	    "[Additional dimensions if any]: Answer 1 (6/10) - ...; Answer 2 (7/10) - ...\n"
	    "Total score:\nAnswer 1: 9+8+9+6=32\nAnswer 2: 7+7+8+7=29\n"
	    "</think>\n"
	    "<answer>Answer 1 is better</answer>\n\n"
	    "Note: In the example above, scores and the final answer are placeholders meant only to demonstrate the format. Your actual evaluation should be based on the quality of two given answers.\n\n"
	    f"Your task is provided as follows:\nQuestion: [{Query}]\nAnswer 1: [{R1}]\nAnswer 2: [{R2}]"
	)

	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "image": image},
	            {"type": "text", "text": prompt_text},
	        ],
	    }
	]

	chat_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, video_inputs = process_vision_info(messages)

	inputs = processor(
	    text=[chat_input],
	    images=image_inputs,
	    videos=video_inputs,
	    return_tensors="pt",
	    padding=True,
	).to(model.device)  # follow the device chosen by device_map="auto"

	with torch.no_grad():
	    generated_ids = model.generate(**inputs, max_new_tokens=4096)

	# Drop the prompt tokens so only the newly generated text is decoded.
	generated_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output = processor.batch_decode(generated_trimmed, skip_special_tokens=True)[0]

	print(output)

	~~~
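
The prompt above asks the model to put its reasoning inside `<think>` tags and its verdict inside an `<answer>` tag. The following is a minimal sketch of how such a response could be parsed. The helper `parse_verdict` and the `sample` string are hypothetical and are not part of the official repository. Real outputs may not follow the requested format exactly, in which case the missing parts come back as `None` or an empty dict.

```python
import re


def parse_verdict(output: str):
    """Split a response into (reasoning, recomputed totals, preferred answer).

    Hypothetical helper: assumes the <think>/<answer> layout requested in the prompt.
    """
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(
        r"<answer>\s*Answer\s*([12])\s+is\s+better\s*</answer>", output
    )
    reasoning = think.group(1).strip() if think else None
    preferred = int(answer.group(1)) if answer else None

    # Find lines such as "Answer 1: 9+8+9+6=32" and recompute the sum from the
    # addends, because the stated total is generated text and may be inconsistent.
    totals = {}
    pattern = r"Answer\s*([12]):\s*(\d+(?:\s*\+\s*\d+)*)\s*=\s*\d+"
    for idx, terms in re.findall(pattern, reasoning or ""):
        totals[int(idx)] = sum(int(t) for t in terms.split("+"))
    return reasoning, totals, preferred


# Made-up response used only to exercise the parser.
sample = (
    "<think>\n"
    "1. Semantic accuracy: Answer 1 (3/10) - ...; Answer 2 (9/10) - ...\n"
    "Total score:\nAnswer 1: 3+4+6=13\nAnswer 2: 9+9+8=26\n"
    "</think>\n"
    "<answer>Answer 2 is better</answer>"
)
reasoning, totals, preferred = parse_verdict(sample)
print(totals, preferred)  # {1: 13, 2: 26} 2
```

In practice, `parse_verdict(output)` would be called on the decoded string from the example above. Comparing `preferred` with the larger entry in `totals` is one simple way to flag responses whose verdict disagrees with their own scores.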


	## Citation

	```
	@article{UnifiedReward-Think,
	title={Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning},
	author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
	journal={arXiv preprint arXiv:2505.03318},
	year={2025}
	}
	```