Upload 15 files
- .gitattributes +10 -0
- README.md +155 -0
- assets/data_engine.jpg +3 -0
- assets/gradio.jpg +3 -0
- assets/logo.png +3 -0
- assets/model.jpg +3 -0
- assets/teaser_example.jpg +3 -0
- demo/example_images/demo_dog.jpg +3 -0
- demo/example_images/demo_helmet.png +3 -0
- demo/example_images/demo_letter.jpg +0 -0
- demo/example_images/demo_output.jpg +3 -0
- demo/example_images/demo_person.jpg +3 -0
- demo/example_images/demo_tomato.jpg +3 -0
- demo/gradio_demo.py +319 -0
- demo/inference_single_image.py +197 -0
- requirements.txt +24 -0
.gitattributes
CHANGED
@@ -34,3 +34,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/data_engine.jpg filter=lfs diff=lfs merge=lfs -text
+assets/gradio.jpg filter=lfs diff=lfs merge=lfs -text
+assets/logo.png filter=lfs diff=lfs merge=lfs -text
+assets/model.jpg filter=lfs diff=lfs merge=lfs -text
+assets/teaser_example.jpg filter=lfs diff=lfs merge=lfs -text
+demo/example_images/demo_dog.jpg filter=lfs diff=lfs merge=lfs -text
+demo/example_images/demo_helmet.png filter=lfs diff=lfs merge=lfs -text
+demo/example_images/demo_output.jpg filter=lfs diff=lfs merge=lfs -text
+demo/example_images/demo_person.jpg filter=lfs diff=lfs merge=lfs -text
+demo/example_images/demo_tomato.jpg filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,155 @@

<div align=center>
  <img src="assets/logo.png" width=300 >
</div>

# 🦖🧠 Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning 🦖🧠

<div align=center>

<p align="center">
  <a href="https://bagel-ai.org/">
    <img
      src="https://img.shields.io/badge/RexThinker-Website-Red?logo=afdian&logoColor=white&color=blue"
      alt="RexThinker Website"
    />
  </a>
  <a href="https://arxiv.org/abs/2505.14683">
    <img
      src="https://img.shields.io/badge/RexThinker-Paper-Red%25red?logo=arxiv&logoColor=red&color=yellow"
      alt="RexThinker Paper on arXiv"
    />
  </a>
  <a href="https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT">
    <img
      src="https://img.shields.io/badge/RexThinker-Weight-orange?logo=huggingface&logoColor=yellow"
      alt="RexThinker weight on Hugging Face"
    />
  </a>
  <a href="https://demo.bagel-ai.org/">
    <img
      src="https://img.shields.io/badge/RexThinker-Data-orange?logo=huggingface&logoColor=yellow"
      alt="RexThinker data on Hugging Face"
    />
  </a>
</p>

</div>

> We propose Rex-Thinker, a Chain-of-Thought (CoT) reasoning model for object referring that addresses two key challenges: lack of interpretability and inability to reject unmatched expressions. Instead of directly predicting bounding boxes, Rex-Thinker reasons step-by-step over candidate objects to determine which, if any, match a given expression. Rex-Thinker is trained in two stages: supervised fine-tuning to learn structured CoT reasoning, followed by reinforcement learning with GRPO to enhance accuracy, faithfulness, and generalization. Our approach improves both prediction precision and interpretability, while enabling the model to abstain when no suitable object is found. Below is an example of the model's reasoning process:

<p align="center"><img src="assets/teaser_example.jpg" width="95%"></p>

## Method

**Rex-Thinker** reformulates object referring as a **Chain-of-Thought (CoT)** reasoning task to improve both interpretability and reliability. The model follows a structured three-stage reasoning paradigm:

1. **Planning**: Decompose the referring expression into interpretable subgoals.

2. **Action**: Evaluate each candidate object (obtained via an open-vocabulary detector) against these subgoals using step-by-step reasoning.

3. **Summarization**: Aggregate the intermediate results to output the final prediction, or abstain when no object matches.

Each reasoning step is grounded in a specific candidate object region through **Box Hints**, making the process transparent and verifiable.

Rex-Thinker is implemented on top of **Qwen2.5-VL** and trained in two stages:

- **Supervised Fine-Tuning (SFT)**
  Cold-start training using GPT-4o-generated CoT traces as supervision.

- **GRPO-based Reinforcement Learning**
  Further optimizes reasoning accuracy, generalization, and rejection ability via Group Relative Policy Optimization.

This CoT-based framework enables Rex-Thinker to make faithful, interpretable predictions while generalizing well to out-of-domain referring scenarios.

<p align="center"><img src="assets/model.jpg" width="95%"></p>

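The demo scripts in this repo pair a fixed system prompt with a user question that passes the detector's candidate boxes to the model as a JSON Box Hint. Below is a minimal sketch of how that input is composed; the prompt strings mirror `demo/inference_single_image.py`, while the box coordinates and referring expression are placeholder values:

```python
import json

# System prompt used by the demo scripts (copied from demo/inference_single_image.py)
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think>"
    "<answer> answer here </answer>."
)

# Candidate boxes from the open-vocabulary detector, already converted to the
# Qwen-2.5-VL coordinate format. The values below are made-up placeholders.
cate_name = "helmet"
proposed_box = [[100, 200, 180, 260], [220, 210, 300, 270]]
referring_expression = "the fourth helmet from left"

# Box Hint plus question, formatted as in the demo scripts
hint = json.dumps({cate_name: proposed_box})
question = (
    f"Hint: Object and its coordinates in this image: {hint}\n"
    f"Please detect {referring_expression} in the image."
)
print(question)
```
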
## 1. Installation ⛳️

```bash
conda create -n rexthinker -y python=3.10
conda activate rexthinker
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install -v -e .

# additional package: Grounding DINO
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
## To support torch 2.6
git remote add quantumope https://github.com/QuantuMope/GroundingDINO.git
git fetch quantumope PR/andrew/add-torch26-support-ms-deform-attn
git merge quantumope/PR/andrew/add-torch26-support-ms-deform-attn
## Continue with installation
pip install -v -e .
mkdir weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -P weights
cd ..
```

+
### 1.1 Download Pre-trained Model
|
95 |
+
We provide the pre-trained model weights of Rex-Thinker-GRPO, which is trained on HumanRef-CoT through SFT and GRPO. You can download the model weights from [Hugging Face](https://huggingface.co/IDEA-Research/Rex-Thinker-GRPO-7B).
|
96 |
+
|
97 |
+
Or you can also using the following command to download the pre-trained models:
|
98 |
+
```bash
|
99 |
+
git lfs install
|
100 |
+
git clone https://huggingface.co/IDEA-Research/Rex-Thinker-GRPO-7B IDEA-Research/Rex-Thinker-GRPO-7B
|
101 |
+
```
|
102 |
+
|
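If you prefer the Python API, a rough equivalent using `huggingface_hub` (pulled in as a dependency of `transformers`; the local directory below is just an example) is:

```python
from huggingface_hub import snapshot_download

# Download the Rex-Thinker-GRPO-7B weights into a local folder (example path)
snapshot_download(
    repo_id="IDEA-Research/Rex-Thinker-GRPO-7B",
    local_dir="IDEA-Research/Rex-Thinker-GRPO-7B",
)
```
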
## 2. Inference 🚀
We provide a simple inference script to test the model. In this script, we use Grounding DINO to get the candidate boxes. You can run the following command to test the model:

```bash
CUDA_VISIBLE_DEVICES=0 python demo/inference_single_image.py \
    --image_path demo/example_images/demo_helmet.png \
    --cate_name helmet \
    --ref_exp "the forth helmet from left" \
    --vis_path vis/example_output.jpg
```

You will get output from the terminal like this:
```text
<think>OK, the user needs us to detect the fourth helmet from left. To accomplish this task, I need to break it down into the following steps:
- Step 1: Sort the helmets from left to right.
- Step 2: Find the fourth helmet from the sorted list.

# Step 1: Sort the helmets from left to right
I see 6 helmets in this image, and their order from left to right is [Helmet 5, Helmet 1, Helmet 3, Helmet 2, Helmet 4, Helmet 6].

# Step 2: Find the fourth helmet from the sorted list
From the sorted list [Helmet 5, Helmet 1, Helmet 3, Helmet 2, Helmet 4, Helmet 6], the fourth helmet from the left is Helmet 2.

# Summarize and Re-Check answer
Let’s now recheck our answer and put ✅ for the target helmet and ❌ for others
- Helmet 5: It is the first helmet from left → ❌
- Helmet 1: It is the second helmet from left → ❌
- Helmet 3: It is the third helmet from left → ❌
- Helmet 2: It is the fourth helmet from left → ✅
- Helmet 4: It is the fifth helmet from left → ❌
- Helmet 6: It is the sixth helmet from left → ❌</think><answer>json
[{"bbox_2d": [578, 359, 825, 580], "label": "the forth helmet from left"}]
</answer>
```

and visualized results like this:
<p align="center"><img src="demo/example_images/demo_output.jpg" width="80%"></p>

+
## 3. Gradio Demo 🤗
|
143 |
+
We provide a Gradio demo for you to test the model. You can run the following command to start the Gradio demo:
|
144 |
+
```bash
|
145 |
+
CUDA_VISIBLE_DEVICES=0 python demo/gradio_demo.py \
|
146 |
+
--model_path IDEA-Research/Rex-Thinker-GRPO-7B \
|
147 |
+
--server_ip 0.0.0.0 \
|
148 |
+
--server_port 7860
|
149 |
+
```
|
150 |
+
|
151 |
+
Then you can open your browser and visit `http://localhost:7860` to see the Gradio demo. You can input the image path, category name, and referring expression to test the model.
|
152 |
+
|
153 |
+
<p align="center"><img src="assets/gradio.jpg" width="95%"></p>
|
154 |
+
|
155 |
+
## Citation 📜
|
assets/data_engine.jpg
ADDED
Git LFS Details

assets/gradio.jpg
ADDED
Git LFS Details

assets/logo.png
ADDED
Git LFS Details

assets/model.jpg
ADDED
Git LFS Details

assets/teaser_example.jpg
ADDED
Git LFS Details

demo/example_images/demo_dog.jpg
ADDED
Git LFS Details

demo/example_images/demo_helmet.png
ADDED
Git LFS Details

demo/example_images/demo_letter.jpg
ADDED

demo/example_images/demo_output.jpg
ADDED
Git LFS Details

demo/example_images/demo_person.jpg
ADDED
Git LFS Details

demo/example_images/demo_tomato.jpg
ADDED
Git LFS Details

demo/gradio_demo.py
ADDED
@@ -0,0 +1,319 @@
import argparse
import json

import gradio as gr
import numpy as np
import torch
from groundingdino.util.inference import load_model
from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from tools.inference_tools import (
    convert_boxes_from_absolute_to_qwen25_format,
    inference_gdino,
    postprocess_and_vis_inference_out,
)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_path", type=str, default="IDEA-Research/Rex-Thinker-GRPO-7B"
    )
    parser.add_argument(
        "--gdino_config",
        type=str,
        default="GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    )
    parser.add_argument(
        "--gdino_weights",
        type=str,
        default="GroundingDINO/weights/groundingdino_swint_ogc.pth",
    )
    parser.add_argument(
        "--server_ip",
        type=str,
        default="0.0.0.0",
        help="IP address to bind the server to",
    )
    parser.add_argument(
        "--server_port",
        type=int,
        default=2512,
        help="Port to run the server on",
    )
    return parser.parse_args()


def initialize_models(args):
    # Load GDINO model
    gdino_model = load_model(args.gdino_config, args.gdino_weights).to("cuda")
    gdino_model.eval()

    # Load Rex-Thinker-GRPO
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        args.model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(
        args.model_path, min_pixels=16 * 28 * 28, max_pixels=1280 * 28 * 28
    )

    return (gdino_model, processor, model)


def process_image(
    image,
    system_prompt,
    cate_name,
    referring_expression,
    draw_width,
    font_size,
    gdino_model,
    rexthinker_processor,
    rexthinker_model,
):
    if isinstance(image, str):
        image = Image.open(image)
    elif isinstance(image, np.ndarray):
        image = Image.fromarray(image)

    # Run GDINO inference
    gdino_boxes = inference_gdino(
        image,
        [cate_name],
        gdino_model,
        TEXT_TRESHOLD=0.25,
        BOX_TRESHOLD=0.25,
    )
    proposed_box = convert_boxes_from_absolute_to_qwen25_format(
        gdino_boxes, image.width, image.height
    )

    hint = json.dumps(
        {
            f"{cate_name}": proposed_box,
        }
    )
    question = f"Hint: Object and its coordinates in this image: {hint}\nPlease detect {referring_expression} in the image."

    # compose input
    print(f"system_prompt: {system_prompt}")
    print(f"question: {question}")
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": question},
            ],
        },
    ]

    text = rexthinker_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = rexthinker_processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")
    input_height = inputs["image_grid_thw"][0][1] * 14
    input_width = inputs["image_grid_thw"][0][2] * 14

    # Inference: Generation of the output
    generated_ids = rexthinker_model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = rexthinker_processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    output_text = output_text[0]

    ref_vis_result, gdino_vis_result = postprocess_and_vis_inference_out(
        image,
        output_text,
        proposed_box,
        gdino_boxes,
        font_size=font_size,
        draw_width=draw_width,
        input_height=input_height,
        input_width=input_width,
    )

    return gdino_vis_result, ref_vis_result, output_text


def create_demo(models):
    (
        gdino_model,
        rexthinker_processor,
        rexthinker_model,
    ) = models

    with gr.Blocks() as demo:
        gr.Markdown("# Rex-Thinker Demo")

        with gr.Row():
            with gr.Column():
                input_image = gr.Image(label="Input Image", type="pil")
                gdino_prompt = gr.Textbox(
                    label="Object Category Name to get Candidate boxes",
                    placeholder="person",
                    value="person",
                )
                referring_prompt = gr.Textbox(
                    label="Referring Expression",
                    placeholder="person wearning red shirt and a black hat",
                    value="person wearning red shirt and a black hat",
                )
                system_prompt = gr.Textbox(
                    label="System Prompt",
                    value="A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.",
                )
                with gr.Row():
                    draw_width = gr.Slider(
                        minimum=5.0,
                        maximum=100.0,
                        value=10.0,
                        step=1,
                        label="Draw Width for Visualization",
                    )
                    font_size = gr.Slider(
                        minimum=5.0,
                        maximum=100.0,
                        value=20.0,
                        step=1,
                        label="Font size for Visualization",
                    )
                run_button = gr.Button("Run")

            with gr.Column():
                gdino_output = gr.Image(label="GroundingDINO Detection")
                final_output = gr.Image(label="Rex-Thinker Visualization")
            with gr.Column():
                llm_output = gr.Textbox(
                    label="LLM Raw Output", interactive=False, lines=50, max_lines=100
                )

        # Add examples section
        gr.Markdown("## Examples")
        examples = gr.Examples(
            examples=[
                [
                    "demo/example_images/demo_tomato.jpg",
                    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.",
                    "tomato",
                    "ripe tomato",
                    10,
                    20,
                ],
                [
                    "demo/example_images/demo_helmet.png",
                    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.",
                    "helmet",
                    "the forth helmet from left",
                    10,
                    20,
                ],
                [
                    "demo/example_images/demo_person.jpg",
                    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.",
                    "person",
                    "person in the red car but not driving",
                    10,
                    20,
                ],
                [
                    "demo/example_images/demo_letter.jpg",
                    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.",
                    "person",
                    "person wearing cloth that has two letters",
                    10,
                    20,
                ],
                [
                    "demo/example_images/demo_dog.jpg",
                    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.",
                    "dog",
                    "the dog sleep on the bed with a pot under its body",
                    10,
                    20,
                ],
            ],
            inputs=[
                input_image,
                system_prompt,
                gdino_prompt,
                referring_prompt,
                draw_width,
                font_size,
            ],
            outputs=[gdino_output, final_output, llm_output],
            fn=lambda img, sys, p1, p2, d, f: process_image(
                img,
                sys,
                p1,
                p2,
                d,
                f,
                gdino_model,
                rexthinker_processor,
                rexthinker_model,
            ),
            cache_examples=False,
        )

        run_button.click(
            fn=lambda img, sys, p1, p2, d, f: process_image(
                img,
                sys,
                p1,
                p2,
                d,
                f,
                gdino_model,
                rexthinker_processor,
                rexthinker_model,
            ),
            inputs=[
                input_image,
                system_prompt,
                gdino_prompt,
                referring_prompt,
                draw_width,
                font_size,
            ],
            outputs=[gdino_output, final_output, llm_output],
        )

    return demo


def main():
    args = parse_args()
    models = initialize_models(args)
    demo = create_demo(models)
    demo.launch(server_name=args.server_ip, server_port=args.server_port, share=True)


if __name__ == "__main__":
    main()
demo/inference_single_image.py
ADDED
@@ -0,0 +1,197 @@
import argparse
import json
import os

import torch
from groundingdino.util.inference import load_model
from PIL import Image, ImageDraw, ImageFont
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from tools.inference_tools import (
    convert_boxes_from_absolute_to_qwen25_format,
    inference_gdino,
    postprocess_and_vis_inference_out,
)

SYSTEM_PROMPT = "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>."


def get_args():
    parser = argparse.ArgumentParser(description="Inference script for Qwen-2.5-VL")
    parser.add_argument(
        "--image_path",
        type=str,
        default="demo/example_images/demo_helmet.png",
        help="Path to the input image",
    )
    parser.add_argument(
        "--cate_name",
        type=str,
        default="helmet",
        help='text prompt for grounding dino, e.g. "cat", "dog", "car"',
    )
    parser.add_argument(
        "--ref_exp",
        type=str,
        default="the forth helmet from left",
        help="Reference expression for Rex-Thinker, e.g. 'the cat on the left'",
    )
    parser.add_argument(
        "--vis_path",
        type=str,
        default="vis/example_output.jpg",
        help="Path to save the visualization result",
    )
    parser.add_argument(
        "--model_path",
        type=str,
        default="IDEA-Research/Rex-Thinker-GRPO-7B",
        help="Path to the input image",
    )
    parser.add_argument(
        "--gdino_config",
        type=str,
        default="GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
        help="Path to Grounding DINO config",
    )
    parser.add_argument(
        "--gdino_weights",
        type=str,
        default="GroundingDINO/weights/groundingdino_swint_ogc.pth",
        help="Path to Grounding DINO weights",
    )
    parser.add_argument(
        "--qwen_model_path",
        type=str,
        default="IDEA-Research/Rex-Thinker-GRPO-7B",
        help="Path to Qwen-2.5-VL model or model identifier from Hugging Face Hub",
    )

    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    image_path = args.image_path
    cate_name = args.cate_name
    ref_exp = args.ref_exp
    gdino_config = args.gdino_config
    gdino_weights = args.gdino_weights
    qwen_model_path = args.qwen_model_path

    # Load the Grounding DINO model
    gdino_model = load_model(gdino_config, gdino_weights)
    gdino_model.eval()

    # Load Rex-Thinker-GRPO
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        args.model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(
        args.model_path, min_pixels=16 * 28 * 28, max_pixels=1280 * 28 * 28
    )

    # Load the image
    image = Image.open(image_path).convert("RGB")

    # Prepare the text prompts for Grounding DINO
    prompts = [cate_name]

    # Run inference with Grounding DINO to get box hint
    gdino_boxes = inference_gdino(image, prompts, gdino_model)

    proposed_box = convert_boxes_from_absolute_to_qwen25_format(
        gdino_boxes, image.width, image.height
    )
    hint = json.dumps(
        {
            f"{cate_name}": proposed_box,
        }
    )
    question = f"Hint: Object and its coordinates in this image: {hint}\nPlease detect {ref_exp} in the image."

    # compose input
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": question},
            ],
        },
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")
    input_height = inputs["image_grid_thw"][0][1] * 14
    input_width = inputs["image_grid_thw"][0][2] * 14

    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    output_text = output_text[0]
    print(output_text)

    ref_vis_result, gdino_vis_result = postprocess_and_vis_inference_out(
        image,
        output_text,
        proposed_box,
        gdino_boxes,
        font_size=20,
        draw_width=10,
        input_height=input_height,
        input_width=input_width,
    )

    # Create a new image with white background for the combined result
    combined_width = gdino_vis_result.width + ref_vis_result.width
    combined_height = max(gdino_vis_result.height, ref_vis_result.height)
    combined_image = Image.new("RGB", (combined_width, combined_height), "white")

    # Paste the images side by side
    combined_image.paste(gdino_vis_result, (0, 0))
    combined_image.paste(ref_vis_result, (gdino_vis_result.width, 0))

    # Add titles
    draw = ImageDraw.Draw(combined_image)
    font = ImageFont.truetype("tools/Tahoma.ttf", 30)

    # Add Grounding DINO title
    draw.text((10, 10), "Grounding DINO Output", fill="black", font=font)

    # Add Rex-Thinker title
    draw.text(
        (gdino_vis_result.width + 10, 10), "Rex-Thinker Output", fill="black", font=font
    )

    # Save the combined visualization result
    os.makedirs(os.path.dirname(args.vis_path), exist_ok=True)
    combined_image.save(args.vis_path)
requirements.txt
ADDED
@@ -0,0 +1,24 @@
accelerate
codetiming
datasets
flash-attn>=2.4.3
liger-kernel
mathruler
numpy
omegaconf
pandas
peft
pillow
pyarrow>=15.0.0
pylatexenc
qwen-vl-utils
ray[default]
tensordict
torchdata
transformers==4.51.3
vllm==0.8.2
wandb
tensorboard
gradio==4.44.1
pydantic==2.10.6
tabulate