nielsr HF Staff committed on
Commit 4aa7742 · verified · 1 Parent(s): 8d633b2

Add pipeline tag and include Github README content

This PR adds `pipeline_tag: image-text-to-text` to the model card metadata. It also adds the GitHub README content to the model card, improving the documentation and discoverability of the model.

Files changed (1)
  1. README.md +162 -8
README.md CHANGED
@@ -1,19 +1,173 @@
  ---
  library_name: transformers
  tags:
  - image
  - scene-graph
  - scene-graph-generation
- license: apache-2.0
- datasets:
- - JosephZ/vg150_train_sgg_prompt
- metrics:
- - recall
- base_model:
- - Qwen/Qwen2-VL-7B-Instruct
  ---

  # Model Description

  <!-- Provide a quick summary of what the model is/does. -->
- An end-to-end multimodal LLM for Scene Graph Generation (SGG), which was introduced in [Compile Scene Graphs with Reinforcement Learning](https://huggingface.co/papers/2504.13617
  ---
+ base_model:
+ - Qwen/Qwen2-VL-7B-Instruct
+ datasets:
+ - JosephZ/vg150_train_sgg_prompt
  library_name: transformers
+ license: apache-2.0
+ metrics:
+ - recall
  tags:
  - image
  - scene-graph
  - scene-graph-generation
+ pipeline_tag: image-text-to-text
  ---

  # Model Description

  <!-- Provide a quick summary of what the model is/does. -->
+ An end-to-end multimodal LLM for Scene Graph Generation (SGG), which was introduced in [Compile Scene Graphs with Reinforcement Learning](https://huggingface.co/papers/2504.13617)
+
+ # R1-SGG: Compile Scene Graphs with Reinforcement Learning
+
+ ## **Structured Visual Reasoning with Multimodal LLMs and Reinforcement Learning**
+ [![Paper](https://img.shields.io/badge/arXiv-2504.13617-b31b1b.svg)](https://arxiv.org/abs/2504.13617) [![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE) [![Hugging Face](https://img.shields.io/badge/HuggingFace-Demo-orange?logo=huggingface)](https://huggingface.co/spaces/JosephZ/R1-SGG)
+ ---
+
+ ## 🚀 Update
+ - ✅ ![Hugging Face](https://img.shields.io/badge/HuggingFace-Model-orange?logo=huggingface)[R1-SGG-7B](https://huggingface.co/JosephZ/R1-SGG-7B), [R1-SGG-Zero-7B](https://huggingface.co/JosephZ/R1-SGG-Zero-7B)
+ - ✅ Support [PSG](https://github.com/Jingkang50/OpenPSG) dataset (bbox format only, not Panoptic)
+ - ✅ Updated loss implementation
+ - ✅ Always use `custom_per_device_train_batch_size` instead of `per_device_train_batch_size` for faster sampling under gradient accumulation
+ - ⚠️ Current loss implementation might still be affected by gradient accumulation: [trl issue #3021](https://github.com/huggingface/trl/issues/3021)
+
+ ---
+
+ ## 🛠️ Setup Environment
+ ```bash
+ bash install.sh
+ ```
+ Main dependencies:
+ ```bash
+ - torch == 2.5.0 or 2.5.1 (cu124, optional)
+ - transformers (supports Qwen2VL, Qwen2.5VL)
+ - trl
+ - vLLM
+ ```
+
+ ---
+
+ ## 📚 Dataset
+ Load preprocessed datasets via:
+ ```python
+ from datasets import load_dataset
+
+ db_train = load_dataset("JosephZ/vg150_train_sgg_prompt")["train"]
+ db_val = load_dataset("JosephZ/vg150_val_sgg_prompt")["train"]
+ ```
+ or for PSG:
+ ```python
+ db_train = load_dataset("JosephZ/psg_train_sg")["train"] # keys: image_id, image, objects, relationships
+ db_val = load_dataset("JosephZ/psg_test_sg")["train"]
+ ```
+ We transformed VG150 into HuggingFace Datasets format with keys:
+ - `image_id`
+ - `image`
+ - `prompt_open`
+ - `prompt_close`
+ - `objects`
+ - `relationships`
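As a hedged sketch, the key list above can be turned into a quick schema check on a loaded record. `EXPECTED_KEYS` is copied from the list above; the helper `missing_keys` and the toy record are hypothetical illustrations and are not part of the repository.

```python
# Keys the README lists for the VG150 prompt datasets.
EXPECTED_KEYS = (
    "image_id",
    "image",
    "prompt_open",
    "prompt_close",
    "objects",
    "relationships",
)


def missing_keys(record, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from a record (hypothetical helper)."""
    return [key for key in expected if key not in record]


# Toy record standing in for a dataset row; the values are placeholders, not real data.
toy_record = {"image_id": "0", "image": None, "objects": "[]", "relationships": "[]"}
print(missing_keys(toy_record))  # ['prompt_open', 'prompt_close']
```

With a real split, `missing_keys(db_train[0])` would be expected to return an empty list if the schema matches the list above. The toy record mirrors the four keys noted in the PSG comment, which is why the two prompt keys are reported as missing.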
+
+ ---
+
+ ## 🔥 Supported Models
+ - [x] Qwen/Qwen2-VL-2B-Instruct
+ - [x] Qwen/Qwen2-VL-7B-Instruct
+ - [x] Qwen/Qwen2.5-VL-3B-Instruct
+ - [x] Qwen/Qwen2.5-VL-7B-Instruct
+
+ ---
+
+ ## 🏋️‍♂️ Training
+
+ ### Training with Supervised Fine-Tuning (SFT)
+
+ For **SLURM users**:
+ ```bash
+ sbatch scripts/sft/7B_sgg.sh
+ ```
+
+ For **local machines**:
+ ```bash
+ bash scripts/sft_local/7B_sgg.sh
+ ```
+ ⏱️ Approximate training time:
+ - 2B models: ~4 hours (4×A100 SXM4 GPUs)
+ - 7B models: ~10 hours (4×A100 SXM4 GPUs)
+
+ ---
+
+ ### Training with Reinforcement Learning (GRPO)
+ **Update (11/05/2025): to use "Hard Recall"**:
+ ```
+ --reward_funcs format_reward edge_hard_reward
+ ```
+
+ For **A100 GPUs**:
+ ```bash
+ sbatch scripts/grpo/train_a100_2B.sh
+ ```
+ (12 hours on 16×A100 GPUs)
+
+ For **GH200 GPUs**:
+ ```bash
+ sbatch scripts/grpo/train_gh200.sh
+ ```
+ (16 hours on 16×GH200 GPUs)
+
+ For clusters with many RTX_3090/4090 GPUs:
+ ```bash
+ sbatch scripts/grpo/train_fused.sh
+ ```
+ - Training 7B models on 24GB cards is possible with ZeRO-3, but slow due to communication bottlenecks.
+ - (Fun fact: training with 120×RTX_4090 is crazy but severely limited by communication latency.)
+
+ 💡 **Recommended learning rate**: `6e-7`.
+
+ ---
+
+ ## 🧪 Inference and Evaluation
+
+ ### Inference with SFT-trained models:
+ ```bash
+ bash scripts/inference/run_sgg_inference.sh $DATASET $MODEL_NAME $OUTPUT_DIR
+ ```
+ For models trained **with predefined categories**, add `true`:
+ ```bash
+ bash scripts/inference/run_sgg_inference.sh $DATASET $MODEL_NAME $OUTPUT_DIR true
+ ```
+
+ ### Inference with GRPO-trained models:
+ ```bash
+ bash scripts/inference/run_sgg_inference.sh $DATASET $MODEL_NAME $OUTPUT_DIR false/true true
+ ```
+
+ ### Evaluation:
+ ```bash
+ DATASET_TYPE=vg # or psg
+ python src/sgg_gather_preds.py $DATASET_TYPE $OUTPUT_DIR sgg_pred_results.json
+ python src/vg150_eval.py $DATASET sgg_pred_results.json
+ ```
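The model card metadata lists `recall` as the metric. As a rough illustration only, and not a description of what `src/vg150_eval.py` computes, recall over (subject, predicate, object) label triplets might be sketched as below. This simplification matches on labels alone and skips the bounding-box overlap check that scene graph evaluation normally includes; the function name and toy triplets are hypothetical.

```python
def triplet_recall(predicted, ground_truth):
    """Fraction of ground-truth (subject, predicate, object) triplets found in the predictions.

    Simplifying assumption: triplets match on labels only, with no bounding-box check.
    """
    gt = set(ground_truth)
    if not gt:
        return 0.0
    return len(gt & set(predicted)) / len(gt)


# Toy triplets for illustration; not taken from any dataset.
gt = [("man", "riding", "horse"), ("horse", "on", "grass")]
pred = [("man", "riding", "horse"), ("man", "wearing", "hat")]
print(triplet_recall(pred, gt))  # 0.5
```

One of the two ground-truth triplets appears in the predictions, so the toy recall is 0.5; extra predictions do not lower it.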
+
+ ---
+
+ ## 🤝 Acknowledgement
+ The `GRPOTrainer` used in this project is based on [trl's GRPOTrainer](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py), extended to support multimodal inputs.
+
+ ---
+
+ ## 📖 Citation
+ If you find this work helpful, please cite:
+ ```bibtex
+ @article{chen2025compile,
+ title={Compile Scene Graphs with Reinforcement Learning},
+ author={Chen, Zuyao and Wu, Jinlin and Lei, Zhen and Pollefeys, Marc and Chen, Chang Wen},
+ journal={arXiv preprint arXiv:2504.13617},
+ year={2025}
+ }
+ ```
+
+ ---
+
+ # ✨ Happy Compiling!