Add pipeline tag and library name, include information from github README

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +129 -2
README.md CHANGED
@@ -1,7 +1,9 @@
  ---
- license: apache-2.0
  base_model:
  - lmms-lab/LLaVA-Video-7B-Qwen2
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
  ---

  <a href='https://arxiv.org/abs/2504.12083'><img src='https://img.shields.io/badge/arXiv-paper-red'></a>
@@ -18,7 +20,19 @@ Clone the repository and navigate to the RRPO directory:
  ```sh
  git clone https://github.com/pritamqu/RRPO
  cd RRPO
+ ```
+
+ This repository supports three Large Video Language Models (LVLMs), each with its own dependency requirements:
+
+ - **[VideoChat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2)**: `videochat2.txt`
+ - **[LLaVA-Video](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/README.md)**: `llavavideo.txt`
+ - **[LongVU](https://github.com/Vision-CAIR/LongVU)**: `longvu.txt`
+
+ #### Example: Setting up LLaVA-Video
+
+ Follow similar steps for other models.

+ ```sh
  conda create -n llava python=3.10 -y
  conda activate llava
  pip install -r llavavideo.txt
@@ -44,4 +58,117 @@ python inference.py \
  --video_path "sample_video.mp4" \
  --question "Describe this video." \
  --model_max_length 1024
- ```
+ ```
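When you have more than one video or question, the quick-start command above can be scripted. The sketch below is illustrative rather than part of the repository: it assumes `inference.py` accepts exactly the flags shown above, and any extra checkpoint arguments your setup needs are left as a hypothetical placeholder.

```python
# Illustrative only: batch the quick-start command above over several prompts.
# Assumes inference.py accepts the flags shown in the README; pass any extra
# model/checkpoint arguments your setup needs via EXTRA_ARGS.
import subprocess

EXTRA_ARGS = []  # e.g., ["--model_path", "/path/to/checkpoint"] (hypothetical)

videos_and_questions = [
    ("sample_video.mp4", "Describe this video."),
    ("sample_video.mp4", "What happens at the end of the video?"),
]

for video, question in videos_and_questions:
    cmd = [
        "python", "inference.py",
        "--video_path", video,
        "--question", question,
        "--model_max_length", "1024",
        *EXTRA_ARGS,
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```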
+
+ ## Dataset
+
+ Our training data is released here: [Self-Alignment Dataset](https://huggingface.co/datasets/pritamqu/self-alignment). We release the preferred and non-preferred responses used in self-alignment training.
+ ```
+ git clone git@hf.co:datasets/pritamqu/self-alignment
+ ```
+ The related videos can be downloaded from their original sources. Please check the [VideoChat2-IT](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) GitHub page for details on downloading the source videos.
+
+ We also share additional details on how to use your own data [here](docs/DATA.md).
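If you prefer not to clone over SSH, the same dataset files can be inspected and fetched with the `huggingface_hub` client. This is a minimal sketch, not part of the repository's tooling, and it makes no assumption about the dataset's internal file layout.

```python
# Illustrative alternative to the git clone above: inspect and fetch the
# Self-Alignment Dataset with the huggingface_hub client.
from huggingface_hub import list_repo_files, snapshot_download

repo_id = "pritamqu/self-alignment"

# List what the dataset repository contains before downloading anything.
for path in list_repo_files(repo_id, repo_type="dataset"):
    print(path)

# Download the full dataset snapshot to a local cache directory.
local_dir = snapshot_download(repo_id, repo_type="dataset")
print("Dataset downloaded to:", local_dir)
```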
+
+ ## Training
+
+ Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as follows:
+
+ VideoChat2
+ ```
+ bash scripts/videochat2/run.sh
+ ```
+ LLaVA-Video
+ ```
+ bash scripts/llavavideo/run.sh
+ ```
+ LongVU
+ ```
+ bash scripts/longvu/run.sh
+ ```
+ The links to the base model weights are:
+ - [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B)
+ - [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2)
+ - [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B)
+
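The base checkpoints linked above can be pre-downloaded with `huggingface_hub` so the training scripts find them locally. This is a sketch only; the files land in the Hub cache by default, and your scripts should be pointed at those paths or at directories of your choosing.

```python
# Illustrative pre-download of the base model weights listed above.
# Where you store them is your choice; the training scripts only need
# to be pointed at the resulting paths.
from huggingface_hub import snapshot_download

base_models = [
    "OpenGVLab/VideoChat2_stage3_Mistral_7B",
    "lmms-lab/LLaVA-Video-7B-Qwen2",
    "Vision-CAIR/LongVU_Qwen2_7B",
]

for repo_id in base_models:
    path = snapshot_download(repo_id)
    print(f"{repo_id} -> {path}")
```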
+ ## Inference
+
+ We provide a simple setup for running inference with our trained models.
+
+ **VideoChat2**
+ ```
+ bash scripts/inference_videochat2.sh
+ ```
+
+ **LLaVA-Video**
+ ```
+ bash scripts/inference_llavavideo.sh
+ ```
+
+ **LongVU**
+ ```
+ bash scripts/inference_longvu.sh
+ ```
+
+ ## Results
+
+ **RRPO shows consistent improvements over the base model and outperforms DPO across all benchmarks.**
+
+ | **Models** | **#Frames** | **TVBench** | **TempCompass** | **VideoHallucer** | **VidHalluc** | **MVBench** | **VideoMME** | **MLVU** | **LongVideoBench** |
+ |------------|------|-------------|----------------|----------------|-------------|-------------|-------------|--------|------------------|
+ | VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | **60.2** | 41.0 | 46.4 | 40.4 |
+ | VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
+ | VideoChat2 + **RRPO** | 16 | **45.8** | **60.2** | **32.9** | **76.4** | 59.0 | **44.3** | **47.9** | **42.8** |
+ | | | | | | | | | | |
+ | LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
+ | LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
+ | LLaVA-Video + **RRPO** | 64 | 51.9 | 66.8 | 55.7 | 76.5 | **62.2** | **64.5** | 69.1 | **60.4** |
+ | LLaVA-Video + **RRPO** (32f) | 64 | **52.2** | **67.4** | **55.8** | **76.6** | 62.1 | **64.5** | **69.4** | 60.1 |
+ | | | | | | | | | | |
+ | LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
+ | LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
+ | LongVU + **RRPO** | 1fps | **56.5** | **64.5** | **44.0** | **71.7** | **66.8** | **57.7** | **64.5** | **49.7** |
+
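As a quick check on the headline claim, the per-benchmark gains can be read directly off the table; the snippet below simply recomputes the LongVU deltas from the numbers above (illustrative arithmetic, not output from the paper's codebase).

```python
# Per-benchmark deltas for LongVU, transcribed from the table above.
benchmarks = ["TVBench", "TempCompass", "VideoHallucer", "VidHalluc",
              "MVBench", "VideoMME", "MLVU", "LongVideoBench"]
longvu_base = [53.7, 63.9, 39.2, 67.3, 65.5, 56.2, 63.6, 48.6]
longvu_rrpo = [56.5, 64.5, 44.0, 71.7, 66.8, 57.7, 64.5, 49.7]

for name, base, rrpo in zip(benchmarks, longvu_base, longvu_rrpo):
    print(f"{name}: {base} -> {rrpo} ({rrpo - base:+.1f})")
```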
+ ## Evaluation
+
+ You can download the evaluation benchmarks from the following links:
+
+ - [TVBench](https://huggingface.co/datasets/FunAILab/TVBench)
+ - [TempCompass](https://huggingface.co/datasets/lmms-lab/TempCompass)
+ - [VideoHallucer](https://huggingface.co/datasets/bigai-nlco/VideoHallucer)
+ - [VidHalluc](https://huggingface.co/datasets/chaoyuli/VidHalluc)
+ - [MVBench](https://huggingface.co/datasets/PKU-Alignment/MVBench)
+ - [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME)
+ - [MLVU](https://huggingface.co/datasets/MLVU/MVLU)
+ - [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench)
+
+ Next, you can run the full evaluation following the instructions provided [here](./docs/EVALUATION.md).
+
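Since the benchmarks above are hosted as Hugging Face datasets, they can be bulk-downloaded the same way as the base checkpoints earlier. This is a sketch with an arbitrary `benchmarks/<name>` layout that the evaluation docs do not prescribe; some repositories may require accepting their terms on the Hub first.

```python
# Illustrative bulk download of the evaluation benchmarks listed above.
# The ./benchmarks/<name> layout is an arbitrary choice for this sketch.
from huggingface_hub import snapshot_download

benchmarks = {
    "TVBench": "FunAILab/TVBench",
    "TempCompass": "lmms-lab/TempCompass",
    "VideoHallucer": "bigai-nlco/VideoHallucer",
    "VidHalluc": "chaoyuli/VidHalluc",
    "MVBench": "PKU-Alignment/MVBench",
    "VideoMME": "lmms-lab/Video-MME",
    "MLVU": "MLVU/MVLU",
    "LongVideoBench": "longvideobench/LongVideoBench",
}

for name, repo_id in benchmarks.items():
    snapshot_download(repo_id, repo_type="dataset", local_dir=f"benchmarks/{name}")
    print(f"Downloaded {name} to benchmarks/{name}")
```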
+ ## Citation
+
+ If you find this work useful, please consider citing our paper:
+
+ ```
+ @article{sarkar2025rrpo,
+   title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
+   author={Sarkar, Pritam and others},
+   journal={arXiv preprint arXiv:2504.12083},
+   year={2025}
+ }
+ ```
+
+ ## Usage and License Notices
+
+ This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses.
+ The assets used in this work include, but are not limited to:
+ [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT),
+ [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B),
+ [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2),
+ [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations.
+ This repository is released under the **Apache 2.0 License**. See [LICENSE](LICENSE) for details.
+
+ ---
+ For any issues or questions, please open an issue or contact **Pritam Sarkar** at [email protected]!