nielsr (HF Staff) committed (verified)
Commit eee297d · 1 Parent(s): 94ac5e3

Add pipeline tag and library name, include information from github README


This PR adds the `pipeline_tag` and `library_name` to the model card metadata.
The `pipeline_tag` is set to `video-text-to-text`, reflecting the model's functionality, and the `library_name` is set to `transformers`, based on the model's compatibility with that library.
It also includes additional information from the GitHub README, such as dataset info, training, results, evaluation, citation, and usage.
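For context (not part of this PR's diff), these fields are what the Hub uses for filtering; below is a minimal sketch of querying the public Hub API for models matching them. The query parameters and the `limit` value are assumptions about the standard API, not taken from this repository:

```sh
# List a few Hub models tagged video-text-to-text that declare the transformers library.
curl -s "https://huggingface.co/api/models?pipeline_tag=video-text-to-text&library=transformers&limit=5"
```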

Files changed (1)
  1. README.md +129 -2
README.md CHANGED
@@ -1,7 +1,9 @@
  ---
- license: apache-2.0
  base_model:
  - lmms-lab/LLaVA-Video-7B-Qwen2
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
  ---

  <a href='https://arxiv.org/abs/2504.12083'><img src='https://img.shields.io/badge/arXiv-paper-red'></a>
@@ -18,7 +20,19 @@ Clone the repository and navigate to the RRPO directory:
  ```sh
  git clone https://github.com/pritamqu/RRPO
  cd RRPO
+ ```
+
+ This repository supports three Large Video Language Models (LVLMs), each with its own dependency requirements:
+
+ - **[VideoChat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2)**: `videochat2.txt`
+ - **[LLaVA-Video](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/README.md)**: `llavavideo.txt`
+ - **[LongVU](https://github.com/Vision-CAIR/LongVU)**: `longvu.txt`
+
+ #### Example: Setting up LLaVA-Video
+
+ Follow similar steps for other models.

+ ```sh
  conda create -n llava python=3.10 -y
  conda activate llava
  pip install -r llavavideo.txt
@@ -44,4 +58,117 @@ python inference.py \
  --video_path "sample_video.mp4" \
  --question "Describe this video." \
  --model_max_length 1024
- ```
+ ```
+
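For convenience, here is a minimal sketch of looping the quick-inference command above over a folder of clips. It reuses only the `inference.py` flags visible in the diff; the `videos/` directory, and any flags cropped out of the hunk, are assumptions and not part of the committed README:

```sh
# Hypothetical batch wrapper around the quick-inference example above.
# Add any inference.py flags not visible in the diff (e.g., a model path) as required.
for f in videos/*.mp4; do
  python inference.py \
      --video_path "$f" \
      --question "Describe this video." \
      --model_max_length 1024
done
```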
+ ## Dataset
+
+ Our training data is released here: [Self-Alignment Dataset](https://huggingface.co/datasets/pritamqu/self-alignment). We release the preferred and non-preferred responses used in self-alignment training.
+ ```
+ git clone [email protected]:datasets/pritamqu/self-alignment
+ ```
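If SSH access to the Hub is not configured, the same dataset can be cloned over HTTPS; this is an illustrative sketch (assuming `git-lfs` is installed), not part of the committed README:

```sh
# HTTPS alternative to the SSH clone above; requires git-lfs for the large files.
git lfs install
git clone https://huggingface.co/datasets/pritamqu/self-alignment
```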
+ The related videos can be downloaded from their original sources. Please check the [VideoChat2-IT](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) GitHub page for details on downloading the source videos.
+
+ We also share additional details on how to use your own data [here](docs/DATA.md).
+
+ ## Training
+
+ Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as follows:
+
+ VideoChat2
+ ```
+ bash scripts/videochat2/run.sh
+ ```
+ LLaVA-Video
+ ```
+ bash scripts/llavavideo/run.sh
+ ```
+ LongVU
+ ```
+ bash scripts/longvu/run.sh
+ ```
+ The links to the base model weights are listed below (a sample download command follows the list):
+ - [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B)
+ - [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2)
+ - [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B)
+
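One way to fetch these checkpoints is the Hugging Face CLI; the sketch below is illustrative (the `checkpoints/` target paths are arbitrary) and is not part of the committed README:

```sh
# Download the base checkpoints locally; target directories are arbitrary.
huggingface-cli download OpenGVLab/VideoChat2_stage3_Mistral_7B --local-dir checkpoints/VideoChat2_stage3_Mistral_7B
huggingface-cli download lmms-lab/LLaVA-Video-7B-Qwen2 --local-dir checkpoints/LLaVA-Video-7B-Qwen2
huggingface-cli download Vision-CAIR/LongVU_Qwen2_7B --local-dir checkpoints/LongVU_Qwen2_7B
```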
+ ## Inference
+
+ We provide a simple setup for running inference with our trained models.
+
+ **VideoChat2**
+ ```
+ bash scripts/inference_videochat2.sh
+ ```
+
+ **LLaVA-Video**
+ ```
+ bash scripts/inference_llavavideo.sh
+ ```
+
+ **LongVU**
+ ```
+ bash scripts/inference_longvu.sh
+ ```
+
+ ## Results
+
+ **RRPO shows consistent improvements over the base models and outperforms DPO across all benchmarks.**
+
+ | **Models** | **#Frames** | **TVBench** | **TempCompass** | **VideoHallucer** | **VidHalluc** | **MVBench** | **VideoMME** | **MLVU** | **LongVideoBench** |
+ |------------|-------------|-------------|-----------------|-------------------|---------------|-------------|--------------|----------|---------------------|
+ | VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | **60.2** | 41.0 | 46.4 | 40.4 |
+ | VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
+ | VideoChat2 + **RRPO** | 16 | **45.8** | **60.2** | **32.9** | **76.4** | 59.0 | **44.3** | **47.9** | **42.8** |
+ | | | | | | | | | | |
+ | LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
+ | LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
+ | LLaVA-Video + **RRPO** | 64 | 51.9 | 66.8 | 55.7 | 76.5 | **62.2** | **64.5** | 69.1 | **60.4** |
+ | LLaVA-Video + **RRPO** (32f) | 64 | **52.2** | **67.4** | **55.8** | **76.6** | 62.1 | **64.5** | **69.4** | 60.1 |
+ | | | | | | | | | | |
+ | LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
+ | LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
+ | LongVU + **RRPO** | 1fps | **56.5** | **64.5** | **44.0** | **71.7** | **66.8** | **57.7** | **64.5** | **49.7** |
+
+ ## Evaluation
+
+ You can download the evaluation benchmarks from the links below (a sample download command follows the list):
+
+ - [TVBench](https://huggingface.co/datasets/FunAILab/TVBench)
+ - [TempCompass](https://huggingface.co/datasets/lmms-lab/TempCompass)
+ - [VideoHallucer](https://huggingface.co/datasets/bigai-nlco/VideoHallucer)
+ - [VidHalluc](https://huggingface.co/datasets/chaoyuli/VidHalluc)
+ - [MVBench](https://huggingface.co/datasets/PKU-Alignment/MVBench)
+ - [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME)
+ - [MLVU](https://huggingface.co/datasets/MLVU/MVLU)
+ - [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench)
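As with the base checkpoints, each benchmark can be pulled with the Hugging Face CLI; the command below is an illustrative sketch (the local path is arbitrary, and some benchmarks may require accepting access terms on the Hub first), not part of the committed README:

```sh
# Example for one benchmark; repeat with the other dataset repos as needed.
huggingface-cli download FunAILab/TVBench --repo-type dataset --local-dir benchmarks/TVBench
```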
+
+ Next, you can run the full evaluation following the instructions provided [here](./docs/EVALUATION.md).
+
+ ## Citation
+
+ If you find this work useful, please consider citing our paper:
+
+ ```
+ @article{sarkar2025rrpo,
+     title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
+     author={Sarkar, Pritam and others},
+     journal={arXiv preprint arXiv:2504.12083},
+     year={2025}
+ }
+ ```
+
+ ## Usage and License Notices
+
+ This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses.
+ The assets used in this work include, but are not limited to:
+ [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT),
+ [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B),
+ [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2),
+ [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations.
+ This repository is released under the **Apache 2.0 License**. See [LICENSE](LICENSE) for details.
+
+ ---
+ For any issues or questions, please open an issue or contact **Pritam Sarkar** at [email protected]!