Add pipeline tag and library name, include information from GitHub README

This PR adds the `pipeline_tag` and `library_name` to the model card metadata.
The `pipeline_tag` is set to `video-text-to-text`, reflecting the model's functionality. The `library_name` is set to `transformers`, based on the model's compatibility.
It also includes additional information from the GitHub README, such as dataset details, training, results, evaluation, citation, and usage.

README.md
CHANGED

@@ -1,7 +1,9 @@
 ---
-license: apache-2.0
 base_model:
 - lmms-lab/LLaVA-Video-7B-Qwen2
+license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 ---

<a href='https://arxiv.org/abs/2504.12083'><img src='https://img.shields.io/badge/arXiv-paper-red'></a>

Clone the repository and navigate to the RRPO directory:

```sh
git clone https://github.com/pritamqu/RRPO
cd RRPO
```

This repository supports three Large Video Language Models (LVLMs), each with its own dependency requirements:

- **[VideoChat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2)**: `videochat2.txt`
- **[LLaVA-Video](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/README.md)**: `llavavideo.txt`
- **[LongVU](https://github.com/Vision-CAIR/LongVU)**: `longvu.txt`

#### Example: Setting up LLaVA-Video

Follow similar steps for other models.

```sh
conda create -n llava python=3.10 -y
conda activate llava
pip install -r llavavideo.txt
```

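The other two environments follow the same pattern with their respective requirement files. A minimal sketch, assuming environment names and a Python version that are not specified in the README:

```sh
# Hypothetical setup for the remaining two models; adjust names and versions as needed.
conda create -n videochat2 python=3.10 -y
conda activate videochat2
pip install -r videochat2.txt

conda create -n longvu python=3.10 -y
conda activate longvu
pip install -r longvu.txt
```
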
```sh
python inference.py \
    --video_path "sample_video.mp4" \
    --question "Describe this video." \
    --model_max_length 1024
```

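To run this example over several clips at once, the same call can be wrapped in a small loop. A minimal sketch, assuming the `llava` environment is active and any checkpoint setup described earlier in the README has been completed:

```sh
# Loop the provided inference call over every .mp4 in a local folder;
# the videos/ directory and the fixed question are just placeholders.
for clip in videos/*.mp4; do
    python inference.py \
        --video_path "$clip" \
        --question "Describe this video." \
        --model_max_length 1024
done
```
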
## Dataset

Our training data is released as the [Self-Alignment Dataset](https://huggingface.co/datasets/pritamqu/self-alignment). We release the preferred and non-preferred responses used in self-alignment training.

```
git clone [email protected]:datasets/pritamqu/self-alignment
```

The related videos can be downloaded from their original sources. Please check the [VideoChat2-IT](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) GitHub page for details on downloading the source videos.

We also share additional details on how to use your own data [here](docs/DATA.md).

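If SSH access to the Hub is not set up, the same repository can also be fetched over HTTPS with the Hub CLI. A minimal sketch (the local directory name is just an example):

```sh
# Requires `pip install -U huggingface_hub`; downloads the dataset repo over HTTPS.
huggingface-cli download pritamqu/self-alignment \
    --repo-type dataset \
    --local-dir data/self-alignment
```
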
## Training

Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as follows.

VideoChat2
```
bash scripts/videochat2/run.sh
```
LLaVA-Video
```
bash scripts/llavavideo/run.sh
```
LongVU
```
bash scripts/longvu/run.sh
```

The links to the base model weights are:
- [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B)
- [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2)
- [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B)

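One way to fetch these base checkpoints before launching a run is the Hub CLI; a minimal sketch in which the target directories are illustrative and the training scripts may expect different paths:

```sh
# Illustrative downloads of the base model weights listed above;
# adjust --local-dir to wherever your training scripts expect them.
huggingface-cli download OpenGVLab/VideoChat2_stage3_Mistral_7B --local-dir weights/VideoChat2_stage3_Mistral_7B
huggingface-cli download lmms-lab/LLaVA-Video-7B-Qwen2 --local-dir weights/LLaVA-Video-7B-Qwen2
huggingface-cli download Vision-CAIR/LongVU_Qwen2_7B --local-dir weights/LongVU_Qwen2_7B
```
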
## Inference

We provide a simple setup for running inference with our trained models.

**VideoChat2**
```
bash scripts/inference_videochat2.sh
```

**LLaVA-Video**
```
bash scripts/inference_llavavideo.sh
```

**LongVU**
```
bash scripts/inference_longvu.sh
```

## Results

**RRPO shows consistent improvements over the base model and outperforms DPO across all benchmarks.**

| **Models** | **#Frames** | **TVBench** | **TempCompass** | **VideoHallucer** | **VidHalluc** | **MVBench** | **VideoMME** | **MLVU** | **LongVideoBench** |
|------------|-------------|-------------|-----------------|-------------------|---------------|-------------|--------------|----------|--------------------|
| VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | **60.2** | 41.0 | 46.4 | 40.4 |
| VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
| VideoChat2 + **RRPO** | 16 | **45.8** | **60.2** | **32.9** | **76.4** | 59.0 | **44.3** | **47.9** | **42.8** |
| | | | | | | | | | |
| LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
| LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
| LLaVA-Video + **RRPO** | 64 | 51.9 | 66.8 | 55.7 | 76.5 | **62.2** | **64.5** | 69.1 | **60.4** |
| LLaVA-Video + **RRPO** (32f) | 64 | **52.2** | **67.4** | **55.8** | **76.6** | 62.1 | **64.5** | **69.4** | 60.1 |
| | | | | | | | | | |
| LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
| LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
| LongVU + **RRPO** | 1fps | **56.5** | **64.5** | **44.0** | **71.7** | **66.8** | **57.7** | **64.5** | **49.7** |

## Evaluation

You can download the evaluation benchmarks from the links below:

- [TVBench](https://huggingface.co/datasets/FunAILab/TVBench)
- [TempCompass](https://huggingface.co/datasets/lmms-lab/TempCompass)
- [VideoHallucer](https://huggingface.co/datasets/bigai-nlco/VideoHallucer)
- [VidHalluc](https://huggingface.co/datasets/chaoyuli/VidHalluc)
- [MVBench](https://huggingface.co/datasets/PKU-Alignment/MVBench)
- [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME)
- [MLVU](https://huggingface.co/datasets/MLVU/MVLU)
- [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench)

Next, you can run the full set of evaluations following the instructions provided [here](./docs/EVALUATION.md).

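These benchmarks are hosted as Hub dataset repositories, so they can be fetched with the same CLI pattern shown earlier; a minimal sketch (the local directory is arbitrary):

```sh
# Example: download one benchmark (Video-MME); repeat for the others as needed.
huggingface-cli download lmms-lab/Video-MME --repo-type dataset --local-dir benchmarks/Video-MME
```
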
## Citation

If you find this work useful, please consider citing our paper:

```
@article{sarkar2025rrpo,
  title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
  author={Sarkar, Pritam and others},
  journal={arXiv preprint arXiv:2504.12083},
  year={2025}
}
```

## Usage and License Notices

This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses.
The assets used in this work include, but are not limited to:
[VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT),
[VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B),
[LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2),
[LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations.
This repository is released under the **Apache 2.0 License**. See [LICENSE](LICENSE) for details.

---

For any issues or questions, please open an issue or contact **Pritam Sarkar** at [email protected]!