lixinhao committed (verified) · Commit 0fa455a · Parent: a4c5898

Update README.md

Files changed (1): README.md (+100 -3)
README.md CHANGED
---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2-VL-7B-Instruct
---

# 💡 VideoChat-R1_7B_caption

[\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-R1)
[\[📜 Tech Report\]](https://arxiv.org/pdf/2504.06958)

## 🚀 How to use the model

We provide a simple installation example below:
```
pip install transformers
pip install qwen_vl_utils
```
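Note that the example below loads the model with `attn_implementation="flash_attention_2"`, which additionally requires the optional `flash-attn` package on a CUDA machine; you can either install it (a typical install command is shown below) or simply drop that argument to fall back to the default attention implementation:
```
pip install flash-attn --no-build-isolation
```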
Then you can use the model:
```python
from transformers import Qwen2_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B_caption"

# Default: load the model on the available device(s)
model = Qwen2_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Describe the video in detail."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"{question} First output the thinking process in <think> </think> tags and then output the final answer in <answer> </answer> tags"},
        ],
    }
]

# In Qwen2-VL, frame-rate information is also fed into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference: generate, then strip the prompt tokens from the generated sequence
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
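The prompt asks the model to put its reasoning inside `<think> </think>` tags and the final caption inside `<answer> </answer>` tags. If you only want the caption, a minimal post-processing sketch could look like this (the `extract_answer` helper is our own illustration, not part of the released code):
```python
import re

def extract_answer(text: str) -> str:
    """Return the content of the <answer> block, or the raw text if the tags are missing."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

# output_text is the list returned by processor.batch_decode above
caption = extract_answer(output_text[0])
print(caption)
```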

## ✍️ Citation

```bibtex
@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}
```