nkkbr committed
Commit ed35572 · 1 Parent(s): 314ad38

update readme
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,7 +1,342 @@
 ---
-base_model: lmms-lab/LLaVA-Video-7B-Qwen2
+license: apache-2.0
 tags:
-- llava
+- multimodal
 - vision-language
-- fine-tuned
+- video understanding
+- spatial reasoning
+- visuospatial cognition
+- llava
+- qwen
+- llava-video
+datasets:
+- nkkbr/ViCA-322K
+- nkkbr/ViCA-thinking-2.68k
+language:
+- en
+library_name: transformers
+pipeline_tag: visual-question-answering
+model_name: ViCA-7B
+base_model: lmms-lab/LLaVA-Video-7B-Qwen2
+---

The rest of the change adds the new model card content:

# ViCA-7B: Visuospatial Cognitive Assistant

## Overview

**ViCA-7B** is a vision-language model specifically fine-tuned for *visuospatial reasoning* in indoor video environments. Built upon the LLaVA-Video-7B-Qwen2 architecture, it is trained on our newly proposed **ViCA-322K dataset**, which emphasizes both structured spatial annotations and complex instruction-based reasoning tasks.

ViCA-7B achieves **state-of-the-art performance** on [VSI-Bench](https://github.com/vision-x-nyu/thinking-in-space), outperforming proprietary models such as **GPT-4o** and **Gemini-1.5 Pro** as well as larger open-source baselines.

> **ViCA-7B sets a new standard for open-source multimodal spatial reasoning on indoor videos, making it a strong candidate for embodied AI and robotics use cases.**

<p align="center">
  <img src="assets/vsi-bench-comparison.svg" width="700"/>
</p>

<p align="center"><b>Figure 1:</b> Performance comparison of ViCA-7B and other models on <a href="https://github.com/vision-x-nyu/thinking-in-space">VSI-Bench</a>.</p>

## Model Architecture and Training Strategy

ViCA-7B is built upon the [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) framework, using **Qwen2-7B** as the language backbone and **SigLIP** as the visual encoder.

**Key Training Features**

- **Fixed-Length Visual Tokenization**
  Each video is uniformly sampled into 64 frames, and each frame is encoded into 210 visual tokens, for a total of **13,440 visual tokens per example** (see the sketch below). This fixed-length design ensures consistent memory usage and stable optimization across batches.

- **Multimodal Alignment via Lightweight Projector**
  A simple MLP-based projector maps visual embeddings into the language embedding space, enabling effective fusion between video content and textual prompts during both training and inference.

- **Efficient Distributed Training with DeepSpeed**
  Training is conducted with **DeepSpeed ZeRO-3 Offload** on **8× NVIDIA H100 80GB GPUs**, with full parameter and optimizer state partitioning across devices. This setup supports large batch sizes and minimizes GPU memory overhead.

- **Mixed-Precision Computation (fp16)**
  We adopt **mixed-precision training (fp16)** to accelerate computation and reduce memory usage without compromising accuracy. Combined with ZeRO-3 partitioning, this further improves training scalability.

Training was conducted over **55 hours**, covering both the base and complex spatial reasoning subsets.
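
To make the fixed-length token budget concrete, here is a minimal, illustrative sketch of the visual-to-language projection described above. The token counts follow this section (64 frames × 210 tokens = 13,440); the hidden sizes and the two-layer GELU MLP are assumptions for illustration, not the exact ViCA-7B implementation.

```python
# Minimal sketch of the fixed-length visual tokenization + MLP projector.
# Shapes follow this section (64 frames x 210 tokens = 13,440 visual tokens);
# VISION_DIM / LM_DIM and the 2-layer GELU MLP are illustrative assumptions.
import torch
import torch.nn as nn

NUM_FRAMES, TOKENS_PER_FRAME = 64, 210
VISION_DIM, LM_DIM = 1152, 3584  # assumed hidden sizes, not read from the model config

class VisualProjector(nn.Module):
    """Maps visual token embeddings into the language embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.mlp(visual_tokens)

# One video -> a fixed 13,440-token visual sequence, regardless of video length.
frames = torch.randn(NUM_FRAMES, TOKENS_PER_FRAME, VISION_DIM)  # dummy encoder output
visual_tokens = frames.reshape(1, NUM_FRAMES * TOKENS_PER_FRAME, VISION_DIM)
projected = VisualProjector(VISION_DIM, LM_DIM)(visual_tokens)
print(projected.shape)  # torch.Size([1, 13440, 3584])
```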

## Training Dynamics

<p align="center">
  <img src="assets/training_record/vica-train_loss_with_ema.svg" width="30%"/>
  <img src="assets/training_record/vica-train_learning_rate.svg" width="30%"/>
  <img src="assets/training_record/vica-train_grad_norm.svg" width="30%"/>
</p>

<p align="center">
  <b>Figure 2:</b> Training loss, learning rate schedule, and gradient norm curves during ViCA-7B fine-tuning.
  These curves illustrate a stable optimization process and smooth convergence under the DeepSpeed ZeRO-3 setup.
</p>

## Dataset

ViCA-7B is fine-tuned on two complementary datasets:

- [**ViCA-322K**](https://huggingface.co/datasets/nkkbr/ViCA-322K):
  A large-scale dataset covering both **base spatial reasoning tasks** (e.g., object distance, size, count, appearance order) and **complex spatial reasoning tasks** involving natural language questions and scene understanding. This dataset forms the core of the model's spatial reasoning capabilities.

- [**ViCA-thinking-2.68k**](https://huggingface.co/datasets/nkkbr/ViCA-thinking-2.68k):
  A focused dataset used for instruction tuning to enhance the model's ability to **generate step-by-step reasoning traces** before producing final answers. This supports more interpretable and cognitively aligned response generation.

For details, please refer to the individual dataset pages linked above.
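
For a quick look at the training data, the sketch below streams a few ViCA-322K records with the `datasets` library. The repository id comes from this card; the `train` split name is an assumption, and since the field names are not documented here, the snippet only prints whichever keys it finds.

```python
# Peek at ViCA-322K without downloading the full dataset.
# Assumptions: a "train" split exists and records are plain dicts;
# field names are not documented here, so we just print the keys.
from itertools import islice
from datasets import load_dataset

ds = load_dataset("nkkbr/ViCA-322K", split="train", streaming=True)
for example in islice(ds, 3):
    print(sorted(example.keys()))
```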

## Evaluation on VSI-Bench

<p align="center">
  <img src="assets/vsi-bench-table.png" width="800"/>
</p>

<p align="center"><b>Figure 3:</b> Quantitative comparison of ViCA-7B and baseline models on <a href="https://github.com/vision-x-nyu/thinking-in-space">VSI-Bench</a>. ViCA-7B achieves the best overall performance across both numerical and multiple-choice tasks.</p>

### Effect of CSR Data

| Configuration | Avg Score |
| -------------------- | --------- |
| Base-only (281K) | 55.39 |
| Full with CSR (322K) | **60.14** |

> CSR (Complex Spatial Reasoning) data boosts generalization and **accelerates learning**, with notable performance jumps at intermediate checkpoints (e.g., +2.02 at 50–55%).

### Data Scale vs. Performance

Performance improves significantly between **5% → 60%** of data usage. Beyond **80%**, improvements plateau, indicating that the dataset is well matched to the model's capacity.

<p align="center">
  <img src="assets/data-scale-csr-effect.svg" width="750"/>
</p>

<p align="center"><b>Figure 4:</b> Performance of ViCA-7B under varying training data sizes (from 5% to 100%). The full dataset (including Complex Spatial Reasoning, CSR) consistently outperforms the base-only configuration. Notably, the CSR-enhanced model shows a +2.02 score jump between 50% and 55%, and a final performance gain of +4.75 at full scale. Performance plateaus beyond 80%, indicating that the dataset is well aligned with the model's capacity.</p>

## Intermediate Checkpoints and Evaluation Outputs

To support detailed analysis and reproducibility, we provide two sets of intermediate checkpoints saved at every **5% increment** of the training data. These models are trained for a single epoch and are useful for understanding how performance evolves as training progresses.

We also release the corresponding **raw evaluation outputs** (e.g., `.json` prediction files) for each checkpoint.
The evaluation script used to produce these outputs is available in our [GitHub repository](https://github.com/nkkbr/ViCA).

### Full Dataset (ViCA-322K: Base + CSR)

This series corresponds to the full training set, including both base spatial reasoning and complex spatial reasoning (CSR):

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 5% | [`nkkbr/ViCA-5p`](https://huggingface.co/nkkbr/ViCA-5p) | 55% | [`nkkbr/ViCA-55p`](https://huggingface.co/nkkbr/ViCA-55p) |
| 10% | [`nkkbr/ViCA-10p`](https://huggingface.co/nkkbr/ViCA-10p) | 60% | [`nkkbr/ViCA-60p`](https://huggingface.co/nkkbr/ViCA-60p) |
| 15% | [`nkkbr/ViCA-15p`](https://huggingface.co/nkkbr/ViCA-15p) | 65% | [`nkkbr/ViCA-65p`](https://huggingface.co/nkkbr/ViCA-65p) |
| 20% | [`nkkbr/ViCA-20p`](https://huggingface.co/nkkbr/ViCA-20p) | 70% | [`nkkbr/ViCA-70p`](https://huggingface.co/nkkbr/ViCA-70p) |
| 25% | [`nkkbr/ViCA-25p`](https://huggingface.co/nkkbr/ViCA-25p) | 75% | [`nkkbr/ViCA-75p`](https://huggingface.co/nkkbr/ViCA-75p) |
| 30% | [`nkkbr/ViCA-30p`](https://huggingface.co/nkkbr/ViCA-30p) | 80% | [`nkkbr/ViCA-80p`](https://huggingface.co/nkkbr/ViCA-80p) |
| 35% | [`nkkbr/ViCA-35p`](https://huggingface.co/nkkbr/ViCA-35p) | 85% | [`nkkbr/ViCA-85p`](https://huggingface.co/nkkbr/ViCA-85p) |
| 40% | [`nkkbr/ViCA-40p`](https://huggingface.co/nkkbr/ViCA-40p) | 90% | [`nkkbr/ViCA-90p`](https://huggingface.co/nkkbr/ViCA-90p) |
| 45% | [`nkkbr/ViCA-45p`](https://huggingface.co/nkkbr/ViCA-45p) | 95% | [`nkkbr/ViCA-95p`](https://huggingface.co/nkkbr/ViCA-95p) |
| 50% | [`nkkbr/ViCA-50p`](https://huggingface.co/nkkbr/ViCA-50p) | 100% (This repo) | [`nkkbr/ViCA`](https://huggingface.co/nkkbr/ViCA) |

Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_all_data/).
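
To analyze these predictions locally, something like the following fetches only the raw evaluation outputs from this repository via `huggingface_hub`. The subfolder name comes from the link above; the per-file naming scheme is not documented here, so the snippet simply lists whatever `.json` files it downloads.

```python
# Download only the raw VSI-Bench prediction files from this repo and list them.
# The subfolder name matches the link above; the per-file layout is an assumption.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nkkbr/ViCA",
    allow_patterns=["raw_evaluation_outputs/vsi-bench_all_data/*"],
)
for path in sorted(Path(local_dir).rglob("*.json")):
    print(path.relative_to(local_dir))
```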

### Base-only Subset (ViCA-322K: Base)

This series is trained **only** on the base spatial reasoning subset of ViCA-322K, without any CSR examples:

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 5% | [`nkkbr/ViCA-base-5p`](https://huggingface.co/nkkbr/ViCA-base-5p) | 55% | [`nkkbr/ViCA-base-55p`](https://huggingface.co/nkkbr/ViCA-base-55p) |
| 10% | [`nkkbr/ViCA-base-10p`](https://huggingface.co/nkkbr/ViCA-base-10p) | 60% | [`nkkbr/ViCA-base-60p`](https://huggingface.co/nkkbr/ViCA-base-60p) |
| 15% | [`nkkbr/ViCA-base-15p`](https://huggingface.co/nkkbr/ViCA-base-15p) | 65% | [`nkkbr/ViCA-base-65p`](https://huggingface.co/nkkbr/ViCA-base-65p) |
| 20% | [`nkkbr/ViCA-base-20p`](https://huggingface.co/nkkbr/ViCA-base-20p) | 70% | [`nkkbr/ViCA-base-70p`](https://huggingface.co/nkkbr/ViCA-base-70p) |
| 25% | [`nkkbr/ViCA-base-25p`](https://huggingface.co/nkkbr/ViCA-base-25p) | 75% | [`nkkbr/ViCA-base-75p`](https://huggingface.co/nkkbr/ViCA-base-75p) |
| 30% | [`nkkbr/ViCA-base-30p`](https://huggingface.co/nkkbr/ViCA-base-30p) | 80% | [`nkkbr/ViCA-base-80p`](https://huggingface.co/nkkbr/ViCA-base-80p) |
| 35% | [`nkkbr/ViCA-base-35p`](https://huggingface.co/nkkbr/ViCA-base-35p) | 85% | [`nkkbr/ViCA-base-85p`](https://huggingface.co/nkkbr/ViCA-base-85p) |
| 40% | [`nkkbr/ViCA-base-40p`](https://huggingface.co/nkkbr/ViCA-base-40p) | 90% | [`nkkbr/ViCA-base-90p`](https://huggingface.co/nkkbr/ViCA-base-90p) |
| 45% | [`nkkbr/ViCA-base-45p`](https://huggingface.co/nkkbr/ViCA-base-45p) | 95% | [`nkkbr/ViCA-base-95p`](https://huggingface.co/nkkbr/ViCA-base-95p) |
| 50% | [`nkkbr/ViCA-base-50p`](https://huggingface.co/nkkbr/ViCA-base-50p) | 100% | [`nkkbr/ViCA-base`](https://huggingface.co/nkkbr/ViCA-base) |

Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_only_base/).

## Source-wise Checkpoints

While the full **ViCA-322K** dataset was curated by us, the underlying videos and associated metadata are sourced from three distinct indoor video datasets:

* **ARKitScenes**
* **ScanNet**
* **ScanNet++**

To better understand how each source contributes to model performance, we fine-tuned ViCA-7B on subsets of ViCA-322K that exclusively use data from each source. For each subset, we provide checkpoints trained with **10% increments** of the available data, from 10% to 100%.

Corresponding **raw evaluation outputs** (e.g., `.json` predictions) are also provided for all checkpoints.

### ARKitScenes-Only Checkpoints

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 10% | [`nkkbr/ViCA-ARKitScenes-10p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-10p) | 60% | [`nkkbr/ViCA-ARKitScenes-60p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-60p) |
| 20% | [`nkkbr/ViCA-ARKitScenes-20p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-20p) | 70% | [`nkkbr/ViCA-ARKitScenes-70p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-70p) |
| 30% | [`nkkbr/ViCA-ARKitScenes-30p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-30p) | 80% | [`nkkbr/ViCA-ARKitScenes-80p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-80p) |
| 40% | [`nkkbr/ViCA-ARKitScenes-40p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-40p) | 90% | [`nkkbr/ViCA-ARKitScenes-90p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-90p) |
| 50% | [`nkkbr/ViCA-ARKitScenes-50p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-50p) | 100% | [`nkkbr/ViCA-ARKitScenes`](https://huggingface.co/nkkbr/ViCA-ARKitScenes) |

🔗 Raw evaluation outputs: [ARKitScenes results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_arkitscenes/)
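
Any checkpoint in these tables can be loaded with the same LLaVA-NeXT loader used in the Inference section below; only the repository id changes. A minimal sketch, using `nkkbr/ViCA-ARKitScenes-50p` from the table above:

```python
# Load one source-wise checkpoint; swap the repo id for any entry in the tables above.
# Mirrors the loader call in the Inference section; requires the LLaVA-NeXT package.
from llava.model.builder import load_pretrained_model

pretrained = "nkkbr/ViCA-ARKitScenes-50p"  # any checkpoint id from the tables
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, "llava_qwen", torch_dtype="bfloat16", device_map="auto"
)
model.eval()
```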

### ScanNet++-Only Checkpoints

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 10% | [`nkkbr/ViCA-ScanNetPP-10p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-10p) | 60% | [`nkkbr/ViCA-ScanNetPP-60p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-60p) |
| 20% | [`nkkbr/ViCA-ScanNetPP-20p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-20p) | 70% | [`nkkbr/ViCA-ScanNetPP-70p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-70p) |
| 30% | [`nkkbr/ViCA-ScanNetPP-30p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-30p) | 80% | [`nkkbr/ViCA-ScanNetPP-80p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-80p) |
| 40% | [`nkkbr/ViCA-ScanNetPP-40p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-40p) | 90% | [`nkkbr/ViCA-ScanNetPP-90p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-90p) |
| 50% | [`nkkbr/ViCA-ScanNetPP-50p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-50p) | 100% | [`nkkbr/ViCA-ScanNetPP`](https://huggingface.co/nkkbr/ViCA-ScanNetPP) |

🔗 Raw evaluation outputs: [ScanNet++ results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_scannetpp/)

### ScanNet-Only Checkpoints

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 10% | [`nkkbr/ViCA-ScanNet-10p`](https://huggingface.co/nkkbr/ViCA-ScanNet-10p) | 60% | [`nkkbr/ViCA-ScanNet-60p`](https://huggingface.co/nkkbr/ViCA-ScanNet-60p) |
| 20% | [`nkkbr/ViCA-ScanNet-20p`](https://huggingface.co/nkkbr/ViCA-ScanNet-20p) | 70% | [`nkkbr/ViCA-ScanNet-70p`](https://huggingface.co/nkkbr/ViCA-ScanNet-70p) |
| 30% | [`nkkbr/ViCA-ScanNet-30p`](https://huggingface.co/nkkbr/ViCA-ScanNet-30p) | 80% | [`nkkbr/ViCA-ScanNet-80p`](https://huggingface.co/nkkbr/ViCA-ScanNet-80p) |
| 40% | [`nkkbr/ViCA-ScanNet-40p`](https://huggingface.co/nkkbr/ViCA-ScanNet-40p) | 90% | [`nkkbr/ViCA-ScanNet-90p`](https://huggingface.co/nkkbr/ViCA-ScanNet-90p) |
| 50% | [`nkkbr/ViCA-ScanNet-50p`](https://huggingface.co/nkkbr/ViCA-ScanNet-50p) | 100% | [`nkkbr/ViCA-ScanNet`](https://huggingface.co/nkkbr/ViCA-ScanNet) |

🔗 Raw evaluation outputs: [ScanNet results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_scannet/)

## Additional Probing

### Time Instructions

Including the 64 frame timestamps in the prompt slightly **hurts** performance, suggesting that the model fails to leverage temporal alignment and is negatively affected by the added instruction verbosity.

<p align="center">
  <img src="assets/table3.png" width="400"/>
</p>

<p align="center"><b>Figure 5:</b> Adding explicit frame timestamps (64 values) degrades model performance on VSI-Bench, indicating an inability to exploit temporal alignment and sensitivity to prompt length.</p>
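
For reference, the time instruction probed here is the same one left commented out in the inference script below. A small sketch of how it is assembled from the sampled frame times (the numeric values are dummies for illustration):

```python
# Build the optional time instruction probed in this experiment.
# Dummy values stand in for load_video's (video_time, frame_time) outputs.
num_frames = 64
video_time = 91.30  # seconds; illustrative value
frame_times = [i * video_time / num_frames for i in range(num_frames)]
frame_time = ",".join(f"{t:.2f}s" for t in frame_times)

time_instruction = (
    f"The video lasts for {video_time:.2f} seconds, and {num_frames} frames are "
    f"uniformly sampled from it. These frames are located at {frame_time}. "
    f"Please answer the following questions related to this video."
)
print(time_instruction[:120] + "...")
```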

---

### More Frames

Increasing the input from 64 to 128 frames doubles the number of visual tokens (13,440 → 26,880) but yields **no performance gain**, highlighting overfitting to the fixed token length and limited architectural flexibility.

<p align="center">
  <img src="assets/table2.png" width="400"/>
</p>

<p align="center"><b>Figure 6:</b> Comparison between 64-frame and 128-frame inputs. Despite doubling the visual token count, performance remains unchanged, indicating overfitting to fixed-length input and limited adaptability to variable-length sequences.</p>

## Potential Applications

ViCA-7B supports a broad range of spatially grounded multimodal applications:

- **Indoor navigation assistants**
- **Robotics planning and spatial querying**
- **Smart room arrangement and AR layout analysis**
- **Scene understanding for embodied AI agents**

## Known Limitations

- Limited temporal reasoning: time instructions are not effectively utilized
- Frame scaling issues: the model expects a fixed input length
- No depth or point-cloud input: only RGB video is supported
- Zero-shot generalization is good, but not task-agnostic

## Inference

*Here is a runnable example using ViCA-7B on a VSI-Bench question.*

249
+ ```python
250
+ # This inference script is adapted from:
251
+ # https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2
252
+
253
+ # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
254
+ from llava.model.builder import load_pretrained_model
255
+ from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
256
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
257
+ from llava.conversation import conv_templates, SeparatorStyle
258
+ from PIL import Image
259
+ import requests
260
+ import copy
261
+ import torch
262
+ import sys
263
+ import warnings
264
+ from decord import VideoReader, cpu
265
+ import numpy as np
266
+ import json
267
+ from tqdm import tqdm
268
+ import os
269
+
270
+ warnings.filterwarnings("ignore")
271
+ def load_video(video_path, max_frames_num,fps=1,force_sample=False):
272
+ if max_frames_num == 0:
273
+ return np.zeros((1, 336, 336, 3))
274
+ vr = VideoReader(video_path, ctx=cpu(0),num_threads=1)
275
+ total_frame_num = len(vr)
276
+ video_time = total_frame_num / vr.get_avg_fps()
277
+ fps = round(vr.get_avg_fps()/fps)
278
+ frame_idx = [i for i in range(0, len(vr), fps)]
279
+ frame_time = [i/fps for i in frame_idx]
280
+ if len(frame_idx) > max_frames_num or force_sample:
281
+ sample_fps = max_frames_num
282
+ uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
283
+ frame_idx = uniform_sampled_frames.tolist()
284
+ frame_time = [i/vr.get_avg_fps() for i in frame_idx]
285
+ frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
286
+ spare_frames = vr.get_batch(frame_idx).asnumpy()
287
+ # import pdb;pdb.set_trace()
288
+ return spare_frames,frame_time,video_time
289
+ pretrained = 'nkkbr/ViCA'
290
+ model_name = "llava_qwen"
291
+ device = "cuda"
292
+ device_map = "auto"
293
+ tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) # Add any other thing you want to pass in llava_model_args
294
+ model.eval()
295
+
296
+
297
+ from datasets import load_dataset
298
+ vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
299
+ vsi_bench = vsi_bench['test']
300
+
301
+ data_curr = vsi_bench[1000]
302
+
303
+ video_path = f"[VIDEO PATH]"
304
+ max_frames_num = 64
305
+ video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
306
+ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
307
+ video = [video]
308
+ conv_template = "qwen_1_5"
309
+ # time_instruciton = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video."
310
+ time_instruciton = ""
311
+
312
+ question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruciton}\n\n"
313
+ question += f"These are frames of a video.\n\n"
314
+ question += f"Question: {data_curr['question']}\n"
315
+ if data_curr['options'] is not None:
316
+ question += '\n'.join(data_curr['options']) + "\n"
317
+ question += f"Answer with the option’s letter from the given choices directly.\n"
318
+ else:
319
+ question += f"Please answer the question using a single word or phrase.\n"
320
+ print(f"Prompt:\n{question}")
321
+
322
+ conv = copy.deepcopy(conv_templates[conv_template])
323
+ conv.append_message(conv.roles[0], question)
324
+ conv.append_message(conv.roles[1], None)
325
+ prompt_question = conv.get_prompt()
326
+ input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
327
+
328
+ cont = model.generate(
329
+ input_ids,
330
+ images=video,
331
+ modalities= ["video"],
332
+ do_sample=False,
333
+ temperature=0,
334
+ max_new_tokens=1024,
335
+ )
336
+ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
337
+
338
+ print(repr(text_outputs))
339
+ ```
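
The script above answers a single VSI-Bench question. To produce prediction files analogous to the raw evaluation outputs released in this repository, the same prompt-building and generation steps can be wrapped in a loop, as in the rough sketch below. It reuses the objects defined above; the `resolve_video_path` helper and the `scene_name` key are assumptions about how your local copy of the benchmark videos is organized, and the output schema is illustrative rather than the one used for the released files.

```python
# Sketch: batch evaluation over VSI-Bench, reusing model, tokenizer, image_processor,
# vsi_bench, load_video, conv_template, etc. defined in the script above.
# `resolve_video_path` is a placeholder; adapt it to your local video layout.
def resolve_video_path(record):
    return os.path.join("[VSI-BENCH VIDEO DIR]", f"{record.get('scene_name', 'unknown')}.mp4")

def answer_one(record):
    video, frame_time, video_time = load_video(resolve_video_path(record), 64, 1, force_sample=True)
    frames = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
    prompt = DEFAULT_IMAGE_TOKEN + f"\n\n\nThese are frames of a video.\n\nQuestion: {record['question']}\n"
    if record['options'] is not None:
        prompt += '\n'.join(record['options']) + "\nAnswer with the option’s letter from the given choices directly.\n"
    else:
        prompt += "Please answer the question using a single word or phrase.\n"
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], prompt)
    conv.append_message(conv.roles[1], None)
    ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    out = model.generate(ids, images=[frames], modalities=["video"], do_sample=False, temperature=0, max_new_tokens=1024)
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0].strip()

predictions = []
for record in tqdm(vsi_bench.select(range(10))):  # small subset for illustration
    predictions.append({"question": record["question"], "prediction": answer_one(record)})

with open("vsi_bench_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```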

---
assets/data-scale-csr-effect.svg ADDED
assets/table2.png ADDED
assets/table3.png ADDED
assets/training_record/vica-train_grad_norm.svg ADDED
assets/training_record/vica-train_learning_rate.svg ADDED
assets/training_record/vica-train_loss_with_ema.svg ADDED
assets/vsi-bench-comparison.svg ADDED
assets/vsi-bench-table.png ADDED