|
--- |
|
license: apache-2.0 |
|
tags: |
|
- multimodal |
|
- vision-language |
|
- video understanding |
|
- spatial reasoning |
|
- visuospatial cognition |
|
- llava |
|
- qwen |
|
- llava-video |
|
datasets: |
|
- nkkbr/ViCA-322K |
|
- nkkbr/ViCA-thinking-2.68k |
|
language: |
|
- en |
|
library_name: transformers |
|
pipeline_tag: video-text-to-text |
|
model_name: ViCA-7B |
|
base_model: lmms-lab/LLaVA-Video-7B-Qwen2 |
|
model-index: |
|
- name: ViCA-7B |
|
results: |
|
- task: |
|
type: visual-question-answering |
|
dataset: |
|
name: VSI-Bench |
|
type: vsi-bench |
|
metrics: |
|
- type: score |
|
value: 60.56 |
|
name: Average |
|
verified: false |
|
- type: MRA |
|
value: 68.81 |
|
name: Object Count |
|
- type: MRA |
|
value: 57.01 |
|
name: Absolute Distance |
|
- type: MRA |
|
value: 79.17 |
|
name: Object Size |
|
- type: MRA |
|
value: 75.14 |
|
name: Room Size |
|
- type: accuracy |
|
value: 58.45 |
|
name: Relative Distance |
|
- type: accuracy |
|
value: 42.56 |
|
name: Relative Direction |
|
- type: accuracy |
|
value: 34.54 |
|
name: Route Plan |
|
- type: accuracy |
|
value: 68.77 |
|
name: Appearance Order |
|
--- |
|
|
|
<div align="center"> |
|
<img src="assets/banner.png" alt="ViCA Banner"/> |
|
</div> |
|
|
|
# ViCA-7B: Visuospatial Cognitive Assistant |
|
|
|
> You may also be interested in our other project, **ViCA2**. Please refer to the following links: |
|
|
|
[GitHub repository](https://github.com/nkkbr/ViCA)
|
|
|
[ViCA2 on Hugging Face](https://huggingface.co/nkkbr/ViCA2)
|
|
|
## Overview |
|
|
|
**ViCA-7B** is a vision-language model specifically fine-tuned for *visuospatial reasoning* in indoor video environments. Built upon the LLaVA-Video-7B-Qwen2 architecture, it is trained using our newly proposed **ViCA-322K dataset**, which emphasizes both structured spatial annotations and complex instruction-based reasoning tasks. |
|
|
|
ViCA-7B achieves **state-of-the-art performance** on [VSI-Bench](https://github.com/vision-x-nyu/thinking-in-space), outperforming proprietary models such as **GPT-4o** and **Gemini-1.5 Pro** as well as larger open-source baselines.
|
|
|
> **ViCA-7B sets a new standard for open-source multimodal spatial reasoning on indoor videos, making it a strong candidate for embodied AI and robotics use cases.** |
|
|
|
<p align="center"> |
|
<img src="assets/vsi-bench-comparison.svg" width="700"/> |
|
</p> |
|
|
|
<p align="center"><b>Figure 1:</b> Performance comparison of ViCA-7B and other models on <a href="https://github.com/vision-x-nyu/thinking-in-space">VSI-Bench</a>.</p> |
|
|
|
|
|
## Model Architecture and Training Strategy |
|
|
|
ViCA-7B is built upon the [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) framework, using **Qwen2-7B** as the language backbone and **SigLIP** as the visual encoder. |
|
|
|
**Key Training Features** |
|
|
|
- **Fixed-Length Visual Tokenization** |
|
Each video is uniformly sampled into 64 frames, and each frame is encoded into 210 visual tokens, resulting in a total of **13,440 visual tokens per example**. This fixed-length design ensures consistent memory usage and stable optimization across batches. |
|
|
|
- **Multimodal Alignment via Lightweight Projector** |
|
A simple MLP-based projector maps visual embeddings into the language embedding space, enabling effective fusion between video content and textual prompts during both training and inference. |
|
|
|
- **Efficient Distributed Training with DeepSpeed** |
|
Training is conducted using **DeepSpeed ZeRO-3 Offload** on **8× NVIDIA H100 80GB GPUs**, with full parameter and optimizer state partitioning across devices. This setup supports large batch sizes and minimizes GPU memory overhead. |
|
|
|
- **Mixed-Precision Computation (fp16)** |
|
We adopt **mixed-precision training (fp16)** to accelerate computation and reduce memory usage, without compromising accuracy. This is combined with ZeRO-3 partitioning to further enhance training scalability. |
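
To make the fixed-length tokenization and projector described above concrete, here is a minimal, illustrative sketch. The token arithmetic (64 × 210 = 13,440) comes from this card; the hidden widths `vision_dim` and `lm_dim` and the two-layer MLP are assumed values for illustration, not the exact shapes of the released checkpoints.

```python
import torch
import torch.nn as nn

# Fixed-length visual tokenization: 64 uniformly sampled frames,
# each encoded into 210 visual tokens -> 13,440 visual tokens per example.
NUM_FRAMES = 64
TOKENS_PER_FRAME = 210
total_visual_tokens = NUM_FRAMES * TOKENS_PER_FRAME  # 13,440

# Illustrative hidden sizes (assumptions, not the released model's exact config).
vision_dim = 1152  # SigLIP-style visual embedding width
lm_dim = 3584      # Qwen2-7B-style language hidden width

# Lightweight MLP projector mapping visual embeddings into the language
# embedding space, where they are fused with the text tokens.
projector = nn.Sequential(
    nn.Linear(vision_dim, lm_dim),
    nn.GELU(),
    nn.Linear(lm_dim, lm_dim),
)

dummy_frame_features = torch.randn(1, total_visual_tokens, vision_dim)
visual_embeddings = projector(dummy_frame_features)
print(visual_embeddings.shape)  # torch.Size([1, 13440, 3584])
```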
|
|
|
|
|
The training was conducted over **55 hours**, covering both base and complex spatial reasoning subsets. |
|
|
|
## Training Dynamics |
|
|
|
<p align="center"> |
|
<img src="assets/training_record/vica-train_loss_with_ema.svg" width="100%"/> |
|
<img src="assets/training_record/vica-train_learning_rate.svg" width="100%"/> |
|
<img src="assets/training_record/vica-train_grad_norm.svg" width="100%"/> |
|
</p> |
|
|
|
<p align="center"> |
|
<b>Figure 2:</b> Training loss, learning rate schedule, and gradient norm curves during ViCA-7B fine-tuning. |
|
These curves illustrate a stable optimization process and smooth convergence under the DeepSpeed ZeRO-3 setup. |
|
</p> |
|
|
|
## Dataset |
|
|
|
ViCA-7B is fine-tuned on two complementary datasets: |
|
|
|
- [**ViCA-322K**](https://huggingface.co/datasets/nkkbr/ViCA-322K): |
|
A large-scale dataset covering both **base spatial reasoning tasks** (e.g., object distance, size, count, appearance order) and **complex spatial reasoning tasks** involving natural language questions and scene understanding. This dataset forms the core of the model's spatial reasoning capabilities. |
|
|
|
- [**ViCA-thinking-2.68k**](https://huggingface.co/datasets/nkkbr/ViCA-thinking-2.68k): |
|
A focused dataset used for instruction tuning to enhance the model's ability to **generate step-by-step reasoning traces** before outputting final answers. This supports more interpretable and cognitively-aligned response generation. |
|
|
|
For details, please refer to the individual dataset pages linked above. |
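
If you want to see what ships with each corpus before downloading anything, you can list the repository contents programmatically. A small sketch using `huggingface_hub`; only the repository ids above are taken from this card, the file layout is simply whatever the dataset pages contain:

```python
from huggingface_hub import list_repo_files

# Both corpora are hosted as dataset repositories on the Hugging Face Hub.
for repo_id in ("nkkbr/ViCA-322K", "nkkbr/ViCA-thinking-2.68k"):
    files = list_repo_files(repo_id, repo_type="dataset")
    print(f"{repo_id}: {len(files)} files")
    for path in files[:5]:  # preview the first few entries
        print("  ", path)
```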
|
|
|
## Evaluation: VSI-BENCH Benchmark |
|
|
|
<p align="center"> |
|
<img src="assets/vsi-bench-table.png" width="800"/> |
|
</p> |
|
|
|
<p align="center"><b>Figure 3:</b> Quantitative comparison of ViCA-7B and baseline models on <a href="https://github.com/vision-x-nyu/thinking-in-space">VSI-Bench</a>. ViCA-7B achieves the best overall performance across both numerical and multiple-choice tasks.</p> |
|
|
|
### Effect of CSR Data |
|
|
|
| Configuration | Avg Score | |
|
|----------------------|-----------| |
|
| Base-only (281K) | 55.35 | |
|
| Full with CSR (322K) | **60.56** | |
|
|
|
> CSR (Complex Spatial Reasoning) boosts generalization and **accelerates learning**, with notable performance jumps at intermediate checkpoints (e.g., +2.02 at 50–55%).
|
|
|
### Data Scale vs. Performance |
|
|
|
Performance improves significantly as data usage grows from **5% to 60%**. Beyond **80%**, improvements plateau, indicating that the dataset is well matched to the model's capacity.
|
|
|
<p align="center"> |
|
<img src="assets/data-scale-csr-effect.svg" width="750"/> |
|
</p> |
|
|
|
<p align="center"><b>Figure 4:</b> Performance of ViCA-7B under varying training data sizes (from 5% to 100%). The full dataset (including Complex Spatial Reasoning, CSR) consistently outperforms the base-only configuration. Notably, the CSR-enhanced model shows a +2.02 score jump between 50% and 55%, and a final performance gain of +4.75 at full scale. Performance plateaus beyond 80%, indicating the dataset is well-aligned with the model capacity.</p> |
|
|
|
## Intermediate Checkpoints and Evaluation Outputs |
|
|
|
To support detailed analysis and reproducibility, we provide two sets of intermediate checkpoints saved at every **5% increment** of the training data. These models are trained for a single epoch and are useful for understanding how performance evolves as training progresses. |
|
|
|
We also release the corresponding **raw evaluation outputs** (e.g., `.json` prediction files) for each checkpoint. |
|
The evaluation script used to produce these outputs is available in our [GitHub repository](https://github.com/nkkbr/ViCA). |
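
For example, the raw outputs for the full-data run can be fetched without cloning the entire repository. A minimal sketch with `huggingface_hub`; the directory name matches the links below, but the file names inside it are not listed on this card, so the recursive JSON glob is an assumption about the layout:

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download only the raw VSI-Bench outputs for the full-data (Base + CSR) model.
local_dir = snapshot_download(
    repo_id="nkkbr/ViCA",
    allow_patterns=["raw_evaluation_outputs/vsi-bench_all_data/*"],
)

# Load whatever JSON prediction files were downloaded.
for json_path in sorted(Path(local_dir, "raw_evaluation_outputs").rglob("*.json")):
    with open(json_path) as f:
        predictions = json.load(f)
    print(json_path.name, len(predictions))
```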
|
|
|
### Full Dataset (ViCA-322K: Base + CSR) |
|
|
|
This series corresponds to the full training set, including both base spatial reasoning and complex spatial reasoning (CSR): |
|
|
|
| Data Usage | Checkpoint | Data Usage | Checkpoint | |
|
| ---------- | --------------------------------------------------------- | ---------- | ----------------------------------------------------------- | |
|
| 5% | [`nkkbr/ViCA-5p`](https://huggingface.co/nkkbr/ViCA-5p) | 55% | [`nkkbr/ViCA-55p`](https://huggingface.co/nkkbr/ViCA-55p) | |
|
| 10% | [`nkkbr/ViCA-10p`](https://huggingface.co/nkkbr/ViCA-10p) | 60% | [`nkkbr/ViCA-60p`](https://huggingface.co/nkkbr/ViCA-60p) | |
|
| 15% | [`nkkbr/ViCA-15p`](https://huggingface.co/nkkbr/ViCA-15p) | 65% | [`nkkbr/ViCA-65p`](https://huggingface.co/nkkbr/ViCA-65p) | |
|
| 20% | [`nkkbr/ViCA-20p`](https://huggingface.co/nkkbr/ViCA-20p) | 70% | [`nkkbr/ViCA-70p`](https://huggingface.co/nkkbr/ViCA-70p) | |
|
| 25% | [`nkkbr/ViCA-25p`](https://huggingface.co/nkkbr/ViCA-25p) | 75% | [`nkkbr/ViCA-75p`](https://huggingface.co/nkkbr/ViCA-75p) | |
|
| 30% | [`nkkbr/ViCA-30p`](https://huggingface.co/nkkbr/ViCA-30p) | 80% | [`nkkbr/ViCA-80p`](https://huggingface.co/nkkbr/ViCA-80p) | |
|
| 35% | [`nkkbr/ViCA-35p`](https://huggingface.co/nkkbr/ViCA-35p) | 85% | [`nkkbr/ViCA-85p`](https://huggingface.co/nkkbr/ViCA-85p) | |
|
| 40% | [`nkkbr/ViCA-40p`](https://huggingface.co/nkkbr/ViCA-40p) | 90% | [`nkkbr/ViCA-90p`](https://huggingface.co/nkkbr/ViCA-90p) | |
|
| 45% | [`nkkbr/ViCA-45p`](https://huggingface.co/nkkbr/ViCA-45p) | 95% | [`nkkbr/ViCA-95p`](https://huggingface.co/nkkbr/ViCA-95p) | |
|
| 50% | [`nkkbr/ViCA-50p`](https://huggingface.co/nkkbr/ViCA-50p) | 100% (This repo) | [`nkkbr/ViCA`](https://huggingface.co/nkkbr/ViCA) | |
|
|
|
Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_all_data/). |
|
|
|
### Base-only Subset (ViCA-322K: Base) |
|
|
|
This series is trained **only** on the base spatial reasoning subset of ViCA-322K, without any CSR examples: |
|
|
|
| Data Usage | Checkpoint | Data Usage | Checkpoint | |
|
| ---------- | ------------------------------------------------------------------- | ---------- | --------------------------------------------------------------------- | |
|
| 5% | [`nkkbr/ViCA-base-5p`](https://huggingface.co/nkkbr/ViCA-base-5p) | 55% | [`nkkbr/ViCA-base-55p`](https://huggingface.co/nkkbr/ViCA-base-55p) | |
|
| 10% | [`nkkbr/ViCA-base-10p`](https://huggingface.co/nkkbr/ViCA-base-10p) | 60% | [`nkkbr/ViCA-base-60p`](https://huggingface.co/nkkbr/ViCA-base-60p) | |
|
| 15% | [`nkkbr/ViCA-base-15p`](https://huggingface.co/nkkbr/ViCA-base-15p) | 65% | [`nkkbr/ViCA-base-65p`](https://huggingface.co/nkkbr/ViCA-base-65p) | |
|
| 20% | [`nkkbr/ViCA-base-20p`](https://huggingface.co/nkkbr/ViCA-base-20p) | 70% | [`nkkbr/ViCA-base-70p`](https://huggingface.co/nkkbr/ViCA-base-70p) | |
|
| 25% | [`nkkbr/ViCA-base-25p`](https://huggingface.co/nkkbr/ViCA-base-25p) | 75% | [`nkkbr/ViCA-base-75p`](https://huggingface.co/nkkbr/ViCA-base-75p) | |
|
| 30% | [`nkkbr/ViCA-base-30p`](https://huggingface.co/nkkbr/ViCA-base-30p) | 80% | [`nkkbr/ViCA-base-80p`](https://huggingface.co/nkkbr/ViCA-base-80p) | |
|
| 35% | [`nkkbr/ViCA-base-35p`](https://huggingface.co/nkkbr/ViCA-base-35p) | 85% | [`nkkbr/ViCA-base-85p`](https://huggingface.co/nkkbr/ViCA-base-85p) | |
|
| 40% | [`nkkbr/ViCA-base-40p`](https://huggingface.co/nkkbr/ViCA-base-40p) | 90% | [`nkkbr/ViCA-base-90p`](https://huggingface.co/nkkbr/ViCA-base-90p) | |
|
| 45% | [`nkkbr/ViCA-base-45p`](https://huggingface.co/nkkbr/ViCA-base-45p) | 95% | [`nkkbr/ViCA-base-95p`](https://huggingface.co/nkkbr/ViCA-base-95p) | |
|
| 50% | [`nkkbr/ViCA-base-50p`](https://huggingface.co/nkkbr/ViCA-base-50p) | 100% | [`nkkbr/ViCA-base`](https://huggingface.co/nkkbr/ViCA-base) | |
|
|
|
Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_only_base/). |
|
|
|
## Source-wise Checkpoints |
|
|
|
While the full **ViCA-322K** dataset was curated by us, the underlying videos and associated metadata are sourced from three distinct indoor video datasets: |
|
|
|
* **[ARKitScenes](https://machinelearning.apple.com/research/arkitscenes)** |
|
* **[ScanNet](http://www.scan-net.org)** |
|
* **[ScanNet++](https://kaldir.vc.in.tum.de/scannetpp/)** |
|
|
|
To better understand how each source contributes to model performance, we fine-tuned ViCA-7B on subsets of ViCA-322K that exclusively use data from each source. For each subset, we provide checkpoints trained with **10% increments** of the available data, from 10% to 100%. |
|
|
|
Corresponding **raw evaluation outputs** (e.g., `.json` predictions) are also provided for all checkpoints. |
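
Any of these checkpoints can be loaded exactly like the main model by swapping the repository id in the inference script at the end of this card, for example:

```python
from llava.model.builder import load_pretrained_model

# Example: the ScanNet-only model trained on 50% of its subset.
pretrained = "nkkbr/ViCA-ScanNet-50p"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, "llava_qwen", torch_dtype="bfloat16", device_map="auto"
)
model.eval()
```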
|
|
|
### ARKitScenes-Only Checkpoints |
|
|
|
| Data Usage | Checkpoint | Data Usage | Checkpoint | |
|
| ---------- | --------------------------------------------------------------------------------- | ---------- | ----------------------------------------------------------------------------------- | |
|
| 10% | [`nkkbr/ViCA-ARKitScenes-10p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-10p) | 60% | [`nkkbr/ViCA-ARKitScenes-60p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-60p) | |
|
| 20% | [`nkkbr/ViCA-ARKitScenes-20p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-20p) | 70% | [`nkkbr/ViCA-ARKitScenes-70p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-70p) | |
|
| 30% | [`nkkbr/ViCA-ARKitScenes-30p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-30p) | 80% | [`nkkbr/ViCA-ARKitScenes-80p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-80p) | |
|
| 40% | [`nkkbr/ViCA-ARKitScenes-40p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-40p) | 90% | [`nkkbr/ViCA-ARKitScenes-90p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-90p) | |
|
| 50% | [`nkkbr/ViCA-ARKitScenes-50p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-50p) | 100% | [`nkkbr/ViCA-ARKitScenes`](https://huggingface.co/nkkbr/ViCA-ARKitScenes) | |
|
|
|
🔗 Raw evaluation outputs: [ARKitScenes results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_arkitscenes/) |
|
|
|
### ScanNet++-Only Checkpoints |
|
|
|
| Data Usage | Checkpoint | Data Usage | Checkpoint | |
|
| ---------- | ----------------------------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------- | |
|
| 10% | [`nkkbr/ViCA-ScanNetPP-10p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-10p) | 60% | [`nkkbr/ViCA-ScanNetPP-60p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-60p) | |
|
| 20% | [`nkkbr/ViCA-ScanNetPP-20p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-20p) | 70% | [`nkkbr/ViCA-ScanNetPP-70p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-70p) | |
|
| 30% | [`nkkbr/ViCA-ScanNetPP-30p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-30p) | 80% | [`nkkbr/ViCA-ScanNetPP-80p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-80p) | |
|
| 40% | [`nkkbr/ViCA-ScanNetPP-40p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-40p) | 90% | [`nkkbr/ViCA-ScanNetPP-90p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-90p) | |
|
| 50% | [`nkkbr/ViCA-ScanNetPP-50p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-50p) | 100% | [`nkkbr/ViCA-ScanNetPP`](https://huggingface.co/nkkbr/ViCA-ScanNetPP) | |
|
|
|
🔗 Raw evaluation outputs: [ScanNet++ results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_scannetpp/) |
|
|
|
### ScanNet-Only Checkpoints |
|
|
|
| Data Usage | Checkpoint | Data Usage | Checkpoint | |
|
| ---------- | ------------------------------------------------------------------------- | ---------- | --------------------------------------------------------------------------- | |
|
| 10% | [`nkkbr/ViCA-ScanNet-10p`](https://huggingface.co/nkkbr/ViCA-ScanNet-10p) | 60% | [`nkkbr/ViCA-ScanNet-60p`](https://huggingface.co/nkkbr/ViCA-ScanNet-60p) | |
|
| 20% | [`nkkbr/ViCA-ScanNet-20p`](https://huggingface.co/nkkbr/ViCA-ScanNet-20p) | 70% | [`nkkbr/ViCA-ScanNet-70p`](https://huggingface.co/nkkbr/ViCA-ScanNet-70p) | |
|
| 30% | [`nkkbr/ViCA-ScanNet-30p`](https://huggingface.co/nkkbr/ViCA-ScanNet-30p) | 80% | [`nkkbr/ViCA-ScanNet-80p`](https://huggingface.co/nkkbr/ViCA-ScanNet-80p) | |
|
| 40% | [`nkkbr/ViCA-ScanNet-40p`](https://huggingface.co/nkkbr/ViCA-ScanNet-40p) | 90% | [`nkkbr/ViCA-ScanNet-90p`](https://huggingface.co/nkkbr/ViCA-ScanNet-90p) | |
|
| 50% | [`nkkbr/ViCA-ScanNet-50p`](https://huggingface.co/nkkbr/ViCA-ScanNet-50p) | 100% | [`nkkbr/ViCA-ScanNet`](https://huggingface.co/nkkbr/ViCA-ScanNet) | |
|
|
|
🔗 Raw evaluation outputs: [ScanNet results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_scannet/) |
|
|
|
## Additional Probing |
|
|
|
### Time Instructions |
|
|
|
Including the 64 frame timestamps in the prompt slightly **hurts** performance, suggesting that the model fails to leverage the temporal alignment and is instead penalized by the added instruction verbosity.
|
|
|
<p align="center"> |
|
<img src="assets/table3.png" width="400"/> |
|
</p> |
|
|
|
<p align="center"><b>Figure 5:</b> Adding explicit frame timestamps (64 values) degrades model performance on VSI-Bench, indicating an inability to exploit temporal alignment and sensitivity to prompt length.</p> |
|
|
|
--- |
|
|
|
### More Frames |
|
|
|
Increasing input from 64 to 128 frames doubles the number of visual tokens (13,440 → 26,880) but yields **no performance gain**, highlighting overfitting to fixed token length and architectural inflexibility. |
|
|
|
<p align="center"> |
|
<img src="assets/table2.png" width="400"/> |
|
</p> |
|
|
|
<p align="center"><b>Figure 6:</b> Comparison between 64-frame and 128-frame inputs. Despite doubling the visual token count, performance remains unchanged, indicating overfitting to fixed-length input and limited adaptability to variable-length sequences.</p> |
|
|
|
## Potential Applications |
|
|
|
ViCA-7B supports a broad range of spatially grounded multimodal applications: |
|
- Indoor navigation assistants |
|
- Robotics planning and spatial querying |
|
- Smart room arrangement and AR layout analysis |
|
- Scene understanding for embodied AI agents |
|
|
|
## Known Limitations |
|
|
|
- Limited temporal reasoning: Time instructions not effectively utilized |
|
- Frame scaling issues: Models expect fixed input lengths |
|
- No depth/point cloud: Only RGB video input supported |
|
- Zero-shot generalization is good, but not task-agnostic |
|
|
|
## Download |
|
|
|
You can optionally download the model weights to a local directory:
|
|
|
```python |
|
from huggingface_hub import snapshot_download

# Where to store the weights and the intermediate download cache.
save_dir = "./ViCA"
repo_id = "nkkbr/ViCA"
cache_dir = save_dir + "/cache"

# Download a full snapshot of the model repository into `save_dir`.
snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,  # write real files rather than symlinks
    resume_download=True,          # pick up where an interrupted download left off
)
|
``` |
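
If you download the weights this way, the local directory (here `./ViCA`) can be passed as `pretrained` in the inference script below in place of the Hub repo id.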
|
|
|
## Inference |
|
|
|
*Here is a runnable example using ViCA-7B on a VSI-Bench question.* |
|
|
|
```python |
|
# This inference script is adapted from: |
|
# https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2 |
|
|
|
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git |
|
from llava.model.builder import load_pretrained_model |
|
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token |
|
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX |
|
from llava.conversation import conv_templates, SeparatorStyle |
|
import copy
import warnings

import numpy as np
import torch
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")
|
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Decode a video and uniformly sample frames, returning frames, timestamps, and duration."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Resample to exactly `max_frames_num` uniformly spaced frames.
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
|
pretrained = "nkkbr/ViCA"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
# Any additional llava_model_args can be passed to load_pretrained_model here.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
)
model.eval()
|
|
|
|
|
from datasets import load_dataset |
|
vsi_bench = load_dataset("nyu-visionx/VSI-Bench") |
|
vsi_bench = vsi_bench['test'] |
|
|
|
# Pick one VSI-Bench example (index 1000 is arbitrary).
data_curr = vsi_bench[1000]

video_path = "[VIDEO PATH]"  # set this to the local path of the corresponding VSI-Bench video
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
video = [video]
|
conv_template = "qwen_1_5" |
|
# time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
time_instruction = ""

question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n\n"
question += "These are frames of a video.\n\n"
question += f"Question: {data_curr['question']}\n"
if data_curr['options'] is not None:
    question += '\n'.join(data_curr['options']) + "\n"
    question += "Answer with the option’s letter from the given choices directly.\n"
else:
    question += "Please answer the question using a single word or phrase.\n"
|
print(f"Prompt:\n{question}") |
|
|
|
conv = copy.deepcopy(conv_templates[conv_template]) |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt_question = conv.get_prompt() |
|
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device) |
|
|
|
# Greedy decoding over the video frames plus the question prompt.
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=1024,
)
|
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip() |
|
|
|
print(repr(text_outputs)) |
|
``` |
|
|
|
--- |
|
|
|
|