---
license: cc-by-nc-4.0
datasets:
- THUdyh/Oryx-SFT-Data
language:
- en
- zh
metrics:
- accuracy
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen2.5-7B-Instruct
library_name: transformers
model-index:
- name: llava-onevision-qwen-7b-ov
  results:
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 62.425
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: NextQA
      type: nextqa
    metrics:
    - type: accuracy
      value: 81.33
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 58.08
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 57.96
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 62.48
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMMMU
      type: videommmu
    metrics:
    - type: accuracy
      value: 40.55
      name: accuracy
      verified: true
tags:
- llava
- llava-scissor
- llava-onevision
- llava-ov
- token-compression
---

# LLaVA-Scissor-baseline-7B

## Model Summary

This repository contains the baseline model used in LLaVA-Scissor. It is an enhanced version of the [LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) model that pairs the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder with the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) large language model, and it is fine-tuned on the [Oryx](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data) SFT data.

## Quick Start

Here we provide a script for full-token inference with LLaVA-Scissor (i.e., without token compression).

```python
from operator import attrgetter
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import torch
import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")

# Load the OneVision model
pretrained = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")
model.eval()


# Function to extract uniformly sampled frames from a video
def load_video(video_path, max_frames_num):
    if type(video_path) == str:
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)


# Load and process video
video_path = "Your/path/to/the/video"
video_frames = load_video(video_path, 16)
print(video_frames.shape)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)

# Prepare conversation input
conv_template = "qwen_2"
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]

# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
```
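The same checkpoint can also be queried with a single image. The sketch below reuses the `tokenizer`, `model`, and `image_processor` loaded above and follows the standard LLaVA-OneVision image-inference interface; the image URL, prompt, and `max_new_tokens` value are illustrative placeholders, and the call signature is assumed to carry over unchanged to this baseline.

```python
# Minimal image-inference sketch (assumes tokenizer, model, and image_processor
# from the video script above are already loaded). URL and prompt are placeholders.
import copy
import requests
import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

url = "https://example.com/your_image.jpg"  # replace with your own image
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image with the model's own config and move it to the GPU.
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_img.to(dtype=torch.float16, device="cuda") for _img in image_tensor]

# Build the prompt with the same conversation template as the video example.
conv = copy.deepcopy(conv_templates["qwen_2"])
conv.append_message(conv.roles[0], f"{DEFAULT_IMAGE_TOKEN}\nWhat is shown in this image?")
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to("cuda")
image_sizes = [image.size]  # PIL gives (width, height)

# Generate; modalities is omitted here because the upstream default is image input.
cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(cont, skip_special_tokens=True)[0])
```

Only the preprocessing and `image_sizes` bookkeeping differ from the video path; the conversation template and generation call stay the same.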
## Citation

If you find our repo useful for your research, please consider citing our paper:

```bibtex
@article{sun2025llava,
  title={LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs},
  author={Sun, Boyuan and Zhao, Jiaxing and Wei, Xihan and Hou, Qibin},
  journal={arXiv preprint arXiv:2506.21862},
  year={2025}
}
```