Video-CCAM-v1.1
The upgraded version of Video-CCAM.
Video-CCAM-4B-v1.1 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team.
Note: Unlike the previous Video-CCAM-4B, this model is built on the latest version of Phi-3-mini-4k-instruct.
Inference uses Hugging Face transformers on NVIDIA GPUs. The requirements below were tested on Python 3.9 and 3.10.
pip install -U pip torch transformers peft decord pysubs2 imageio
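Before loading the model, it can help to confirm that the packages from the command above are actually installed in the active environment. The following is a minimal sketch, not part of the Video-CCAM code: the helper name `check_environment` is hypothetical, and it only reports versions without checking compatibility.

```python
import sys
from importlib import metadata

# Packages named in the pip command above (pip itself is omitted).
REQUIRED = ["torch", "transformers", "peft", "decord", "pysubs2", "imageio"]


def check_environment(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, version in check_environment().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before running the inference example.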
import os

import torch
from PIL import Image
from transformers import AutoModel

# load_decord is imported from a local `eval` module, presumably eval.py in the
# Video-CCAM repository; run this script from a directory where it is importable.
from eval import load_decord

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

videoccam = AutoModel.from_pretrained(
    '<your_local_path_1>',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    # flash_attention_2 needs the separate flash-attn package,
    # which the pip command above does not install.
    _attn_implementation='flash_attention_2',
    # llm_name_or_path='<your_local_llm_path>',
    # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
)

# One conversation per visual input: the first uses <image>, the second <video>.
messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\nDescribe this video in detail.'
        }
    ]
]

# Visual inputs, in the same order as the conversations in `messages`.
images = [
    Image.open('assets/example_image.jpg').convert('RGB'),
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)
print(response)
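The example above asks `load_decord` for 32 uniformly sampled frames. As an illustration of what uniform sampling usually means, the sketch below picks the middle frame of each of `num_frames` equal-length segments. This is an assumption about the general technique, not the actual `load_decord` implementation, which may choose indices differently; the function name `uniform_frame_indices` is hypothetical.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick the middle frame index of each of `num_frames` equal segments."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Short clips: return every frame rather than repeating indices.
    if num_frames >= total_frames:
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]


if __name__ == "__main__":
    print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

With a 320-frame clip and `num_frames=32`, this yields indices 5, 15, ..., 315, so the samples cover the whole video evenly.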
Please refer to Video-CCAM for more details.
Benchmark | Video-CCAM-4B | Video-CCAM-4B-v1.1 |
---|---|---|
MVBench (32 frames) | 57.43 | 62.80 |
MSVD-QA (32 frames) | 75.0/4.0 | 76.9/4.1 |
MSRVTT-QA (32 frames) | 57.6/3.5 | 64.4/3.7 |
ActivityNet-QA (32 frames) | 53.3/3.6 | 58.0/3.7 |
TGIF-QA (32 frames) | 83.7/4.4 | 83.0/4.4 |
Video-MME (w/o sub, 96 frames) | 49.7 | 50.1 |
Video-MME (w sub, 96 frames) | 52.8 | 51.2 |
MLVU (M-Avg, 96 frames) | 57.3 | 56.5 |
MLVU (G-Avg, 96 frames) | 3.83 | 4.09 |
VideoVista (96 frames) | 68.09 | 70.82 |
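To make the comparison between the two versions easier to read, the differences can be computed directly from the table. The sketch below copies the table values verbatim and subtracts the Video-CCAM-4B column from the Video-CCAM-4B-v1.1 column; cells of the form `a/b` are treated as two separate numbers. The names `RESULTS`, `parse_cell`, and `deltas` are illustrative only.

```python
# Values copied from the benchmark table: (Video-CCAM-4B, Video-CCAM-4B-v1.1).
RESULTS = {
    "MVBench (32 frames)": ("57.43", "62.80"),
    "MSVD-QA (32 frames)": ("75.0/4.0", "76.9/4.1"),
    "MSRVTT-QA (32 frames)": ("57.6/3.5", "64.4/3.7"),
    "ActivityNet-QA (32 frames)": ("53.3/3.6", "58.0/3.7"),
    "TGIF-QA (32 frames)": ("83.7/4.4", "83.0/4.4"),
    "Video-MME (w/o sub, 96 frames)": ("49.7", "50.1"),
    "Video-MME (w sub, 96 frames)": ("52.8", "51.2"),
    "MLVU (M-Avg, 96 frames)": ("57.3", "56.5"),
    "MLVU (G-Avg, 96 frames)": ("3.83", "4.09"),
    "VideoVista (96 frames)": ("68.09", "70.82"),
}


def parse_cell(cell):
    """Split a table cell such as '75.0/4.0' into a tuple of floats."""
    return tuple(float(part) for part in cell.split("/"))


def deltas(results):
    """Return v1.1 minus v1 for every benchmark, rounded to two decimals."""
    return {
        name: tuple(round(n - o, 2) for o, n in zip(parse_cell(old), parse_cell(new)))
        for name, (old, new) in results.items()
    }


if __name__ == "__main__":
    for name, diff in deltas(RESULTS).items():
        print(f"{name}: {diff}")
```

On these values, v1.1 gains 5.37 points on MVBench and drops 1.6 points on Video-MME with subtitles.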
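The accuracy/score pairs for the open-ended QA benchmarks (MSVD-QA, MSRVTT-QA, ActivityNet-QA, TGIF-QA) are evaluated by gpt-3.5-turbo-0125.
The model is licensed under the MIT license.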