Model Summary

Video-CCAM-4B-v1.1 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team.

Note: Unlike the previous Video-CCAM-4B, this model is developed from the latest version of Phi-3-mini-4k-instruct.

Usage

Inference uses Hugging Face transformers on NVIDIA GPUs. The requirements below are tested with Python 3.9/3.10.

pip install -U pip torch transformers peft decord pysubs2 imageio
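
The inference example below passes _attn_implementation='flash_attention_2', which also requires the flash-attn package (not included in the command above). A typical install, offered here as an assumption since it is not part of the original instructions, is:

pip install flash-attn --no-build-isolation

Alternatively, removing that argument should fall back to the default attention implementation.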

Inference

import os
import torch
from PIL import Image
from transformers import AutoModel

from eval import load_decord  # frame-sampling helper provided by the Video-CCAM repository

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

videoccam = AutoModel.from_pretrained(
    '<your_local_path_1>',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    _attn_implementation='flash_attention_2',
    # Optionally point to local copies of the language model and vision encoder:
    # llm_name_or_path='<your_local_llm_path>',
    # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
)


# Batched chat: one conversation per sample; <image>/<video> mark where the visual input is inserted.
messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\nDescribe this video in detail.'
        }
    ]
]

images = [
    Image.open('assets/example_image.jpg').convert('RGB'),
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)  # uniformly sample 32 frames
]

response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)

print(response)
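
For reference, the sketch below shows one way to sample frames uniformly with decord directly, in case you do not want to import the helper from the repository's eval.py. It is only an assumption about what load_decord(..., sample_type='uniform') does; the actual helper may return a different format.

import numpy as np
from PIL import Image
from decord import VideoReader, cpu

def sample_frames_uniform(video_path, num_frames=32):
    # Hypothetical stand-in for eval.load_decord with sample_type='uniform':
    # pick num_frames indices spread evenly over the clip and return PIL images.
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8
    return [Image.fromarray(frame) for frame in frames]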

Please refer to the Video-CCAM repository for more details.

Benchmarks

| Benchmark | Video-CCAM-4B | Video-CCAM-4B-v1.1 |
|---|---|---|
| MVBench (32 frames) | 57.43 | 62.80 |
| MSVD-QA (32 frames) | 75.0/4.0 | 76.9/4.1 |
| MSRVTT-QA (32 frames) | 57.6/3.5 | 64.4/3.7 |
| ActivityNet-QA (32 frames) | 53.3/3.6 | 58.0/3.7 |
| TGIF-QA (32 frames) | 83.7/4.4 | 83.0/4.4 |
| Video-MME (w/o sub, 96 frames) | 49.7 | 50.1 |
| Video-MME (w sub, 96 frames) | 52.8 | 51.2 |
| MLVU (M-Avg, 96 frames) | 57.3 | 56.5 |
| MLVU (G-Avg, 96 frames) | 3.83 | 4.09 |
| VideoVista (96 frames) | 68.09 | 70.82 |
  • The MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA results are reported as accuracy/score and are evaluated by gpt-3.5-turbo-0125.

Acknowledgement

  • xtuner: Video-CCAM-4B-v1.1 is trained using the xtuner framework. Thanks for their excellent work!
  • Phi-3-mini-4k-instruct: A powerful language model developed by Microsoft.
  • SigLIP SO400M: Outstanding vision encoder developed by Google.

License

The model is licensed under the MIT license.
