# video-SALMONN 2+ (Qwen 2.5-VL Based video-SALMONN 2)
video-SALMONN 2+ is the Qwen 2.5-VL version of video-SALMONN 2. Built on this stronger baseline with some minor optimizations, video-SALMONN 2+ achieves state-of-the-art (SOTA) results on the Video-MME benchmark.
## Results
### Video-MME (w/o sub / w/ sub)
| 7B Model | Short | Medium | Long | Avg |
|---|---|---|---|---|
| LinVT 7B | 79.0/71.7 | 71.6/68.7 | 63.2/63.3 | 70.3/71.7 |
| VideoLLaMA3 7B | 80.1/80.2 | 63.7/69.6 | 54.9/61.0 | 66.2/70.3 |
| Qwen 2.5-VL 7B | - | - | - | - |
| video-SALMONN 2+ 7B | 79.0/79.4 | 72.1/73.1 | 62.3/63.9 | 71.1/72.1 |
| **Larger Model** | | | | |
| GPT-4o | 80.0/82.8 | 70.3/76.6 | 65.3/72.1 | 71.9/77.2 |
| Gemini-1.5-pro | 81.7/84.5 | 74.3/81.0 | 67.4/77.4 | 75.0/81.3 |
| Qwen 2.5-VL 72B | - | - | - | 73.3/79.1 |
| video-SALMONN 2+ 72B | 84.3/85.1 | 79.4/79.7 | 71.2/72.0 | 78.3/78.9 |
### Other benchmarks
| Model | MLVU | LongVideoBench | DailyOmni | VideoHolmes |
|---|---|---|---|---|
| GPT-4o | 64.6 | 66.7 | 56.47 | 42.0(32) |
| Gemini-1.5-pro | - | 64.0 | - | 41.2 |
| Qwen 2.5-VL 72B | 75.1 | 67.4 | 61.82 | 50.2 |
| video-SALMONN 2+ 72B | 77.8 | 66.4 | 69.84 | 55.6 |
## How to Use
- Prepare the dataset following `scripts/example_av.json`, `scripts/example_v.json`, `scripts/example_dpo.json`, and `scripts/example_a.json`.
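Before launching training, it can help to sanity-check that the dataset manifests parse. A minimal sketch, assuming each example JSON file holds a top-level list of sample records (the paths below mirror the example files above; the layout assumption and the `check_manifest` helper are illustrative, not part of the repo):

```python
import json

def check_manifest(path):
    """Load a dataset manifest and report how many samples it holds.

    Assumes the file is a top-level JSON list of sample records; if the
    repo's example files use a different layout, adapt accordingly.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON list")
    print(f"{path}: {len(data)} samples")
    return data

# Hypothetical usage -- point these at your copies of the example files.
for manifest in ["scripts/example_av.json", "scripts/example_v.json",
                 "scripts/example_dpo.json", "scripts/example_a.json"]:
    try:
        check_manifest(manifest)
    except FileNotFoundError:
        print(f"{manifest}: not found (skipping)")
```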
- Prepare the base audio model by modifying the path in `gen_audio_model.py`.
- To conduct audio alignment, use the following script:

  ```shell
  bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000
  ```

- To conduct audio-visual SFT, use the following script:

  ```shell
  bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256
  ```

- To conduct DPO, use the following script:

  ```shell
  bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
  ```

- To evaluate the 7B model, use the following script:

  ```shell
  bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
  ```

- To evaluate the 72B model, use the following script:

  ```shell
  bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
  ```
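All of the commands above cap the visual input with `--max_frames 768` and `--max_pixels 61250`. For a rough intuition of what that pixel budget means for a single frame (assuming, as in Qwen 2.5-VL's image processor, that the cap applies per frame), here is an illustrative sketch; it is not the repo's actual preprocessing, which also rounds sizes to patch multiples:

```python
import math

def fit_to_pixel_budget(width, height, max_pixels=61250):
    """Scale (width, height) down, preserving aspect ratio, so that
    width * height <= max_pixels. Illustrative only -- real Qwen 2.5-VL
    preprocessing additionally snaps dimensions to patch-size multiples."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return int(width * scale), int(height * scale)

# A 1080p frame shrinks to roughly 330 x 185 under the 61250-pixel budget,
# which is what keeps the 768-frame cap tractable.
w, h = fit_to_pixel_budget(1920, 1080)
print(w, h)
```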
Model: tsinghua-ee/video-SALMONN-2_plus_7B (base model: Qwen/Qwen2.5-VL-7B-Instruct)