---
library_name: peft
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
|
|
|
# video-SALMONN 2+ (Qwen 2.5-VL Based video-SALMONN 2)
|
|
|
video-SALMONN 2+ is the Qwen 2.5-VL version of video-SALMONN 2. Thanks to the stronger baseline and some minor optimizations, video-SALMONN 2+ achieves state-of-the-art (SOTA) results on the [Video-MME](https://video-mme.github.io/home_page.html) benchmark.
|
|
|
[GitHub Link](https://github.com/bytedance/video-SALMONN-2)

[Paper Link](https://arxiv.org/abs/2506.15220)
|
|
|
## Results
|
|
|
Video-MME results (reported as w/o subtitles / w/ subtitles):
|
|
|
| **7B Model**         | **Short**         | **Medium**        | **Long**      | **Avg**           |
| -------------------- | ----------------- | ----------------- | ------------- | ----------------- |
| LinVT 7B             | 79.0/71.7         | 71.6/68.7         | 63.2/63.3     | 70.3/71.7         |
| VideoLLaMA3 7B       | **80.1**/**80.2** | 63.7/69.6         | 54.9/61.0     | 66.2/70.3         |
| Qwen 2.5-VL 7B       | -                 | -                 | -             | -                 |
| video-SALMONN 2+ 7B  | 79.0/79.4         | **72.1**/**73.1** | 62.3/**63.9** | **71.1**/**72.1** |
| **Larger Model**     |                   |                   |               |                   |
| GPT-4o               | 80.0/82.8         | 70.3/76.6         | 65.3/72.1     | 71.9/77.2         |
| Gemini-1.5-pro       | 81.7/84.5         | 74.3/**81.0**     | 67.4/**77.4** | 75.0/**81.3**     |
| Qwen 2.5-VL 72B      | -                 | -                 | -             | 73.3/79.1         |
| video-SALMONN 2+ 72B | **84.3**/**85.1** | **79.4**/79.7     | **71.2**/72.0 | **78.3**/78.9     |
|
|
|
Results on other benchmarks:
|
|
|
| **Model**            | **MLVU** | **LongVideoBench** | **DailyOmni** | **VideoHolmes** |
| -------------------- | -------- | ------------------ | ------------- | --------------- |
| GPT-4o               | 64.6     | 66.7               | 56.47         | 42.0(32)        |
| Gemini-1.5-pro       | -        | 64.0               | -             | 41.2            |
| Qwen 2.5-VL 72B      | 75.1     | **67.4**           | 61.82         | 50.2            |
| video-SALMONN 2+ 72B | **77.8** | 66.4               | **69.84**     | **55.6**        |
|
|
|
## How to Use
|
|
|
1. Prepare the dataset following `scripts/example_av.json`, `scripts/example_v.json`, `scripts/example_dpo.json`, and `scripts/example_a.json`.
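The example files above are the ground truth for the schema. As a rough, hypothetical illustration only (the field names below are assumptions, not copied from the repository), creating an audio-visual entry might look like:

```bash
# Hypothetical sketch of a single audio-visual training entry; the field
# names are assumptions. Follow scripts/example_av.json for the real schema.
cat > my_av_dataset.json << 'EOF'
[
  {
    "video": "videos/clip_0001.mp4",
    "audio": "audios/clip_0001.wav",
    "conversations": [
      {"from": "human", "value": "<video>\nDescribe what happens in this clip."},
      {"from": "gpt", "value": "A person opens a laptop and starts typing."}
    ]
  }
]
EOF
```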
|
2. Prepare the base audio model by modifying the path in `gen_audio_model.py` and running the script.
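Assuming the script needs no extra command-line arguments once its internal path is edited (an assumption; check `gen_audio_model.py` itself), this step is simply:

```bash
# Assumption: after pointing the path inside gen_audio_model.py at your local
# Qwen2.5-VL checkpoint, the script is invoked directly.
python gen_audio_model.py
```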
|
3. To conduct audio alignment, use the following script:
|
```bash
bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000
```
|
4. To conduct audio-visual SFT, use the following script:
|
```bash
bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256
```
|
5. To conduct DPO, use the following script:
|
```bash
bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
```
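The placeholder paths in steps 2-5 chain into one another. Below is a minimal sketch of that wiring, using the DPO command as an example; the variable values are hypothetical, and the actual checkpoint locations depend on `--run_name` and where `scripts/train.sh` saves its outputs:

```bash
# Hypothetical locations; substitute the real outputs of each stage.
AUDIO_MODEL=ckpts/audio_base         # step 2: produced by gen_audio_model.py (model/model_base in step 3)
ALIGN_MODEL=ckpts/audio_alignment    # step 3 output (audio_align_model above)
AV_BASE=ckpts/audio_visual_base      # base weights the DPO stage starts from (audio_visual_base above)
AV_SFT_LORA=ckpts/av_sft             # step 4 LoRA checkpoint (audio_visual_checkpoint above)

bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset \
  --max_frames 768 --max_pixels 61250 --model "$AV_BASE" --model_base "$ALIGN_MODEL" \
  --lora_ckpt "$AV_SFT_LORA" --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 \
  --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
```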
|
6. To evaluate the 7B model, use the following script:
|
```bash
bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
```
|
7. To evaluate the 72B model, use the following script:
|
```bash
bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
```