---
library_name: peft
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# video-SALMONN 2+ (Qwen 2.5-VL Based video-SALMONN 2)

video-SALMONN 2+ is the Qwen 2.5-VL version of video-SALMONN 2. Built on a stronger baseline with some minor optimizations, video-SALMONN 2+ achieves SOTA results on the [Video-MME](https://video-mme.github.io/home_page.html) benchmark.

[GitHub Link](https://github.com/bytedance/video-SALMONN-2)

[Paper Link](https://arxiv.org/abs/2506.15220)

## Results

Video-MME (without subtitles / with subtitles)

| **7B Model**         | **Short**         | **Medium**        | **Long**      | **Avg**           |
| -------------------- | ----------------- | ----------------- | ------------- | ----------------- |
| LinVT 7B             | 79.0/71.7         | 71.6/68.7         | 63.2/63.3     | 70.3/71.7         |
| VideoLLaMA3 7B       | **80.1**/**80.2** | 63.7/69.6         | 54.9/61.0     | 66.2/70.3         |
| Qwen 2.5-VL 7B       | -                 | -                 | -             |                   |
| video-SALMONN 2+ 7B  | 79.0/79.4         | **72.1**/**73.1** | 62.3/**63.9** | **71.1**/**72.1** |
| **Larger Model**     |                   |                   |               |                   |
| GPT-4o               | 80.0/82.8         | 70.3/76.6         | 65.3/72.1     | 71.9/77.2         |
| Gemini-1.5-pro       | 81.7/84.5         | 74.3/**81.0**     | 67.4/**77.4** | 75.0/**81.3**     |
| Qwen 2.5-VL 72B      | -                 | -                 | -             | 73.3/79.1         |
| video-SALMONN 2+ 72B | **84.3**/**85.1** | **79.4**/79.7     | **71.2**/72.0 | **78.3**/78.9     |

Other benchmarks

| **Model**            | **MLVU** | **LongVideoBench** | **DailyOmni** | **VideoHolmes** |
| -------------------- | -------- | ------------------ | ------------- | --------------- |
| GPT-4o               | 64.6     | 66.7               | 56.47         | 42.0(32)        |
| Gemini-1.5-pro       | -        | 64.0               | -             | 41.2            |
| Qwen 2.5-VL 72B      | 75.1     | **67.4**           | 61.82         | 50.2            |
| video-SALMONN 2+ 72B | **77.8** | 66.4               | **69.84**     | **55.6**        |

## How to Use

1. Prepare the dataset following the examples in `scripts/example_av.json`, `scripts/example_v.json`, `scripts/example_dpo.json`, and `scripts/example_a.json`.
2. Prepare the base audio model by modifying the path in `gen_audio_model.py`.
3. To run audio alignment, use the following script:
   ```bash
   bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000
   ```
4. To run audio-visual SFT, use the following script:
   ```bash
   bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256
   ```
5. To run DPO, use the following script:
   ```bash
   bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
   ```
6. To evaluate the 7B model, use the following script:
   ```bash
   bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
   ```
7. To evaluate the 72B model, use the following script:
   ```bash
   bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
   ```
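
The three training commands in steps 3 to 5 repeat several flags, and their placeholder names suggest that each stage consumes the output of the previous one. The following is a minimal dry-run sketch that only prints the three commands so they can be reviewed before launching. `build_cmd` is a hypothetical helper, not part of the repository, and all paths are the same placeholders used above. The flags appear in a different order than in the original commands, on the assumption that `scripts/train.sh` does not depend on flag order.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Flags that appear unchanged in every command in the steps above.
COMMON_FLAGS=(--interval 0.1 --max_frames 768 --max_pixels 61250)
# LoRA flags shared by the SFT and DPO stages.
LORA_FLAGS=(--use_lora --lora_r 128 --lora_alpha 256)

# Placeholders copied from the steps above; replace with real paths.
DATASET="path_to_dataset"
AUDIO_MODEL="path_to_audio_model"

# build_cmd is a hypothetical helper (not part of the repository).
# It prints one training command as a single line and runs nothing.
build_cmd() {
  local run_name="$1"; shift
  echo "bash scripts/train.sh ${COMMON_FLAGS[*]} --run_name ${run_name} --dataset ${DATASET} $*"
}

# Stage 1: audio alignment (step 3).
build_cmd audio_alignment --lr 2e-5 --train_qformer \
  --model "$AUDIO_MODEL" --model_base "$AUDIO_MODEL" \
  --bs 16 --epoch 5 --save_steps 5000

# Stage 2: audio-visual SFT (step 4).
build_cmd av_sft --lr 2e-5 --train_qformer --train_proj \
  --model audio_align_model --model_base "$AUDIO_MODEL" \
  --epoch 5 --save_steps 2000 "${LORA_FLAGS[@]}"

# Stage 3: DPO (step 5).
build_cmd dpo --model audio_visual_base --model_base audio_align_model \
  --lora_ckpt audio_visual_checkpoint --train_type gdpo "${LORA_FLAGS[@]}" \
  --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
```

Each printed line can be pasted into a terminal once the placeholders point at real datasets and checkpoints.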