danielhanchen committed
Commit 9d95bae · verified · 1 Parent(s): c8bd732

Add files using upload-large-folder tool
README.md CHANGED
@@ -1,59 +1,24 @@
1
  ---
2
- base_model: Qwen/Qwen2.5-VL-3B-Instruct
3
  language:
4
  - en
5
- library_name: transformers
6
  pipeline_tag: image-text-to-text
7
- license: apache-2.0
8
  tags:
9
  - multimodal
10
- - qwen
11
- - qwen2
12
  - unsloth
13
- - transformers
14
- - vision
15
  ---
16
- <div>
17
- <p style="margin-bottom: 0;margin-top:0;">
18
- <em>View all of our uploaded models <a href="https://docs.unsloth.ai/get-started/all-our-models">here</em>
19
- </p>
20
- <div style="display: flex; gap: 5px; align-items: center;margin-top:0; ">
21
- <a href="https://github.com/unslothai/unsloth/">
22
- <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
23
- </a>
24
- <a href="https://discord.gg/unsloth">
25
- <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
26
- </a>
27
- <a href="https://docs.unsloth.ai/">
28
- <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
29
- </a>
30
- </div>
31
- <h1 style="margin-top: 0rem;">Finetune LLMs 2-5x faster with 70% less memory via Unsloth</h2>
32
- </div>
33
- We have a free Google Colab Tesla T4 notebook for Qwen2-VL (7B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb
34
-
35
- ## ✨ Finetune for Free
36
-
37
- All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
38
-
39
- | Unsloth supports | Free Notebooks | Performance | Memory use |
40
- |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
41
- | **Llama-3.2 (3B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2.4x faster | 58% less |
42
- | **Llama-3.2 (11B vision)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 60% less |
43
- | **Qwen2 VL (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) | 1.8x faster | 60% less |
44
- | **Qwen2.5 (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 60% less |
45
- | **Llama-3.1 (8B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2.4x faster | 58% less |
46
- | **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb) | 2x faster | 50% less |
47
- | **Gemma 2 (9B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2.4x faster | 58% less |
48
- | **Mistral (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 62% less |
49
-
50
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)
51
-
52
- - This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
53
- - This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
54
- - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
55
-
56
- # Qwen2.5-VL
57
 
58
  ## Introduction
59
 
@@ -87,7 +52,7 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
87
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
88
 
89
 
90
- We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
91
 
92
 
93
 
@@ -95,50 +60,45 @@ We have three models with 3, 7 and 72 billion parameters. This repo contains the
95
 
96
  ### Image benchmark
97
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
98
 
99
- | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B |**Qwen2.5-VL-7B** |
100
- | :--- | :---: | :---: | :---: | :---: | :---: |
101
- | MMMU<sub>val</sub> | 56 | 50.4 | **60**| 54.1 | 58.6|
102
- | MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6| 30.5 | 41.0|
103
- | DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
104
- | InfoVQA<sub>test</sub> | 77.6 | - | - |76.5 | **82.6** |
105
- | ChartQA<sub>test</sub> | 84.8 | - |- | 83.0 |**87.3** |
106
- | TextVQA<sub>val</sub> | 79.1 | 80.1 | -| 84.3 | **84.9**|
107
- | OCRBench | 822 | 852 | 785 | 845 | **864** |
108
- | CC_OCR | 57.7 | | | 61.6 | **77.8**|
109
- | MMStar | 62.8| | |60.7| **63.9**|
110
- | MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0| 80.7 | **82.6** |
111
- | MMT-Bench<sub>test</sub> | - | - | - |**63.7** |63.6 |
112
- | MMStar | **61.5** | 57.5 | 54.8 | 60.7 |63.9 |
113
- | MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1**|
114
- | HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1| 50.6 | **52.9**|
115
- | MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2**|
116
- | MathVision | - | - | - | 16.3 | **25.07** |
117
-
118
- ### Video Benchmarks
119
-
120
- | Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
121
- | :--- | :---: | :---: |
122
- | MVBench | 67.0 | **69.6** |
123
- | PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
124
- | Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
125
- | LVBench | | 45.3 |
126
- | LongVideoBench | | 54.7 |
127
- | MMBench-Video | 1.44 | 1.79 |
128
- | TempCompass | | 71.7 |
129
- | MLVU | | 70.2 |
130
- | CharadesSTA/mIoU | 43.6|
131
 
132
  ### Agent benchmark
133
- | Benchmarks | Qwen2.5-VL-7B |
134
  |-------------------------|---------------|
135
- | ScreenSpot | 84.7 |
136
- | ScreenSpot Pro | 29.0 |
137
- | AITZ_EM | 81.9 |
138
- | Android Control High_EM | 60.1 |
139
- | Android Control Low_EM | 93.7 |
140
- | AndroidWorld_SR | 25.5 |
141
- | MobileMiniWob++_SR | 91.4 |
142
 
143
  ## Requirements
144
  The code of Qwen2.5-VL is available in the latest Hugging Face transformers, and we advise you to build from source with the following command:
@@ -184,25 +144,25 @@ from qwen_vl_utils import process_vision_info
184
 
185
  # default: Load the model on the available device(s)
186
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
187
- "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
188
  )
189
 
190
  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
191
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
192
- # "Qwen/Qwen2.5-VL-7B-Instruct",
193
  # torch_dtype=torch.bfloat16,
194
  # attn_implementation="flash_attention_2",
195
  # device_map="auto",
196
  # )
197
 
198
  # default processor
199
- processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
200
 
201
  # The default range for the number of visual tokens per image in the model is 4-16384.
202
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
203
  # min_pixels = 256*28*28
204
  # max_pixels = 1280*28*28
205
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
206
 
207
  messages = [
208
  {
@@ -471,7 +431,7 @@ The model supports a wide range of resolution inputs. By default, it uses the na
471
  min_pixels = 256 * 28 * 28
472
  max_pixels = 1280 * 28 * 28
473
  processor = AutoProcessor.from_pretrained(
474
- "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
475
  )
476
  ```
477
 
@@ -521,6 +481,7 @@ To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://ar
521
 
522
  For supported frameworks, you could add the following to `config.json` to enable YaRN:
523
 
 
524
  {
525
  ...,
526
  "type": "yarn",
@@ -532,6 +493,7 @@ For supported frameworks, you could add the following to `config.json` to enable
532
  "factor": 4,
533
  "original_max_position_embeddings": 32768
534
  }
 
535
 
536
  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
537
 
@@ -539,7 +501,6 @@ At the same time, for long video inputs, since MRoPE itself is more economical w
539
 
540
 
541
 
542
-
543
  ## Citation
544
 
545
  If you find our work helpful, feel free to give us a cite.
@@ -567,4 +528,3 @@ If you find our work helpful, feel free to give us a cite.
567
  year={2023}
568
  }
569
  ```
570
-
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-VL-3B-Instruct
4
7
+ license_name: qwen-research
8
+ license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
9
  language:
10
  - en
 
11
  pipeline_tag: image-text-to-text
 
12
  tags:
13
  - multimodal
 
 
14
  - unsloth
15
+ library_name: transformers
 
16
  ---
17
+
18
+ # Qwen2.5-VL-3B-Instruct
19
+ <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
20
+ <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
21
+ </a>
22
 
23
  ## Introduction
24
 
 
52
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
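As an illustrative aside, here is a minimal sketch of the SwiGLU feed-forward block mentioned here (my own illustration, not the repository's module; the 1280/3420 sizes are taken from the `vision_config` in this commit):

```python
import torch
import torch.nn as nn


class SwiGLUMLP(nn.Module):
    """Minimal SwiGLU block of the kind referenced above (illustrative only)."""

    def __init__(self, hidden_size: int = 1280, intermediate_size: int = 3420):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(intermediate_size if False else hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: a SiLU-gated projection multiplied elementwise with a parallel up projection.
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))
```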
53
 
54
 
55
+ We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
56
 
57
 
58
 
 
60
 
61
  ### Image benchmark
62
 
63
+ | Benchmark | InternVL2.5-4B |Qwen2-VL-7B |Qwen2.5-VL-3B |
64
+ | :--- | :---: | :---: | :---: |
65
+ | MMMU<sub>val</sub> | 52.3 | 54.1 | 53.1|
66
+ | MMMU-Pro<sub>val</sub> | **32.7** | 30.5 | 31.6|
67
+ | AI2D<sub>test</sub> | 81.4 | **83.0** | 81.5 |
68
+ | DocVQA<sub>test</sub> | 91.6 | 94.5 | **93.9** |
69
+ | InfoVQA<sub>test</sub> | 72.1 | 76.5 | **77.1** |
70
+ | TextVQA<sub>val</sub> | 76.8 | **84.3** | 79.3|
71
+ | MMBench-V1.1<sub>test</sub> | 79.3 | **80.7** | 77.6 |
72
+ | MMStar | 58.3 | **60.7** | 55.9 |
73
+ | MathVista<sub>testmini</sub> | 60.5 | 58.2 | **62.3** |
74
+ | MathVision<sub>full</sub> | 20.9 | 16.3 | **21.2** |
75
+
76
+
77
+ ### Video benchmark
78
+ | Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
79
+ | :--- | :---: | :---: | :---: |
80
+ | MVBench | 71.6 | 67.0 | 67.0 |
81
+ | VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
82
+ | MLVU | 48.3 | - | 68.2 |
83
+ | LVBench | - | - | 43.3 |
84
+ | MMBench-Video | 1.73 | 1.44 | 1.63 |
85
+ | EgoSchema | - | - | 64.8 |
86
+ | PerceptionTest | - | - | 66.9 |
87
+ | TempCompass | - | - | 64.4 |
88
+ | LongVideoBench | 55.2 | 55.6 | 54.2 |
89
+ | CharadesSTA/mIoU | - | - | 38.8 |
90

91
 
92
  ### Agent benchmark
93
+ | Benchmarks | Qwen2.5-VL-3B |
94
  |-------------------------|---------------|
95
+ | ScreenSpot | 55.5 |
96
+ | ScreenSpot Pro | 23.9 |
97
+ | AITZ_EM | 76.9 |
98
+ | Android Control High_EM | 63.7 |
99
+ | Android Control Low_EM | 22.2 |
100
+ | AndroidWorld_SR | 90.8 |
101
+ | MobileMiniWob++_SR | 67.9 |
102
 
103
  ## Requirements
104
  The code of Qwen2.5-VL is available in the latest Hugging Face transformers, and we advise you to build from source with the following command:
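(The exact install command sits outside this diff hunk. As a hedged aside, one quick way to confirm that the local transformers build is recent enough is to import the Qwen2.5-VL classes used in the snippets below.)

```python
# Hedged sketch: verify that the installed transformers exposes Qwen2.5-VL support.
import transformers

print(transformers.__version__)

# If this import fails, the installed release predates Qwen2.5-VL and a source
# build of transformers is needed, as the line above advises.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
```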
 
144
 
145
  # default: Load the model on the available device(s)
146
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
147
+ "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
148
  )
149
 
150
  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
151
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
152
+ # "Qwen/Qwen2.5-VL-3B-Instruct",
153
  # torch_dtype=torch.bfloat16,
154
  # attn_implementation="flash_attention_2",
155
  # device_map="auto",
156
  # )
157
 
158
  # default processor
159
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
160
 
161
  # The default range for the number of visual tokens per image in the model is 4-16384.
162
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
163
  # min_pixels = 256*28*28
164
  # max_pixels = 1280*28*28
165
+ # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
166
 
167
  messages = [
168
  {
 
431
  min_pixels = 256 * 28 * 28
432
  max_pixels = 1280 * 28 * 28
433
  processor = AutoProcessor.from_pretrained(
434
+ "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
435
  )
436
  ```
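As a quick worked check (my own arithmetic, not a statement from the card): with 28×28-pixel patches per visual token, the pixel bounds above translate directly into the 256–1280 visual-token budget that the README's comments mention.

```python
# Hedged sketch: relate the min/max pixel settings to a visual-token budget.
patch_pixels = 28 * 28              # pixels covered by one visual token
min_pixels = 256 * patch_pixels     # 200_704 pixels   -> roughly 256 visual tokens minimum
max_pixels = 1280 * patch_pixels    # 1_003_520 pixels -> roughly 1280 visual tokens maximum
print(min_pixels, max_pixels)
```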
437
 
 
481
 
482
  For supported frameworks, you could add the following to `config.json` to enable YaRN:
483
 
484
+ ```
485
  {
486
  ...,
487
  "type": "yarn",
 
493
  "factor": 4,
494
  "original_max_position_embeddings": 32768
495
  }
496
+ ```
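As a rough sanity check (an inference about how the YaRN factor is usually read, not a claim from the card): a factor of 4 over the original 32,768-token window gives about 4 × 32768 = 131072 positions, in line with the ~128K limits (`max_position_embeddings`, `max_length`, `model_max_length`) set in the config files of this commit.

```python
# Hedged sketch: the YaRN factor times the original window approximates the extended context.
original_max_position_embeddings = 32768
factor = 4
print(original_max_position_embeddings * factor)  # 131072, consistent with the 128000 limits in this commit
```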
497
 
498
  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
499
 
 
501
 
502
 
503
 
 
504
  ## Citation
505
 
506
  If you find our work helpful, feel free to give us a cite.
 
528
  year={2023}
529
  }
530
  ```
 
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
1
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
2
+ You are a helpful assistant.<|im_end|>
3
+ {% endif %}<|im_start|>{{ message['role'] }}
4
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
5
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
6
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
7
+ {% endif %}
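For reference, a hedged sketch of how this template is exercised through the processor (the message layout mirrors the README example; the rendered string follows from the template above):

```python
from transformers import AutoProcessor

# Load the processor; the upstream model id is used here as in the README snippets.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # hypothetical image URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# With the template shown above, this renders:
# <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
# <|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n
# <|im_start|>assistant\n
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```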
config.json CHANGED
@@ -1,5 +1,4 @@
1
  {
2
- "_name_or_path": "Qwen/Qwen2.5-VL-3B-Instruct",
3
  "architectures": [
4
  "Qwen2_5_VLForConditionalGeneration"
5
  ],
@@ -10,7 +9,7 @@
10
  "image_token_id": 151655,
11
  "initializer_range": 0.02,
12
  "intermediate_size": 11008,
13
- "max_position_embeddings": 32768,
14
  "max_window_layers": 70,
15
  "model_type": "qwen2_5_vl",
16
  "num_attention_heads": 16,
@@ -29,21 +28,76 @@
29
  },
30
  "rope_theta": 1000000.0,
31
  "sliding_window": 32768,
32
- "tie_word_embeddings": true,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
  "torch_dtype": "bfloat16",
34
- "transformers_version": "4.49.0",
35
  "unsloth_fixed": true,
36
  "use_cache": true,
37
  "use_sliding_window": false,
38
  "video_token_id": 151656,
39
  "vision_config": {
 
 
 
 
 
 
 
 
40
  "hidden_size": 1280,
 
41
  "in_chans": 3,
 
 
42
  "model_type": "qwen2_5_vl",
 
43
  "out_hidden_size": 2048,
 
 
44
  "spatial_patch_size": 14,
 
45
  "tokens_per_second": 2,
46
- "torch_dtype": "bfloat16"
 
47
  },
48
  "vision_end_token_id": 151653,
49
  "vision_start_token_id": 151652,
 
1
  {
 
2
  "architectures": [
3
  "Qwen2_5_VLForConditionalGeneration"
4
  ],
 
9
  "image_token_id": 151655,
10
  "initializer_range": 0.02,
11
  "intermediate_size": 11008,
12
+ "max_position_embeddings": 128000,
13
  "max_window_layers": 70,
14
  "model_type": "qwen2_5_vl",
15
  "num_attention_heads": 16,
 
28
  },
29
  "rope_theta": 1000000.0,
30
  "sliding_window": 32768,
31
+ "text_config": {
32
+ "architectures": [
33
+ "Qwen2_5_VLForConditionalGeneration"
34
+ ],
35
+ "attention_dropout": 0.0,
36
+ "bos_token_id": 151643,
37
+ "eos_token_id": 151645,
38
+ "hidden_act": "silu",
39
+ "hidden_size": 2048,
40
+ "image_token_id": null,
41
+ "initializer_range": 0.02,
42
+ "intermediate_size": 11008,
43
+ "max_position_embeddings": 128000,
44
+ "max_window_layers": 70,
45
+ "model_type": "qwen2_5_vl_text",
46
+ "num_attention_heads": 16,
47
+ "num_hidden_layers": 36,
48
+ "num_key_value_heads": 2,
49
+ "rms_norm_eps": 1e-06,
50
+ "rope_scaling": {
51
+ "mrope_section": [
52
+ 16,
53
+ 24,
54
+ 24
55
+ ],
56
+ "rope_type": "default",
57
+ "type": "default"
58
+ },
59
+ "rope_theta": 1000000.0,
60
+ "sliding_window": 32768,
61
+ "tie_word_embeddings": true,
62
+ "torch_dtype": "bfloat16",
63
+ "use_cache": true,
64
+ "use_sliding_window": false,
65
+ "video_token_id": null,
66
+ "vision_end_token_id": 151653,
67
+ "vision_start_token_id": 151652,
68
+ "vision_token_id": 151654,
69
+ "vocab_size": 151936
70
+ },
71
  "torch_dtype": "bfloat16",
72
+ "transformers_version": "4.52.0.dev0",
73
  "unsloth_fixed": true,
74
  "use_cache": true,
75
  "use_sliding_window": false,
76
  "video_token_id": 151656,
77
  "vision_config": {
78
+ "depth": 32,
79
+ "fullatt_block_indexes": [
80
+ 7,
81
+ 15,
82
+ 23,
83
+ 31
84
+ ],
85
+ "hidden_act": "silu",
86
  "hidden_size": 1280,
87
+ "in_channels": 3,
88
  "in_chans": 3,
89
+ "initializer_range": 0.02,
90
+ "intermediate_size": 3420,
91
  "model_type": "qwen2_5_vl",
92
+ "num_heads": 16,
93
  "out_hidden_size": 2048,
94
+ "patch_size": 14,
95
+ "spatial_merge_size": 2,
96
  "spatial_patch_size": 14,
97
+ "temporal_patch_size": 2,
98
  "tokens_per_second": 2,
99
+ "torch_dtype": "bfloat16",
100
+ "window_size": 112
101
  },
102
  "vision_end_token_id": 151653,
103
  "vision_start_token_id": 151652,
generation_config.json CHANGED
@@ -5,11 +5,9 @@
5
  151645,
6
  151643
7
  ],
8
- "max_length": 32768,
9
  "pad_token_id": 151654,
10
  "repetition_penalty": 1.05,
11
- "temperature": 0.1,
12
- "top_k": 1,
13
- "top_p": 0.001,
14
- "transformers_version": "4.49.0"
15
  }
 
5
  151645,
6
  151643
7
  ],
8
+ "max_length": 128000,
9
  "pad_token_id": 151654,
10
  "repetition_penalty": 1.05,
11
+ "temperature": 1e-06,
12
+ "transformers_version": "4.52.0.dev0"
 
 
13
  }
model-00001-of-00002.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:41a8895c164b4d32bae6b302f4603fcbc1797f32dafa45c7e9bcda23c6755df8
3
- size 3982649232
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b45c7afe391b4d9cc49f1ed3f6976f4a25ed40aa2165ed2ae118ff549355985
3
+ size 4997750760
model-00002-of-00002.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:365531ff8752420e89dee707b79d021fb2d6e25abafe486f080555a4fe6972e4
3
- size 3526688744
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d4578eeedb5bac3eab03fed443adbf31c3566bf02ba9ed185d0be0b0671c9550
3
+ size 2511587184
model.safetensors.index.json CHANGED
@@ -65,9 +65,9 @@
65
  "model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
66
  "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
67
  "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
68
- "model.layers.13.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
69
- "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
70
- "model.layers.13.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
71
  "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
72
  "model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
73
  "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
@@ -76,78 +76,78 @@
76
  "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
77
  "model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
78
  "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
79
- "model.layers.14.input_layernorm.weight": "model-00002-of-00002.safetensors",
80
- "model.layers.14.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
81
- "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
82
- "model.layers.14.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
83
- "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
84
- "model.layers.14.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
85
- "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
86
- "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
87
- "model.layers.14.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
88
- "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
89
- "model.layers.14.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
90
- "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
91
- "model.layers.15.input_layernorm.weight": "model-00002-of-00002.safetensors",
92
- "model.layers.15.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
93
- "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
94
- "model.layers.15.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
95
- "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
96
- "model.layers.15.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
97
- "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
98
- "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
99
- "model.layers.15.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
100
- "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
101
- "model.layers.15.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
102
- "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
103
- "model.layers.16.input_layernorm.weight": "model-00002-of-00002.safetensors",
104
- "model.layers.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
105
- "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
106
- "model.layers.16.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
107
- "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
108
- "model.layers.16.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
109
- "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
110
- "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
111
- "model.layers.16.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
112
- "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
113
- "model.layers.16.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
114
- "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
115
- "model.layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
116
- "model.layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
117
- "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
118
- "model.layers.17.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
119
- "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
120
- "model.layers.17.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
121
- "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
122
- "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
123
- "model.layers.17.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
124
- "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
125
- "model.layers.17.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
126
- "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
127
- "model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
128
- "model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
129
- "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
130
- "model.layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
131
- "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
132
- "model.layers.18.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
133
- "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
134
- "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
135
- "model.layers.18.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
136
- "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
137
- "model.layers.18.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
138
- "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
139
  "model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
140
  "model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
141
- "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
142
- "model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
143
  "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
144
- "model.layers.19.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
145
- "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
146
- "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
147
- "model.layers.19.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
148
- "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
149
- "model.layers.19.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
150
- "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
151
  "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
152
  "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
153
  "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
 
65
  "model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
66
  "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
67
  "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
71
  "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
72
  "model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
73
  "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
 
76
  "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
77
  "model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
78
  "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
79
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
80
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
82
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
83
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
84
+ "model.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
85
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
86
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
87
+ "model.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
88
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
89
+ "model.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
90
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
91
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
92
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
93
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
94
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
95
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
96
+ "model.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
97
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
98
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
99
+ "model.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
100
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
101
+ "model.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
102
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
103
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
104
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
105
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
106
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
107
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
108
+ "model.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
109
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
110
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
111
+ "model.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
112
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
113
+ "model.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
114
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
115
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
116
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
120
+ "model.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
121
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
122
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
124
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
125
+ "model.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
126
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
127
+ "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
128
+ "model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
129
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
130
+ "model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
131
+ "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
132
+ "model.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
133
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
134
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
135
+ "model.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
136
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
137
+ "model.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
138
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
139
  "model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
140
  "model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
141
+ "model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
142
+ "model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
143
  "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
144
+ "model.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
145
+ "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
146
+ "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
147
+ "model.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
148
+ "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
149
+ "model.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
150
+ "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
151
  "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
152
  "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
153
  "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
tokenizer_config.json CHANGED
@@ -195,16 +195,16 @@
195
  "<|video_pad|>"
196
  ],
197
  "bos_token": null,
198
- "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
199
  "clean_up_tokenization_spaces": false,
200
  "eos_token": "<|im_end|>",
201
  "errors": "replace",
202
  "extra_special_tokens": {},
203
- "model_max_length": 32768,
204
  "pad_token": "<|vision_pad|>",
205
  "padding_side": "left",
206
  "processor_class": "Qwen2_5_VLProcessor",
207
  "split_special_tokens": false,
208
  "tokenizer_class": "Qwen2Tokenizer",
209
- "unk_token": null
210
- }
 
 
195
  "<|video_pad|>"
196
  ],
197
  "bos_token": null,
 
198
  "clean_up_tokenization_spaces": false,
199
  "eos_token": "<|im_end|>",
200
  "errors": "replace",
201
  "extra_special_tokens": {},
202
+ "model_max_length": 128000,
203
  "pad_token": "<|vision_pad|>",
204
  "padding_side": "left",
205
  "processor_class": "Qwen2_5_VLProcessor",
206
  "split_special_tokens": false,
207
  "tokenizer_class": "Qwen2Tokenizer",
208
+ "unk_token": null,
209
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
210
+ }