danielhanchen committed
Commit d9ec034 · verified · 1 Parent(s): da12fe3

Add files using upload-large-folder tool

README.md CHANGED
@@ -1,60 +1,21 @@
1
  ---
2
- base_model: Qwen/Qwen2.5-VL-3B-Instruct
 
 
 
3
  language:
4
  - en
5
- library_name: transformers
6
  pipeline_tag: image-text-to-text
7
- license: apache-2.0
8
  tags:
9
  - multimodal
10
- - qwen
11
- - qwen2
12
  - unsloth
13
- - transformers
14
- - vision
15
  ---
16
 
17
- <div>
18
- <p style="margin-bottom: 0;">
19
- <em>Unsloth's <a href="https://unsloth.ai/blog/dynamic-4bit">Dynamic 4-bit Quants</a> are selectively quantized, greatly improving accuracy over standard 4-bit.</em>
20
- </p>
21
- <div style="display: flex; gap: 5px; align-items: center; ">
22
- <a href="https://github.com/unslothai/unsloth/">
23
- <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
24
- </a>
25
- <a href="https://discord.gg/unsloth">
26
- <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
27
- </a>
28
- <a href="https://docs.unsloth.ai/">
29
- <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
30
- </a>
31
- </div>
32
- <h1 style="margin-top: 0rem;">Finetune LLMs 2-5x faster with 70% less memory via Unsloth</h1>
33
- </div>
34
- We have a free Google Colab Tesla T4 notebook for Qwen2-VL (7B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb
35
-
36
- ## ✨ Finetune for Free
37
-
38
- All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
39
-
40
- | Unsloth supports | Free Notebooks | Performance | Memory use |
41
- |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
42
- | **Llama-3.2 (3B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2.4x faster | 58% less |
43
- | **Llama-3.2 (11B vision)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 60% less |
44
- | **Qwen2 VL (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) | 1.8x faster | 60% less |
45
- | **Qwen2.5 (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 60% less |
46
- | **Llama-3.1 (8B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2.4x faster | 58% less |
47
- | **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb) | 2x faster | 50% less |
48
- | **Gemma 2 (9B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2.4x faster | 58% less |
49
- | **Mistral (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 62% less |
50
-
51
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)
52
-
53
- - This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
54
- - This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
55
- - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
56
-
57
- # Qwen2.5-VL
58
 
59
  ## Introduction
60
 
@@ -88,7 +49,7 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
88
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
89
 
90
 
91
- We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
92
 
93
 
94
 
@@ -96,50 +57,45 @@ We have three models with 3, 7 and 72 billion parameters. This repo contains the
96
 
97
  ### Image benchmark
98
 
 
 
99
 
100
- | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B |**Qwen2.5-VL-7B** |
101
- | :--- | :---: | :---: | :---: | :---: | :---: |
102
- | MMMU<sub>val</sub> | 56 | 50.4 | **60**| 54.1 | 58.6|
103
- | MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6| 30.5 | 41.0|
104
- | DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
105
- | InfoVQA<sub>test</sub> | 77.6 | - | - |76.5 | **82.6** |
106
- | ChartQA<sub>test</sub> | 84.8 | - |- | 83.0 |**87.3** |
107
- | TextVQA<sub>val</sub> | 79.1 | 80.1 | -| 84.3 | **84.9**|
108
- | OCRBench | 822 | 852 | 785 | 845 | **864** |
109
- | CC_OCR | 57.7 | | | 61.6 | **77.8**|
110
- | MMStar | 62.8| | |60.7| **63.9**|
111
- | MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0| 80.7 | **82.6** |
112
- | MMT-Bench<sub>test</sub> | - | - | - |**63.7** |63.6 |
113
- | MMStar | **61.5** | 57.5 | 54.8 | 60.7 |63.9 |
114
- | MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1**|
115
- | HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1| 50.6 | **52.9**|
116
- | MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2**|
117
- | MathVision | - | - | - | 16.3 | **25.07** |
118
-
119
- ### Video Benchmarks
120
-
121
- | Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
122
- | :--- | :---: | :---: |
123
- | MVBench | 67.0 | **69.6** |
124
- | PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
125
- | Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
126
- | LVBench | | 45.3 |
127
- | LongVideoBench | | 54.7 |
128
- | MMBench-Video | 1.44 | 1.79 |
129
- | TempCompass | | 71.7 |
130
- | MLVU | | 70.2 |
131
- | CharadesSTA/mIoU | | 43.6 |
132
 
133
  ### Agent benchmark
134
- | Benchmarks | Qwen2.5-VL-7B |
135
  |-------------------------|---------------|
136
- | ScreenSpot | 84.7 |
137
- | ScreenSpot Pro | 29.0 |
138
- | AITZ_EM | 81.9 |
139
- | Android Control High_EM | 60.1 |
140
- | Android Control Low_EM | 93.7 |
141
- | AndroidWorld_SR | 25.5 |
142
- | MobileMiniWob++_SR | 91.4 |
143
 
144
  ## Requirements
145
  The code of Qwen2.5-VL is included in the latest Hugging Face transformers, and we advise you to build from source with the following command:
@@ -185,25 +141,25 @@ from qwen_vl_utils import process_vision_info
185
 
186
  # default: Load the model on the available device(s)
187
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
188
- "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
189
  )
190
 
191
  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
192
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
193
- # "Qwen/Qwen2.5-VL-7B-Instruct",
194
  # torch_dtype=torch.bfloat16,
195
  # attn_implementation="flash_attention_2",
196
  # device_map="auto",
197
  # )
198
 
199
  # default processor
200
- processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
201
 
202
  # The default range for the number of visual tokens per image in the model is 4-16384.
203
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
204
  # min_pixels = 256*28*28
205
  # max_pixels = 1280*28*28
206
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
207
 
208
  messages = [
209
  {
@@ -472,7 +428,7 @@ The model supports a wide range of resolution inputs. By default, it uses the na
472
  min_pixels = 256 * 28 * 28
473
  max_pixels = 1280 * 28 * 28
474
  processor = AutoProcessor.from_pretrained(
475
- "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
476
  )
477
  ```
478
 
@@ -522,6 +478,7 @@ To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://ar
522
 
523
  For supported frameworks, you could add the following to `config.json` to enable YaRN:
524
 
 
525
  {
526
  ...,
527
  "type": "yarn",
@@ -533,6 +490,7 @@ For supported frameworks, you could add the following to `config.json` to enable
533
  "factor": 4,
534
  "original_max_position_embeddings": 32768
535
  }
 
536
 
537
  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
538
 
@@ -540,7 +498,6 @@ At the same time, for long video inputs, since MRoPE itself is more economical w
540
 
541
 
542
 
543
-
544
  ## Citation
545
 
546
  If you find our work helpful, feel free to give us a cite.
@@ -568,4 +525,3 @@ If you find our work helpful, feel free to give us a cite.
568
  year={2023}
569
  }
570
  ```
571
-
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-VL-3B-Instruct
4
+ license_name: qwen-research
5
+ license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
6
  language:
7
  - en
 
8
  pipeline_tag: image-text-to-text
 
9
  tags:
10
  - multimodal
 
 
11
  - unsloth
12
+ library_name: transformers
 
13
  ---
14
 
15
+ # Qwen2.5-VL-3B-Instruct
16
+ <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
17
+ <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
18
+ </a>
 
 
19
 
20
  ## Introduction
21
 
 
49
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
50
 
51
 
52
+ We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
53
 
54
 
55
 
 
57
 
58
  ### Image benchmark
59
 
60
+ | Benchmark | InternVL2.5-4B |Qwen2-VL-7B |Qwen2.5-VL-3B |
61
+ | :--- | :---: | :---: | :---: |
62
+ | MMMU<sub>val</sub> | 52.3 | 54.1 | 53.1|
63
+ | MMMU-Pro<sub>val</sub> | **32.7** | 30.5 | 31.6|
64
+ | AI2D<sub>test</sub> | 81.4 | **83.0** | 81.5 |
65
+ | DocVQA<sub>test</sub> | 91.6 | 94.5 | **93.9** |
66
+ | InfoVQA<sub>test</sub> | 72.1 | 76.5 | **77.1** |
67
+ | TextVQA<sub>val</sub> | 76.8 | **84.3** | 79.3|
68
+ | MMBench-V1.1<sub>test</sub> | 79.3 | **80.7** | 77.6 |
69
+ | MMStar | 58.3 | **60.7** | 55.9 |
70
+ | MathVista<sub>testmini</sub> | 60.5 | 58.2 | **62.3** |
71
+ | MathVision<sub>full</sub> | 20.9 | 16.3 | **21.2** |
72
+
73
+
74
+ ### Video benchmark
75
+ | Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
76
+ | :--- | :---: | :---: | :---: |
77
+ | MVBench | 71.6 | 67.0 | 67.0 |
78
+ | VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
79
+ | MLVU | 48.3 | - | 68.2 |
80
+ | LVBench | - | - | 43.3 |
81
+ | MMBench-Video | 1.73 | 1.44 | 1.63 |
82
+ | EgoSchema | - | - | 64.8 |
83
+ | PerceptionTest | - | - | 66.9 |
84
+ | TempCompass | - | - | 64.4 |
85
+ | LongVideoBench | 55.2 | 55.6 | 54.2 |
86
+ | CharadesSTA/mIoU | - | - | 38.8 |
87
 
 
 
88
 
89
  ### Agent benchmark
90
+ | Benchmarks | Qwen2.5-VL-3B |
91
  |-------------------------|---------------|
92
+ | ScreenSpot | 55.5 |
93
+ | ScreenSpot Pro | 23.9 |
94
+ | AITZ_EM | 76.9 |
95
+ | Android Control High_EM | 63.7 |
96
+ | Android Control Low_EM | 22.2 |
97
+ | AndroidWorld_SR | 90.8 |
98
+ | MobileMiniWob++_SR | 67.9 |
99
 
100
  ## Requirements
101
  The code of Qwen2.5-VL is included in the latest Hugging Face transformers, and we advise you to build from source with the following command:
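Whichever way you install it, a quick check that your build already includes Qwen2.5-VL support is to import the model class used throughout this card; a minimal sketch:

```python
# Minimal sanity check: Qwen2.5-VL support is only present in sufficiently recent
# transformers builds. If the import fails, install transformers from source as advised above.
try:
    from transformers import Qwen2_5_VLForConditionalGeneration  # noqa: F401
    print("This transformers build supports Qwen2.5-VL.")
except ImportError:
    print("Qwen2_5_VLForConditionalGeneration not found - please build transformers from source.")
```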
 
141
 
142
  # default: Load the model on the available device(s)
143
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
144
+ "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
145
  )
146
 
147
  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
148
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
149
+ # "Qwen/Qwen2.5-VL-3B-Instruct",
150
  # torch_dtype=torch.bfloat16,
151
  # attn_implementation="flash_attention_2",
152
  # device_map="auto",
153
  # )
154
 
155
  # default processor
156
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
157
 
158
  # The default range for the number of visual tokens per image in the model is 4-16384.
159
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
160
  # min_pixels = 256*28*28
161
  # max_pixels = 1280*28*28
162
+ # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
163
 
164
  messages = [
165
  {
 
428
  min_pixels = 256 * 28 * 28
429
  max_pixels = 1280 * 28 * 28
430
  processor = AutoProcessor.from_pretrained(
431
+ "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
432
  )
433
  ```
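As a rough guide to what this pixel budget means: each visual token corresponds to a 28x28 pixel patch of the resized image, so the bounds above translate to roughly 256-1280 tokens per image. The sketch below is only an approximation of the processor's behaviour (the exact resizing logic in `qwen_vl_utils` also rounds the sides to multiples of 28):

```python
# Approximate visual-token count for an image under a min_pixels/max_pixels budget.
def approx_visual_tokens(width: int, height: int,
                         min_pixels: int = 256 * 28 * 28,
                         max_pixels: int = 1280 * 28 * 28) -> int:
    area = width * height
    # The image is rescaled so that its area falls inside [min_pixels, max_pixels].
    clamped_area = min(max(area, min_pixels), max_pixels)
    # One visual token per 28x28 patch of the (resized) image.
    return round(clamped_area / (28 * 28))

print(approx_visual_tokens(1920, 1080))  # ~1280 tokens: capped by max_pixels
print(approx_visual_tokens(320, 240))    # ~256 tokens: raised to min_pixels
```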
434
 
 
478
 
479
  For supported frameworks, you could add the following to `config.json` to enable YaRN:
480
 
481
+ ```
482
  {
483
  ...,
484
  "type": "yarn",
 
490
  "factor": 4,
491
  "original_max_position_embeddings": 32768
492
  }
493
+ ```
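If you prefer to patch `config.json` programmatically rather than by hand, a minimal sketch is shown below. The path is illustrative, the keys are exactly the ones shown above (merged at the top level, as the snippet depicts), and the fields elided by `...` are simply left untouched:

```python
import json

# Illustrative path to the downloaded checkpoint's config.json.
config_path = "Qwen2.5-VL-3B-Instruct/config.json"

with open(config_path) as f:
    config = json.load(f)

# Merge in the YaRN fields from the snippet above; existing entries stay as they are.
config.update({
    "type": "yarn",
    "factor": 4,
    "original_max_position_embeddings": 32768,
})

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```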
494
 
495
  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
496
 
 
498
 
499
 
500
 
 
501
  ## Citation
502
 
503
  If you find our work helpful, feel free to give us a cite.
 
525
  year={2023}
526
  }
527
  ```
 
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
1
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
2
+ You are a helpful assistant.<|im_end|>
3
+ {% endif %}<|im_start|>{{ message['role'] }}
4
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
5
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
6
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
7
+ {% endif %}
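For reference, this is the template that `apply_chat_template` renders at inference time. A minimal sketch of feeding it a mixed image/text message (the checkpoint path and image URI are placeholders):

```python
from transformers import AutoProcessor

# Placeholder path: any local copy of this repo (or its Hub id) works here.
processor = AutoProcessor.from_pretrained("./Qwen2.5-VL-3B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/example.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Renders the Jinja template above: a default system turn, the vision placeholder
# tokens for the image, the user text, and a trailing assistant header.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```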
config.json CHANGED
@@ -1,5 +1,4 @@
1
  {
2
- "_name_or_path": "unsloth/Qwen2.5-VL-3B-Instruct",
3
  "architectures": [
4
  "Qwen2_5_VLForConditionalGeneration"
5
  ],
@@ -10,7 +9,7 @@
10
  "image_token_id": 151655,
11
  "initializer_range": 0.02,
12
  "intermediate_size": 11008,
13
- "max_position_embeddings": 32768,
14
  "max_window_layers": 70,
15
  "model_type": "qwen2_5_vl",
16
  "num_attention_heads": 16,
@@ -31,74 +30,76 @@
31
  "multi_modal_projector",
32
  "merger",
33
  "modality_projection",
34
- "visual.merger.mlp",
35
- "visual.blocks.28.attn",
36
  "model.layers.5.mlp",
37
  "visual.blocks.25.attn",
 
 
 
 
38
  "visual.blocks.26.attn",
39
- "visual.blocks.31.attn",
40
  "visual.blocks.22.attn",
 
41
  "visual.blocks.27.attn",
42
- "visual.blocks.24.attn",
43
- "visual.blocks.29.attn",
44
- "visual.blocks.21.attn",
45
  "visual.blocks.30.mlp",
46
- "visual.blocks.30.attn",
47
- "visual.blocks.25.mlp",
48
- "visual.blocks.27.mlp",
49
- "visual.blocks.23.attn",
50
- "visual.blocks.20.attn",
51
  "visual.blocks.29.mlp",
52
- "visual.blocks.24.mlp",
 
53
  "visual.blocks.18.attn",
54
- "visual.blocks.31.mlp",
55
- "visual.blocks.19.attn",
56
  "visual.blocks.26.mlp",
 
 
57
  "visual.blocks.28.mlp",
 
 
 
58
  "visual.blocks.23.mlp",
59
- "visual.blocks.21.mlp",
60
- "visual.blocks.13.attn",
61
  "visual.blocks.17.attn",
62
  "visual.blocks.20.mlp",
63
- "model.layers.2.mlp",
 
64
  "visual.blocks.22.mlp",
65
- "visual.blocks.18.mlp",
66
- "visual.blocks.16.attn",
67
- "visual.blocks.12.attn",
68
- "visual.blocks.19.mlp",
69
  "visual.blocks.9.mlp",
70
- "visual.blocks.6.mlp",
71
- "visual.blocks.16.mlp",
72
  "visual.blocks.10.mlp",
73
- "visual.blocks.9.attn",
 
 
 
 
74
  "model.layers.1.mlp",
75
- "visual.blocks.10.attn",
76
  "visual.blocks.14.attn",
77
- "visual.blocks.6.attn",
78
  "visual.blocks.11.mlp",
79
  "visual.blocks.11.attn",
 
 
80
  "visual.blocks.12.mlp",
81
- "visual.blocks.7.mlp",
82
- "visual.blocks.2.mlp",
83
  "visual.blocks.13.mlp",
84
- "visual.blocks.8.attn",
85
- "visual.blocks.5.mlp",
86
  "visual.blocks.8.mlp",
87
  "visual.blocks.14.mlp",
 
 
 
88
  "visual.blocks.15.mlp",
 
 
 
89
  "visual.blocks.4.mlp",
90
- "visual.blocks.5.attn",
 
 
91
  "visual.blocks.17.mlp",
92
- "visual.blocks.3.mlp",
93
  "visual.blocks.15.attn",
94
- "visual.blocks.1.attn",
95
- "visual.blocks.1.mlp",
96
- "visual.blocks.7.attn",
97
- "visual.blocks.2.attn",
98
  "visual.blocks.4.attn",
99
- "visual.blocks.0.mlp",
100
  "visual.blocks.0.attn",
101
- "visual.blocks.3.attn"
 
 
102
  ],
103
  "llm_int8_threshold": 6.0,
104
  "load_in_4bit": true,
@@ -117,21 +118,76 @@
117
  },
118
  "rope_theta": 1000000.0,
119
  "sliding_window": 32768,
120
- "tie_word_embeddings": true,
 
 
 
121
  "torch_dtype": "bfloat16",
122
- "transformers_version": "4.49.0",
123
  "unsloth_fixed": true,
124
  "use_cache": true,
125
  "use_sliding_window": false,
126
  "video_token_id": 151656,
127
  "vision_config": {
 
 
 
 
 
 
 
 
128
  "hidden_size": 1280,
 
129
  "in_chans": 3,
 
 
130
  "model_type": "qwen2_5_vl",
 
131
  "out_hidden_size": 2048,
 
 
132
  "spatial_patch_size": 14,
 
133
  "tokens_per_second": 2,
134
- "torch_dtype": "bfloat16"
 
135
  },
136
  "vision_end_token_id": 151653,
137
  "vision_start_token_id": 151652,
 
1
  {
 
2
  "architectures": [
3
  "Qwen2_5_VLForConditionalGeneration"
4
  ],
 
9
  "image_token_id": 151655,
10
  "initializer_range": 0.02,
11
  "intermediate_size": 11008,
12
+ "max_position_embeddings": 128000,
13
  "max_window_layers": 70,
14
  "model_type": "qwen2_5_vl",
15
  "num_attention_heads": 16,
 
30
  "multi_modal_projector",
31
  "merger",
32
  "modality_projection",
 
 
33
  "model.layers.5.mlp",
34
  "visual.blocks.25.attn",
35
+ "visual.merger.mlp",
36
+ "visual.blocks.24.attn",
37
+ "visual.blocks.29.attn",
38
+ "visual.blocks.30.attn",
39
  "visual.blocks.26.attn",
 
40
  "visual.blocks.22.attn",
41
+ "visual.blocks.31.attn",
42
  "visual.blocks.27.attn",
43
+ "model.layers.30.mlp",
 
 
44
  "visual.blocks.30.mlp",
45
+ "visual.blocks.28.attn",
 
 
 
 
46
  "visual.blocks.29.mlp",
47
+ "visual.blocks.25.mlp",
48
+ "visual.blocks.21.attn",
49
  "visual.blocks.18.attn",
50
+ "visual.blocks.20.attn",
 
51
  "visual.blocks.26.mlp",
52
+ "visual.blocks.16.attn",
53
+ "visual.blocks.31.mlp",
54
  "visual.blocks.28.mlp",
55
+ "visual.blocks.27.mlp",
56
+ "visual.blocks.24.mlp",
57
+ "visual.blocks.19.attn",
58
  "visual.blocks.23.mlp",
59
+ "visual.blocks.19.mlp",
 
60
  "visual.blocks.17.attn",
61
  "visual.blocks.20.mlp",
62
+ "visual.blocks.23.attn",
63
+ "visual.blocks.13.attn",
64
  "visual.blocks.22.mlp",
 
 
 
 
65
  "visual.blocks.9.mlp",
 
 
66
  "visual.blocks.10.mlp",
67
+ "visual.blocks.16.mlp",
68
+ "visual.blocks.12.attn",
69
+ "visual.blocks.18.mlp",
70
+ "visual.blocks.21.mlp",
71
+ "visual.blocks.6.mlp",
72
  "model.layers.1.mlp",
 
73
  "visual.blocks.14.attn",
 
74
  "visual.blocks.11.mlp",
75
  "visual.blocks.11.attn",
76
+ "visual.blocks.9.attn",
77
+ "model.layers.2.mlp",
78
  "visual.blocks.12.mlp",
79
+ "visual.blocks.10.attn",
80
+ "visual.blocks.6.attn",
81
  "visual.blocks.13.mlp",
 
 
82
  "visual.blocks.8.mlp",
83
  "visual.blocks.14.mlp",
84
+ "visual.blocks.7.mlp",
85
+ "visual.blocks.5.attn",
86
+ "visual.blocks.8.attn",
87
  "visual.blocks.15.mlp",
88
+ "visual.blocks.5.mlp",
89
+ "visual.blocks.3.mlp",
90
+ "visual.blocks.2.mlp",
91
  "visual.blocks.4.mlp",
92
+ "visual.blocks.2.attn",
93
+ "visual.blocks.7.attn",
94
+ "visual.blocks.1.attn",
95
  "visual.blocks.17.mlp",
 
96
  "visual.blocks.15.attn",
 
 
 
 
97
  "visual.blocks.4.attn",
98
+ "visual.blocks.1.mlp",
99
  "visual.blocks.0.attn",
100
+ "visual.blocks.0.mlp",
101
+ "visual.blocks.3.attn",
102
+ "visual.blocks.31.mlp.down_proj"
103
  ],
104
  "llm_int8_threshold": 6.0,
105
  "load_in_4bit": true,
 
118
  },
119
  "rope_theta": 1000000.0,
120
  "sliding_window": 32768,
121
+ "text_config": {
122
+ "architectures": [
123
+ "Qwen2_5_VLForConditionalGeneration"
124
+ ],
125
+ "attention_dropout": 0.0,
126
+ "bos_token_id": 151643,
127
+ "eos_token_id": 151645,
128
+ "hidden_act": "silu",
129
+ "hidden_size": 2048,
130
+ "image_token_id": null,
131
+ "initializer_range": 0.02,
132
+ "intermediate_size": 11008,
133
+ "max_position_embeddings": 128000,
134
+ "max_window_layers": 70,
135
+ "model_type": "qwen2_5_vl_text",
136
+ "num_attention_heads": 16,
137
+ "num_hidden_layers": 36,
138
+ "num_key_value_heads": 2,
139
+ "rms_norm_eps": 1e-06,
140
+ "rope_scaling": {
141
+ "mrope_section": [
142
+ 16,
143
+ 24,
144
+ 24
145
+ ],
146
+ "rope_type": "default",
147
+ "type": "default"
148
+ },
149
+ "rope_theta": 1000000.0,
150
+ "sliding_window": 32768,
151
+ "tie_word_embeddings": true,
152
+ "torch_dtype": "bfloat16",
153
+ "use_cache": true,
154
+ "use_sliding_window": false,
155
+ "video_token_id": null,
156
+ "vision_end_token_id": 151653,
157
+ "vision_start_token_id": 151652,
158
+ "vision_token_id": 151654,
159
+ "vocab_size": 151936
160
+ },
161
  "torch_dtype": "bfloat16",
162
+ "transformers_version": "4.52.0.dev0",
163
  "unsloth_fixed": true,
164
  "use_cache": true,
165
  "use_sliding_window": false,
166
  "video_token_id": 151656,
167
  "vision_config": {
168
+ "depth": 32,
169
+ "fullatt_block_indexes": [
170
+ 7,
171
+ 15,
172
+ 23,
173
+ 31
174
+ ],
175
+ "hidden_act": "silu",
176
  "hidden_size": 1280,
177
+ "in_channels": 3,
178
  "in_chans": 3,
179
+ "initializer_range": 0.02,
180
+ "intermediate_size": 3420,
181
  "model_type": "qwen2_5_vl",
182
+ "num_heads": 16,
183
  "out_hidden_size": 2048,
184
+ "patch_size": 14,
185
+ "spatial_merge_size": 2,
186
  "spatial_patch_size": 14,
187
+ "temporal_patch_size": 2,
188
  "tokens_per_second": 2,
189
+ "torch_dtype": "bfloat16",
190
+ "window_size": 112
191
  },
192
  "vision_end_token_id": 151653,
193
  "vision_start_token_id": 151652,
generation_config.json CHANGED
@@ -5,11 +5,9 @@
5
  151645,
6
  151643
7
  ],
8
- "max_length": 32768,
9
  "pad_token_id": 151654,
10
  "repetition_penalty": 1.05,
11
- "temperature": 0.1,
12
- "top_k": 1,
13
- "top_p": 0.001,
14
- "transformers_version": "4.49.0"
15
  }
 
5
  151645,
6
  151643
7
  ],
8
+ "max_length": 128000,
9
  "pad_token_id": 151654,
10
  "repetition_penalty": 1.05,
11
+ "temperature": 1e-06,
12
+ "transformers_version": "4.52.0.dev0"
 
 
13
  }
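One way to see what these defaults mean in practice is to load the file directly; a minimal sketch (the path is illustrative, a Hub repo id works the same way):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("./Qwen2.5-VL-3B-Instruct")

# Values from the generation_config.json above; model.generate() falls back to them
# whenever you do not pass explicit overrides.
print(gen_config.max_length)          # 128000
print(gen_config.temperature)         # 1e-06
print(gen_config.repetition_penalty)  # 1.05
```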
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e556b1118e6c6b9fdf3b37804ba4ec3d5ae37aa63013f9e92c4f2859ef481375
3
- size 3693149642
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1b25dbfec33ee11b1efab94b2bd8142b962ad5670dda424a695e1450e60e7f44
3
+ size 3793520611
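To check that a downloaded `model.safetensors` matches the new LFS pointer, you can hash it locally; a minimal sketch (the local path is a placeholder):

```python
import hashlib

# Placeholder path to the downloaded weights file.
path = "Qwen2.5-VL-3B-Instruct/model.safetensors"

sha256 = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

# Expected, from the LFS pointer above:
#   oid  1b25dbfec33ee11b1efab94b2bd8142b962ad5670dda424a695e1450e60e7f44
#   size 3793520611 bytes
print(sha256.hexdigest())
```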
tokenizer_config.json CHANGED
@@ -195,16 +195,16 @@
195
  "<|video_pad|>"
196
  ],
197
  "bos_token": null,
198
- "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
199
  "clean_up_tokenization_spaces": false,
200
  "eos_token": "<|im_end|>",
201
  "errors": "replace",
202
  "extra_special_tokens": {},
203
- "model_max_length": 32768,
204
  "pad_token": "<|vision_pad|>",
205
  "padding_side": "left",
206
  "processor_class": "Qwen2_5_VLProcessor",
207
  "split_special_tokens": false,
208
  "tokenizer_class": "Qwen2Tokenizer",
209
- "unk_token": null
210
- }
 
 
195
  "<|video_pad|>"
196
  ],
197
  "bos_token": null,
 
198
  "clean_up_tokenization_spaces": false,
199
  "eos_token": "<|im_end|>",
200
  "errors": "replace",
201
  "extra_special_tokens": {},
202
+ "model_max_length": 128000,
203
  "pad_token": "<|vision_pad|>",
204
  "padding_side": "left",
205
  "processor_class": "Qwen2_5_VLProcessor",
206
  "split_special_tokens": false,
207
  "tokenizer_class": "Qwen2Tokenizer",
208
+ "unk_token": null,
209
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
210
+ }
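After downloading, a quick way to confirm the updated tokenizer settings is to load it and inspect the fields changed above; a minimal sketch (the path is illustrative, a Hub repo id works the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./Qwen2.5-VL-3B-Instruct")

print(tokenizer.model_max_length)           # 128000, as set above
print(tokenizer.pad_token)                  # <|vision_pad|>
print(tokenizer.chat_template is not None)  # True: the chat template shown above is attached
```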