---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen2.5-VL
- Qwen2.5-VL-3B-Instruct
- Int8
- VLM
---
# Qwen2.5-VL-3B-Instruct
This version of Qwen2.5-VL-3B-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 3.4
## Conversion tool links
For those interested in model conversion, you can export the axmodel yourself starting from the original model repo:
https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera)
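For orientation, Pulsar2 drives the LLM half of the build through its `llm_build` subcommand (the vision encoder is exported separately). The command below is only a sketch: the flag names and values are assumptions modeled on the Pulsar2 LLM build guide linked above, and should be checked against it.
```
# Sketch only: flags/values are assumptions; see the Pulsar2 docs linked above.
pulsar2 llm_build \
  --input_path Qwen/Qwen2.5-VL-3B-Instruct \
  --output_path ./qwen2_5-vl-3b-ax650 \
  --kv_cache_len 1023 \
  --prefill_len 320 \
  --chip AX650
```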
## Support Platform
- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
**Image Process**
|Chip| Input size | Image num | Image encoder | TTFT (320 tokens) | w8a16 decode | DDR | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 448×448 | 1 | 780 ms | 420 ms | 6.2 tokens/sec | 4.3 GiB | 4.6 GiB |
**Video Process**
|Chip| Input size | Image num | Image encoder | TTFT (512 tokens) | w8a16 decode | DDR | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 308×308 | 8 | 1400 ms | 5400 ms | 6.1 tokens/sec | 4.4 GiB | 4.7 GiB |

*TTFT* is the time to first token at the given prefill length; *w8a16 decode* is the sustained decode throughput under w8a16 quantization.
## How to use
Download all files from this repository to the device.
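For example, with `huggingface-cli` (the repo id below refers to this model card and is an assumption; substitute the actual repo id shown in your browser):
```
pip install -U "huggingface_hub[cli]"
huggingface-cli download AXERA-TECH/Qwen2.5-VL-3B-Instruct --local-dir ./qwen2.5-vl-3b
```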
**If you are using an AX650 board**
```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# tree -L 2
.
├── image
│ └── ssd_car.jpg
├── main
├── python
│ ├── cv_resize.py
│ ├── infer_image.py
│ ├── infer_text.py
│ ├── infer_video.py
│ ├── preprocess.py
│ └── utils.py
├── qwen2_5-vl-3b-image-ax650
│ ├── Qwen2.5-VL-3B-Instruct_vision_nchw448.axmodel
│ ├── model.embed_tokens.weight.bfloat16.bin
│ ├── qwen2_5_vl_p320_l0_together.axmodel
......
│ ├── qwen2_5_vl_p320_l9_together.axmodel
│ └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-3b-video-ax650
│ ├── Qwen2.5-VL-3B-Instruct_vision_nhwc.axmodel
│ ├── model.embed_tokens.weight.bfloat16.bin
│ ├── qwen2_5_vl_p512_l0_together.axmodel
......
│ ├── qwen2_5_vl_p512_l9_together.axmodel
│ └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-tokenizer
│ ├── chat_template.json
│ ├── config.json
│ ├── generation_config.json
│ ├── merges.txt
│ ├── model.safetensors.index.json
│ ├── preprocessor_config.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ └── vocab.json
├── qwen2_tokenizer_image_448.py
├── qwen2_tokenizer_video_308.py
├── run_qwen2_5_vl_image.sh
├── run_qwen2_5_vl_video.sh
└── video
├── frame_0075.jpg
......
└── frame_0089.jpg
```
#### Install transformers
```
pip install transformers==4.41.1
```
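To confirm the pinned version is active (newer releases may change tokenizer behavior), run a quick check:
```
python3 -c "import transformers; print(transformers.__version__)"
# expected: 4.41.1
```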
#### Start the Tokenizer service
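The native `main` binary delegates tokenization to the Python scripts shipped in this repo. The launch commands below are a sketch based on the script names in the tree above; the `--port` flag is an assumption, so check each script's `--help`:
```
# Sketch: --port is an assumption; check each script's --help.
python3 qwen2_tokenizer_image_448.py --port 12345   # image demo
python3 qwen2_tokenizer_video_308.py --port 12345   # video demo
```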
**If you are using image input**
- Input text (the sample prompt below is Chinese for "describe the image"):
```
描述下图片
```
- Input image:

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_image.sh
[I][ Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
2% | █ | 1 / 40 [0.01s<0.24s, 166.67 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 40 / 40 [38.23s<38.23s, 1.05 count/s] init vpm axmodel ok,remain_cmm(7600 MB)
[I][ Init][ 277]: max_token_len : 1023
[I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][ Init][ 290]: prefill_token_num : 320
[I][ Init][ 292]: vpm_height : 1024,vpm_width : 392
[I][ Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you?
image >>
[I][ Run][ 638]: ttft: 2854.47 ms
I am a large language model created by Alibaba Cloud. I am called Qwen.
[N][ Run][ 779]: hit eos,avg 6.05 token/s
prompt >> 描述下图片
image >> image/ssd_car.jpg
[I][ Encode][ 416]: image encode time : 795.614014 ms, size : 524288
[I][ Run][ 638]: ttft: 2856.88 ms
This image shows a busy city street. In the foreground, a woman stands on the sidewalk wearing a black coat and smiling. Next to her is a red double-decker bus carrying an advertisement
that reads "THINGS GET MORE EXITING WHEN YOU SAY 'YES'". The bus's plate number is "L15". A black van is parked beside the bus. In the background, shops and pedestrians are visible,
and the street is lined with modern glass-fronted buildings. The overall atmosphere is busy and lively.
[N][ Run][ 779]: hit eos,avg 5.96 token/s
```
**If you are using video input**
```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_video.sh
[I][ Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
2% | █ | 1 / 40 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 40 / 40 [40.05s<40.05s, 1.00 count/s] init vpm axmodel ok,remain_cmm(7680 MB)
[I][ Init][ 277]: max_token_len : 1023
[I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][ Init][ 290]: prefill_token_num : 512
[I][ Init][ 292]: vpm_height : 484,vpm_width : 392
[I][ Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
image >> video
video/frame_0075.jpg
video/frame_0077.jpg
video/frame_0079.jpg
video/frame_0081.jpg
video/frame_0083.jpg
video/frame_0085.jpg
video/frame_0087.jpg
video/frame_0089.jpg
[I][ Encode][ 416]: image encode time : 1488.392944 ms, size : 991232
[I][ Run][ 638]: ttft: 5487.22 ms
The video shows a city street scene. The timestamp reads February 26, and the location is xxx. In the video, a man in a dark coat and jeans is pushing a suitcase.
Suddenly he appears to trip on something and falls to the ground. A billboard with a green design is visible in the background, with an electric scooter parked beside it. Buildings and trees line both sides of the street, and the weather looks somewhat overcast.
[N][ Run][ 779]: hit eos,avg 5.94 token/s
```
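The video demo consumes a directory of 308×308 JPEG frames, eight per inference (see the table above). One way to produce such frames from your own clip, assuming `ffmpeg` is available (the input filename and frame rate are placeholders):
```
# Extract 2 frames/sec, scaled to 308x308; input.mp4 is a placeholder name.
ffmpeg -i input.mp4 -vf "fps=2,scale=308:308" video/frame_%04d.jpg
```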
#### Inference with M.2 Accelerator card
What is the M.2 accelerator card, and how does this demo run on it? A walkthrough based on a Raspberry Pi 5 host is planned.
TODO