+---
+license: mit
+language:
+- en
+- zh
+base_model:
+- Qwen/Qwen2.5-VL-3B-Instruct
+pipeline_tag: image-text-to-text
+library_name: transformers
+tags:
+- Qwen2.5-VL
+- Qwen2.5-VL-3B-Instruct
+- Int8
+- VLM
+---
+# Qwen2.5-VL-3B-Instruct
+This version of Qwen2.5-VL-3B-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.
+This model has been optimized with the following LoRA:
+Compatible with Pulsar2 version: 3.4
+## Conversion tool links
+If you are interested in model conversion, you can export the axmodel yourself, starting from the original repository:
+- [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) (original model)
+- [Pulsar2: how to convert an LLM from Hugging Face to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
+- [AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera)
+## Supported Platforms
+- AX650
+  - AX650N DEMO Board
+  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
+  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
+**Image processing**
+| Chip | Input size | Image count | Image encoder latency | TTFT (320 tokens) | Decode speed (w8a16) | DDR | Flash |
+|--|--|--|--|--|--|--|--|
+| AX650 | 448*448 | 1 | 780 ms | 420 ms | 6.2 tokens/sec | 4.3 GiB | 4.6 GiB |
+**Video processing**
+| Chip | Input size | Image count | Image encoder latency | TTFT (512 tokens) | Decode speed (w8a16) | DDR | Flash |
+|--|--|--|--|--|--|--|--|
+| AX650 | 308*308 | 8 | 1400 ms | 5400 ms | 6.1 tokens/sec | 4.4 GiB | 4.7 GiB |
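As a rough way to read the two tables, the time to a complete answer can be approximated as image encoder latency plus TTFT plus the decode time for the generated tokens. The sketch below applies that to the AX650 figures. The additive model and the function name `estimate_response_seconds` are assumptions made for illustration and are not part of the runtime.

```python
def estimate_response_seconds(encoder_ms, ttft_ms, decode_tokens_per_sec, new_tokens):
    """Approximate wall-clock time for one reply (assumed additive model)."""
    if decode_tokens_per_sec <= 0:
        raise ValueError("decode_tokens_per_sec must be positive")
    # Time before the first token: vision encoder plus prefill (TTFT).
    startup_s = (encoder_ms + ttft_ms) / 1000.0
    # Time spent generating the remaining tokens at the steady decode rate.
    decode_s = new_tokens / decode_tokens_per_sec
    return startup_s + decode_s


# Figures taken from the tables above (AX650, w8a16); 100 tokens is an arbitrary reply length.
image_s = estimate_response_seconds(780, 420, 6.2, new_tokens=100)
video_s = estimate_response_seconds(1400, 5400, 6.1, new_tokens=100)
print(f"image: {image_s:.1f} s, video: {video_s:.1f} s for 100 generated tokens")
```

For replies of this length, decode speed dominates the estimate in both cases, so the difference between image and video input comes mostly from the longer video prefill.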
+## How to use
+Download all files from this repository to the device.
+**If you are using an AX650 board**
+```
+root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# tree -L 2
+.
+├── image
+│   └── ssd_car.jpg
+├── main
+├── python
+│   ├── cv_resize.py
+│   ├── infer_image.py
+│   ├── infer_text.py
+│   ├── infer_video.py
+│   ├── preprocess.py
+│   └── utils.py
+├── qwen2_5-vl-3b-image-ax650
+│   ├── Qwen2.5-VL-3B-Instruct_vision_nchw448.axmodel
+│   ├── model.embed_tokens.weight.bfloat16.bin
+│   ├── qwen2_5_vl_p320_l0_together.axmodel
+......
+│   ├── qwen2_5_vl_p320_l9_together.axmodel
+│   └── qwen2_5_vl_post.axmodel
+├── qwen2_5-vl-3b-video-ax650
+│   ├── Qwen2.5-VL-3B-Instruct_vision_nhwc.axmodel
+│   ├── model.embed_tokens.weight.bfloat16.bin
+│   ├── qwen2_5_vl_p512_l0_together.axmodel
+......
+│   ├── qwen2_5_vl_p512_l9_together.axmodel
+│   └── qwen2_5_vl_post.axmodel
+├── qwen2_5-vl-tokenizer
+│   ├── chat_template.json
+│   ├── config.json
+│   ├── generation_config.json
+│   ├── merges.txt
+│   ├── model.safetensors.index.json
+│   ├── preprocessor_config.json
+│   ├── tokenizer.json
+│   ├── tokenizer_config.json
+│   └── vocab.json
+├── qwen2_tokenizer_image_448.py
+├── qwen2_tokenizer_video_308.py
+├── run_qwen2_5_vl_image.sh
+├── run_qwen2_5_vl_video.sh
+└── video
+    ├── frame_0075.jpg
+......
+    └── frame_0089.jpg
+```
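Before running anything, it can help to confirm that the download is complete. The following is a minimal sketch that checks for a handful of the files named explicitly in the tree above. The helper `missing_files` and the `EXPECTED` list are hypothetical, and the entries elided with `......` in the tree are deliberately not guessed.

```python
from pathlib import Path

# Only paths that appear verbatim in the directory tree; elided entries are not included.
EXPECTED = [
    "main",
    "run_qwen2_5_vl_image.sh",
    "run_qwen2_5_vl_video.sh",
    "qwen2_tokenizer_image_448.py",
    "qwen2_tokenizer_video_308.py",
    "qwen2_5-vl-tokenizer/tokenizer.json",
    "qwen2_5-vl-3b-image-ax650/qwen2_5_vl_post.axmodel",
    "qwen2_5-vl-3b-video-ax650/qwen2_5_vl_post.axmodel",
]


def missing_files(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Run from the directory that holds the downloaded files.
missing = missing_files(".")
print("missing:", missing if missing else "none")
```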
+#### Install transformers
+```
+pip install transformers==4.41.1
+```
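To confirm that the pinned version is the one Python actually picks up, a small check along these lines may be useful. It is a sketch using only the standard library, and the function name `check_transformers` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED = "4.41.1"  # version pinned by the pip command above


def check_transformers(pinned=PINNED):
    """Return (ok, message) describing whether the pinned version is installed."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False, "transformers is not installed"
    if installed != pinned:
        return False, f"found transformers {installed}, expected {pinned}"
    return True, f"transformers {installed} is installed"


ok, message = check_transformers()
print(message)
```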
+#### Start the Tokenizer service
+**For image processing**
+- Input text (Chinese for "Describe the image")
+```
+描述下图片
+```
+- input image
+![](./image/ssd_car.jpg)
+```
+root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_image.sh
+[I][                            Init][ 129]: LLM init start
+bos_id: -1, eos_id: 151645
+  2% | █                                 |   1 /  40 [0.01s<0.24s, 166.67 count/s] tokenizer init ok
+[I][                            Init][  26]: LLaMaEmbedSelector use mmap
+100% | ████████████████████████████████ |  40 /  40 [38.23s<38.23s, 1.05 count/s] init vpm axmodel ok,remain_cmm(7600 MB)
+[I][                            Init][ 277]: max_token_len : 1023
+[I][                            Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
+[I][                            Init][ 290]: prefill_token_num : 320
+[I][                            Init][ 292]: vpm_height : 1024,vpm_width : 392
+[I][                            Init][ 301]: LLM init ok
+Type "q" to exit, Ctrl+c to stop current running
+prompt >> who are you?
+image >>
+[I][                             Run][ 638]: ttft: 2854.47 ms
+I am a large language model created by Alibaba Cloud. I am called Qwen.
+[N][                             Run][ 779]: hit eos,avg 6.05 token/s
+prompt >> 描述下图片
+image >> image/ssd_car.jpg
+[I][                          Encode][ 416]: image encode time : 795.614014 ms, size : 524288
+[I][                             Run][ 638]: ttft: 2856.88 ms
+这张图片展示了一条繁忙的城市街道。前景中，一名女子站在人行道上，她穿着黑色外套，面带微笑。她旁边是一辆红色的双层巴士，巴士上有一个广告，
+上面写着“THINGS GET MORE EXITING WHEN YOU SAY ‘YES’”。巴士的车牌号是“L15”。巴士旁边停着一辆黑色的小型货车。背景中可以看到一些商店和行人，
+街道两旁的建筑物是现代的玻璃幕墙建筑。整体氛围显得繁忙而充满活力。
+[N][                             Run][ 779]: hit eos,avg 5.96 token/s
+```
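When comparing runs, the timing figures can be pulled out of a captured console log. The sketch below assumes the log format shown above (`image encode time`, `ttft`, and `avg ... token/s` lines). Other runtime versions may print different text, and `parse_run_log` is a hypothetical helper that is not shipped with this repository.

```python
import re

# Patterns follow the log lines shown above; adjust them if your runtime prints differently.
TTFT_RE = re.compile(r"ttft:\s*([\d.]+)\s*ms")
ENCODE_RE = re.compile(r"image encode time\s*:\s*([\d.]+)\s*ms")
SPEED_RE = re.compile(r"avg\s*([\d.]+)\s*token/s")


def parse_run_log(text):
    """Collect timing figures (as floats) from a captured console log."""
    return {
        "ttft_ms": [float(v) for v in TTFT_RE.findall(text)],
        "encode_ms": [float(v) for v in ENCODE_RE.findall(text)],
        "tokens_per_sec": [float(v) for v in SPEED_RE.findall(text)],
    }


# Three lines copied from the image run above.
sample = """
[I][                          Encode][ 416]: image encode time : 795.614014 ms, size : 524288
[I][                             Run][ 638]: ttft: 2856.88 ms
[N][                             Run][ 779]: hit eos,avg 5.96 token/s
"""
print(parse_run_log(sample))
```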
+**For video processing**
+```
+root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_video.sh
+[I][                            Init][ 129]: LLM init start
+bos_id: -1, eos_id: 151645
+  2% | █                                 |   1 /  40 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
+[I][                            Init][  26]: LLaMaEmbedSelector use mmap
+100% | ████████████████████████████████ |  40 /  40 [40.05s<40.05s, 1.00 count/s] init vpm axmodel ok,remain_cmm(7680 MB)
+[I][                            Init][ 277]: max_token_len : 1023
+[I][                            Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
+[I][                            Init][ 290]: prefill_token_num : 512
+[I][                            Init][ 292]: vpm_height : 484,vpm_width : 392
+[I][                            Init][ 301]: LLM init ok
+Type "q" to exit, Ctrl+c to stop current running
+prompt >> 描述这个视频
+image >> video
+video/frame_0075.jpg
+video/frame_0077.jpg
+video/frame_0079.jpg
+video/frame_0081.jpg
+video/frame_0083.jpg
+video/frame_0085.jpg
+video/frame_0087.jpg
+video/frame_0089.jpg
+[I][                          Encode][ 416]: image encode time : 1488.392944 ms, size : 991232
+[I][                             Run][ 638]: ttft: 5487.22 ms
+视频显示的是一个城市街道的场景。时间戳显示为2月26日，地点是xxx。视频中，一名穿着深色外套和牛仔裤的男子正在推着一个行李箱。
+突然，他似乎被什么东西绊倒，随后他摔倒在地。背景中可以看到一个广告牌，上面有一个绿色的图案，旁边停着一辆电动车。街道两旁有建筑物和树木，天气看起来有些阴沉。
+[N][                             Run][ 779]: hit eos,avg 5.94 token/s
+```
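The `vpm_height` and `size` values in the two logs can be related to the input resolution with a little arithmetic. The sketch below is an inference from the numbers, not documented runtime behavior. It assumes the upstream Qwen2.5-VL settings (14-pixel patches, 2x2 spatial merge, temporal patch size 2) and a language-model embedding width of 2048 for the 3B model, none of which are stated in this README.

```python
import math

PATCH = 14      # vision patch size in pixels (assumed from upstream Qwen2.5-VL)
MERGE = 2       # 2x2 neighbouring patches are merged into one token (assumed)
TEMPORAL = 2    # frames grouped into one temporal patch (assumed)
HIDDEN = 2048   # embedding width of the 3B language model (assumed)


def patches_per_frame(side):
    """Number of vision patches for one square frame of `side` pixels."""
    return (side // PATCH) ** 2


def visual_tokens(side, frames=1):
    """Visual tokens handed to the language model for `frames` square inputs."""
    temporal_groups = math.ceil(frames / TEMPORAL)
    return patches_per_frame(side) // (MERGE * MERGE) * temporal_groups


for name, side, frames in [("image", 448, 1), ("video", 308, 8)]:
    tokens = visual_tokens(side, frames)
    print(name, "patches per frame:", patches_per_frame(side),
          "visual tokens:", tokens, "embedding values:", tokens * HIDDEN)
```

Under these assumptions the patch counts come out as 1024 and 484, matching the logged `vpm_height` values, and the embedding counts come out as 524288 and 991232, matching the logged `size` values. The resulting 256 and 484 visual tokens also fit within the 320 and 512 prefill lengths, leaving the remainder for the text prompt.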
+#### Inference with M.2 Accelerator card
+What is the M.2 Accelerator card? This demo is based on a Raspberry Pi 5.
+TODO