---
license: mit
language:
  - en
  - zh
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - Qwen2.5-VL
  - Qwen2.5-VL-3B-Instruct
  - Int8
  - VLM
---

# Qwen2.5-VL-3B-Instruct

This version of Qwen2.5-VL-3B-Instruct has been converted to run on the Axera NPU using w8a16 quantization.
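Here w8a16 means the linear-layer weights are stored as int8 while activations stay in 16-bit float. As a rough illustration of the idea, here is a minimal numpy sketch (illustrative only, not the Pulsar2/NPU kernel):

```python
import numpy as np

# Illustrative w8a16 linear layer: int8 weights with per-output-channel
# scales, 16-bit float activations. Not the actual Pulsar2/NPU kernel.
def quantize_w8(w: np.ndarray):
    # Per-output-channel symmetric int8 quantization of a weight matrix.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return w_int8, scale.astype(np.float32)

def linear_w8a16(x_fp16: np.ndarray, w_int8: np.ndarray, scale: np.ndarray):
    # Dequantize weights on the fly, keep activations in 16-bit float.
    w = w_int8.astype(np.float32) * scale              # [out, in]
    return (x_fp16.astype(np.float32) @ w.T).astype(np.float16)

w = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(1, 128).astype(np.float16)
w_q, s = quantize_w8(w)
err = np.abs(linear_w8a16(x, w_q, s) - (x.astype(np.float32) @ w.T).astype(np.float16))
print("max quantization error:", err.max())
```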

Compatible with Pulsar2 version: 3.4

## Convert tools links

If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct. A starting-point sketch is shown below.
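As a first step, the original checkpoint can be fetched with huggingface_hub before running Pulsar2 on it (the local path below is arbitrary):

```python
from huggingface_hub import snapshot_download

# Download the original Qwen2.5-VL-3B-Instruct weights; Pulsar2 then
# converts this checkpoint into the axmodel files shipped here.
snapshot_download(
    repo_id="Qwen/Qwen2.5-VL-3B-Instruct",
    local_dir="Qwen2.5-VL-3B-Instruct",  # arbitrary local path
)
```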

- Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel

- AXera NPU HOST LLM Runtime

## Support Platform

### Image Process

| Chips | input size | image num | image encoder | ttft (320 tokens) | w8a16 | DDR | Flash |
|-------|------------|-----------|---------------|-------------------|-------|-----|-------|
| AX650 | 448*448 | 1 | 780 ms | 420 ms | 6.2 tokens/sec | 4.3 GiB | 4.6 GiB |

### Video Process

| Chips | input size | image num | image encoder | ttft (512 tokens) | w8a16 | DDR | Flash |
|-------|------------|-----------|---------------|-------------------|-------|-----|-------|
| AX650 | 308*308 | 8 | 1400 ms | 5400 ms | 6.1 tokens/sec | 4.4 GiB | 4.7 GiB |
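As a rule of thumb, end-to-end latency is roughly the image-encode time plus ttft plus the number of generated tokens divided by the decode rate. A small sketch using the table values (assuming ttft here excludes the image-encode time):

```python
# Rough end-to-end latency estimate from the benchmark tables above.
def e2e_latency_s(encode_ms, ttft_ms, out_tokens, tokens_per_s):
    return (encode_ms + ttft_ms) / 1000.0 + out_tokens / tokens_per_s

# Image pipeline on AX650: 780 ms encode, 420 ms ttft, 6.2 tokens/sec decode.
print(f"image, 100 output tokens: {e2e_latency_s(780, 420, 100, 6.2):.1f} s")
# Video pipeline on AX650: 1400 ms encode, 5400 ms ttft, 6.1 tokens/sec decode.
print(f"video, 100 output tokens: {e2e_latency_s(1400, 5400, 100, 6.1):.1f} s")
```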

## How to use

Download all files from this repository to the device.
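For example, with huggingface_hub (the repo id below is a placeholder for this repository's id on the Hub; `huggingface-cli download` works too):

```python
from huggingface_hub import snapshot_download

# Pull every file in this repository onto the device.
# "<this-repo-id>" is a placeholder: substitute this repository's actual id.
snapshot_download(repo_id="<this-repo-id>", local_dir="qwen2.5-vl-3b")
```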

If you are using an AX650 board:

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# tree -L 2
.
├── image
│   └── ssd_car.jpg
├── main
├── python
│   ├── cv_resize.py
│   ├── infer_image.py
│   ├── infer_text.py
│   ├── infer_video.py
│   ├── preprocess.py
│   └── utils.py
├── qwen2_5-vl-3b-image-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nchw448.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p320_l0_together.axmodel
......
│   ├── qwen2_5_vl_p320_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-3b-video-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nhwc.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p512_l0_together.axmodel
......
│   ├── qwen2_5_vl_p512_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-tokenizer
│   ├── chat_template.json
│   ├── config.json
│   ├── generation_config.json
│   ├── merges.txt
│   ├── model.safetensors.index.json
│   ├── preprocessor_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── qwen2_tokenizer_image_448.py
├── qwen2_tokenizer_video_308.py
├── run_qwen2_5_vl_image.sh
├── run_qwen2_5_vl_video.sh
└── video
    ├── frame_0075.jpg
......
    └── frame_0089.jpg
```
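Each decoder layer ships as its own `*_together.axmodel` (the `......` lines above elide the middle layers). A quick sketch to verify the layout after download (directory and file names taken from the tree; adjust `root` to where you placed the files):

```python
from pathlib import Path

root = Path("qwen2.5-vl-3b")  # adjust to your download location

# One axmodel per decoder layer, plus embeddings and the post head.
for sub in ("qwen2_5-vl-3b-image-ax650", "qwen2_5-vl-3b-video-ax650"):
    layers = sorted((root / sub).glob("qwen2_5_vl_p*_l*_together.axmodel"))
    print(f"{sub}: {len(layers)} per-layer axmodel files")
    assert (root / sub / "qwen2_5_vl_post.axmodel").exists()
    assert (root / sub / "model.embed_tokens.weight.bfloat16.bin").exists()
```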

### Install transformers

```
pip install transformers==4.41.1
```
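The tokenizer scripts load the files shipped in `qwen2_5-vl-tokenizer` (see the tree above). A quick sanity check with the pinned transformers version (this assumes the chat template is bundled with the tokenizer files):

```python
from transformers import AutoTokenizer

# Load the tokenizer files shipped in qwen2_5-vl-tokenizer (see tree above).
tok = AutoTokenizer.from_pretrained("qwen2_5-vl-tokenizer")

# Render the chat template for a plain-text prompt.
msgs = [{"role": "user", "content": "who are you?"}]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```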

### Start the Tokenizer service

If you are using the image pipeline:

- input text: 描述下图片 (Describe the image)
- input image: image/ssd_car.jpg

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_image.sh
[I][                            Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
  2% | █                                 |   1 /  40 [0.01s<0.24s, 166.67 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  40 /  40 [38.23s<38.23s, 1.05 count/s] init vpm axmodel ok,remain_cmm(7600 MB)
[I][                            Init][ 277]: max_token_len : 1023
[I][                            Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 290]: prefill_token_num : 320
[I][                            Init][ 292]: vpm_height : 1024,vpm_width : 392
[I][                            Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running

prompt >> who are you?
image >>
[I][                             Run][ 638]: ttft: 2854.47 ms
I am a large language model created by Alibaba Cloud. I am called Qwen.

[N][                             Run][ 779]: hit eos,avg 6.05 token/s

prompt >> 描述下图片
image >> image/ssd_car.jpg
[I][                          Encode][ 416]: image encode time : 795.614014 ms, size : 524288
[I][                             Run][ 638]: ttft: 2856.88 ms
这张图片展示了一条繁忙的城市街道。前景中,一名女子站在人行道上,她穿着黑色外套,面带微笑。她旁边是一辆红色的双层巴士,巴士上有一个广告,
上面写着“THINGS GET MORE EXITING WHEN YOU SAY ‘YES’”。巴士的车牌号是“L15”。巴士旁边停着一辆黑色的小型货车。背景中可以看到一些商店和行人,
街道两旁的建筑物是现代的玻璃幕墙建筑。整体氛围显得繁忙而充满活力。

[N][                             Run][ 779]: hit eos,avg 5.96 token/s
```
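The image pipeline runs the vision encoder at 448*448 (see the Image Process table above); `python/cv_resize.py` in the tree suggests inputs are resized on the host. A minimal resize sketch assuming OpenCV (illustrative, not necessarily the repo's exact preprocessing):

```python
import cv2

# Resize an input image to the 448x448 resolution the vision
# encoder expects (see the Image Process table above).
img = cv2.imread("image/ssd_car.jpg")
img = cv2.resize(img, (448, 448), interpolation=cv2.INTER_LINEAR)
cv2.imwrite("image/ssd_car_448.jpg", img)
```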

If you are using the video pipeline:

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_video.sh
[I][                            Init][ 129]: LLM init start
bos_id: -1, eos_id: 151645
  2% | █                                 |   1 /  40 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  40 /  40 [40.05s<40.05s, 1.00 count/s] init vpm axmodel ok,remain_cmm(7680 MB)
[I][                            Init][ 277]: max_token_len : 1023
[I][                            Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 290]: prefill_token_num : 512
[I][                            Init][ 292]: vpm_height : 484,vpm_width : 392
[I][                            Init][ 301]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running

prompt >> 描述这个视频
image >> video
video/frame_0075.jpg
video/frame_0077.jpg
video/frame_0079.jpg
video/frame_0081.jpg
video/frame_0083.jpg
video/frame_0085.jpg
video/frame_0087.jpg
video/frame_0089.jpg
[I][                          Encode][ 416]: image encode time : 1488.392944 ms, size : 991232
[I][                             Run][ 638]: ttft: 5487.22 ms
视频显示的是一个城市街道的场景。时间戳显示为2月26日,地点是xxx。视频中,一名穿着深色外套和牛仔裤的男子正在推着一个行李箱。
突然,他似乎被什么东西绊倒,随后他摔倒在地。背景中可以看到一个广告牌,上面有一个绿色的图案,旁边停着一辆电动车。街道两旁有建筑物和树木,天气看起来有些阴沉。

[N][                             Run][ 779]: hit eos,avg 5.94 token/s
```
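The video pipeline consumes 8 frames at 308*308 (see the Video Process table), and the sample frames above step by two (frame_0075 through frame_0089). A sketch for producing such frames with OpenCV (`input.mp4` and the start index are placeholders):

```python
import cv2

# Sample 8 frames (every 2nd frame, like frame_0075..frame_0089 above)
# and resize them to the 308x308 input the video pipeline expects.
# Assumes input.mp4 exists and the video/ directory has been created.
cap = cv2.VideoCapture("input.mp4")
saved, idx = 0, 0
while saved < 8:
    ok, frame = cap.read()
    if not ok:
        break
    if idx >= 75 and (idx - 75) % 2 == 0:   # start at frame 75, step 2
        frame = cv2.resize(frame, (308, 308))
        cv2.imwrite(f"video/frame_{idx:04d}.jpg", frame)
        saved += 1
    idx += 1
cap.release()
```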

## Inference with M.2 Accelerator card

What is the M.2 Accelerator card? This demo is shown running on a Raspberry Pi 5 fitted with the card.

TODO