Qwen3-VL

These versions of Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct have been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself using the original repo:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card

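Image Processing

| Chip | Input size | Image count | Image encoder | TTFT (320 tokens) | Decode speed (w8a16) |
|------|------------|-------------|---------------|-------------------|----------------------|
| AX650 | 384*384 | 1 | 207 ms | 392 ms | 9.5 tokens/sec |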

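Video Processing

| Chip | Input size | Image count | Image encoder | TTFT (600 tokens) | Decode speed (w8a16) |
|------|------------|-------------|---------------|-------------------|----------------------|
| AX650 | 384*384 | 8 | 725 ms | 1045 ms | 9.5 tokens/sec |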

The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value.
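To get a feel for what the figures above mean for a whole request, the sketch below adds up encoder time, TTFT, and decode time. It is a rough estimate only: the function name and the 100-token reply length are illustrative, and it assumes the three stages run one after another without overlap. If the reported TTFT already includes image encoding, pass `encoder_ms=0` instead.

```python
def estimate_latency_s(encoder_ms: float, ttft_ms: float,
                       decode_tps: float, new_tokens: int) -> float:
    """Rough wall-clock estimate for one request, in seconds."""
    # Assumption: image encoding, prefill (TTFT) and decoding are sequential.
    before_first_token_s = (encoder_ms + ttft_ms) / 1000.0
    # Decoding cost grows linearly with the number of generated tokens.
    decode_s = new_tokens / decode_tps
    return before_first_token_s + decode_s


# Figures taken from the image and video tables; 100 tokens is an arbitrary reply length.
image_s = estimate_latency_s(207, 392, 9.5, new_tokens=100)
video_s = estimate_latency_s(725, 1045, 9.5, new_tokens=100)
print(f"image: {image_s:.1f} s, video: {video_s:.1f} s")
```

At 9.5 tokens/sec, decoding dominates for all but very short replies, so the encoder and TTFT figures mostly affect how quickly the first token appears.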

How to use

Download all files from this repository to the device

If you are using an AX650 board

Prepare the tokenizer server

Install transformers

pip install -r requirements.txt

Demo Run

Image understanding demo

Start the tokenizer server for the image understanding demo

python3 tokenizer_images.py --port 8080
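The demo connects to the tokenizer server at startup (the log below shows `connect http://127.0.0.1:8080 ok`), so it can help to confirm the port is accepting connections before launching the run script. The following is a minimal sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, not part of this repository, and it only checks that a TCP connection succeeds, not that the server answers tokenizer requests correctly.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # back off briefly before retrying
    return False
```

For example, calling `wait_for_port("127.0.0.1", 8080)` after starting the tokenizer server should return `True` once the server is listening on the port passed via `--port`.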

Run the image understanding demo

Input text

描述这张图片 ("Describe this image")

Input image

root@ax650 ~/Qwen3-VL # bash run_qwen3_vl_2b_image.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
  3% | ██                                |   1 /  31 [0.01s<0.31s, 100.00 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  6% | ███                               |   2 /  31 [0.02s<0.25s, 125.00 count/s] embed_selector init ok[I][                            Init][ 198]: attr.axmodel_num:28
103% | ██████████████████████████████████ |  32 /  31 [6.58s<6.37s, 4.86 count/s] init vpm axmodel ok,remain_cmm(3678 MB)[I][                            Init][ 263]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 306]: image encoder output float32

[I][                            Init][ 336]: max_token_len : 2047
[I][                            Init][ 341]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 349]: prefill_token_num : 128
[I][                            Init][ 353]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 353]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 353]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 353]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 353]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 353]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 353]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 353]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 353]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 353]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 357]: prefill_max_token_num : 1152
[I][                            Init][ 366]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片 
image >> images/demo.jpg
[I][                          Encode][ 490]: image encode time : 207.362000 ms, size : 1
[I][                          Encode][ 533]: input_ids size:167
[I][                          Encode][ 541]: offset 15
[I][                          Encode][ 557]: img_embed.size:1, 294912
[I][                          Encode][ 571]: out_embed size:342016
[I][                          Encode][ 573]: position_ids size:167
[I][                             Run][ 591]: input token num : 167, prefill_split_num : 2
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:39
[I][                             Run][ 786]: ttft: 391.51 ms
好的，这是一张描绘海滩上温馨场景的照片。

照片捕捉了一个宁静而美好的瞬间：一位女士和她的狗在沙滩上。她们正坐在柔软的沙地上，背景是波光粼粼的海面和柔和的天空。阳光从画面的上方洒下，为整个场景镀上了一层温暖的金色，营造出一种宁静、舒适和幸福的氛围。

女士穿着一件格子衬衫，她正侧身对着镜头，脸上带着温柔的微笑，似乎在与她的狗互动。她的狗，一只毛茸茸的浅色狗，正用前爪轻轻搭在她的腿上，显得非常亲密和信任。狗的尾巴微微翘起，显示出它兴奋和快乐的情绪。

整个画面充满了爱与陪伴的温暖感觉，展现了人与宠物之间深厚的情感联系。

[N][                             Run][ 913]: hit eos,avg 9.35 token/s

prompt >>
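The log above reports `prefill_token_num : 128` and splits the 167-token prompt into two prefill passes of 128 and 39 tokens. The sketch below reproduces that arithmetic. It is an illustration of the chunking pattern visible in the log, not the runtime's actual code, and `split_prefill` is a hypothetical name.

```python
def split_prefill(num_tokens: int, chunk_size: int = 128) -> list[int]:
    """Split a prompt of num_tokens into prefill chunks of at most chunk_size tokens."""
    if num_tokens < 0 or chunk_size <= 0:
        raise ValueError("num_tokens must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(num_tokens, chunk_size)
    # Full chunks first, then one shorter chunk for whatever is left over.
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


print(split_prefill(167))  # [128, 39], matching prefill_split_num : 2
print(split_prefill(600))  # [128, 128, 128, 128, 88], matching the video log
```

The largest prefill group listed in the log is 1152 tokens (9 chunks of 128), so longer prompts would presumably not fit into a single prefill with this build.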

Video understanding demo

Start the tokenizer server for the video understanding demo

python tokenizer_video.py --port 8080

Run the video understanding demo

Input text

描述这个视频 ("Describe this video")

Input video

./video
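The video input is a directory of frame images, and the log below lists eight of them (`frame_0000.jpg`, `frame_0008.jpg`, ... `frame_0056.jpg`). How the demo picks frames is not documented here; the sketch below simply shows one selection rule that is consistent with that log, taking every 8th frame and keeping at most 8. The function name, the `frame_*.jpg` pattern, and the default values are assumptions.

```python
from pathlib import Path


def pick_frames(frame_dir: str, stride: int = 8, max_frames: int = 8) -> list[str]:
    """Pick every `stride`-th frame from a directory, keeping at most `max_frames`."""
    # Sorting by name works because the frame index is zero-padded.
    frames = sorted(p.as_posix() for p in Path(frame_dir).glob("frame_*.jpg"))
    return frames[::stride][:max_frames]
```

With a `video` directory holding `frame_0000.jpg` through `frame_0063.jpg`, `pick_frames("./video")` would return the same eight paths that appear in the log.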

root@ax650 ~/Qwen3-VL # bash run_qwen3_vl_2b_video.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
  3% | ██                                |   1 /  31 [0.01s<0.31s, 100.00 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  6% | ███                               |   2 /  31 [0.01s<0.20s, 153.85 count/s] embed_selector init ok[I][                            Init][ 198]: attr.axmodel_num:28
103% | ██████████████████████████████████ |  32 /  31 [30.34s<29.39s, 1.05 count/s] init vpm axmodel ok,remain_cmm(3678 MB)[I][                            Init][ 263]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 306]: image encoder output float32

[I][                            Init][ 336]: max_token_len : 2047
[I][                            Init][ 341]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 349]: prefill_token_num : 128
[I][                            Init][ 353]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 353]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 353]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 353]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 353]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 353]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 353]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 353]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 353]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 353]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 357]: prefill_max_token_num : 1152
[I][                            Init][ 366]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                          Encode][ 490]: image encode time : 751.804993 ms, size : 4
[I][                          Encode][ 533]: input_ids size:600
[I][                          Encode][ 541]: offset 15
[I][                          Encode][ 557]: img_embed.size:4, 294912
[I][                          Encode][ 562]: offset:159
[I][                          Encode][ 562]: offset:303
[I][                          Encode][ 562]: offset:447
[I][                          Encode][ 571]: out_embed size:1228800
[I][                          Encode][ 573]: position_ids size:600
[I][                             Run][ 591]: input token num : 600, prefill_split_num : 5
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:88
[I][                             Run][ 786]: ttft: 1040.91 ms
根据您提供的图片，这是一段关于两只土拨鼠在山地环境中互动的视频片段。

- **主体**：画面中有两只土拨鼠（也称“山地土拨鼠”或“黑背土拨鼠”），它们正站在一块布满碎石的草地上。它们的毛色为灰褐色与黑色相间，面部有明显的黑色条纹，这是土拨鼠的典型特征。

- **行为**：这两只土拨鼠正进行着一种看似玩耍或社交的互动。它们用前爪互相拍打，身体前倾，姿态充满活力。这种行为在土拨鼠中通常表示友好、玩耍或建立社交联系。

- **环境**：背景是连绵起伏的山脉，山坡上覆盖着绿色的植被，天空晴朗，阳光明媚。整个场景给人一种自然、宁静又充满生机的感觉。

- **视频风格**：从画面的清晰度和动态感来看，这可能是一段慢动作或高清晰度的视频片段，捕捉了土拨鼠活泼、生动的瞬间。

综上所述，这段视频生动地记录了两只土拨鼠在自然山地环境中友好互动的场景，展现了它们活泼、充满活力的天性。

[N][                             Run][ 913]: hit eos,avg 9.44 token/s

prompt >>
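The sizes printed in the two logs can be cross-checked with a little arithmetic. The sketch below reproduces them under assumed values that were chosen because they match the logs, not taken from this repository: a 16-pixel patch size, 2x2 spatial merging, two frames per temporal group, and an embedding width of 2048.

```python
PATCH = 16     # assumed ViT patch size in pixels
MERGE = 2      # assumed spatial merge factor (2x2 patches -> 1 token)
TEMPORAL = 2   # assumed number of frames fused into one temporal group
HIDDEN = 2048  # assumed embedding width of the 2B model


def tokens_per_image(side: int = 384) -> int:
    """Visual tokens produced by one square input of the given side length."""
    grid = side // PATCH // MERGE
    return grid * grid


def visual_offsets(first_offset: int, groups: int, side: int = 384) -> list[int]:
    """Start positions of each group of visual tokens inside the prompt."""
    step = tokens_per_image(side)
    return [first_offset + i * step for i in range(groups)]


per_image = tokens_per_image()      # 144 tokens per 384*384 input
groups = 8 // TEMPORAL              # 8 frames -> 4 groups ("size : 4" in the log)
print(per_image * HIDDEN)           # 294912, the img_embed size in both logs
print(visual_offsets(15, groups))   # [15, 159, 303, 447], the offsets in the video log
print(600 * HIDDEN)                 # 1228800, the out_embed size in the video log
```

Under the same assumptions, the image run works out to 144 visual tokens plus 23 text tokens, giving the 167 input tokens and the `out_embed size:342016` (167 * 2048) reported in the image log.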