happyme531 committed
Commit 4019d6d · verified · 1 Parent(s): aa043cf

Upload 6 files
README.md CHANGED
@@ -1,7 +1,289 @@
  ---
- license: agpl-3.0
  ---
  
- (Placeholder for a document)
  
- NOTE: The vision encoder is currently broken for this model, you can try on it but expect degraded results!
  ---
+ base_model:
+ - Qwen/Qwen2.5-VL-3B-Instruct
+ tags:
+ - rknn
+ - rkllm
  ---
+ # Qwen2.5-VL-3B-Instruct-RKLLM
+
+ Run the powerful Qwen2.5-VL-3B-Instruct vision-language model on the RK3588 with RKLLM!
+
+ - **Inference speed (RK3588)**: vision encoder 3.4 s (3 NPU cores in parallel) + LLM prefill 2.3 s (320 tokens / 138 tps) + decode 8.2 tps
+ - **Memory usage (RK3588, context length 1024)**: 6.1 GB
+
+ ## How to Use
+
+ 1. Clone or download this repository locally. The model files are large, so make sure you have enough disk space.
+
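+ For example, a minimal sketch that fetches everything with git-lfs (the repository URL below is assumed from this model's name; replace it with wherever you are actually downloading from):
+
+ ```bash
+ git lfs install
+ # Assumed repository path -- adjust as needed
+ git clone https://huggingface.co/happyme531/Qwen2.5-VL-3B-Instruct-RKLLM
+ ```
+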
+ 2. The RKNPU2 kernel driver on your board must be version >= 0.9.6 to run a model this large.
+ Run the following command as root to check the driver version:
+ ```bash
+ > cat /sys/kernel/debug/rknpu/version
+ RKNPU driver: v0.9.8
+ ```
+ If the version is too old, update the driver. You may need to update your kernel or consult the official documentation for help.
+
+ 3. Install the dependencies:
+
+ ```bash
+ pip install "numpy<2" opencv-python rknn-toolkit-lite2
+ ```
+
+ 4. Run the demo:
+
+ ```bash
+ python ./run_rkllm.py ./test.jpg ./vision_encoder.rknn ./language_model_w8a8.rkllm 512 1024 3
+ ```
+
+ Parameter descriptions:
+ - `512`: `max_new_tokens`, the maximum number of tokens to generate.
+ - `1024`: `max_context_len`, the maximum context length.
+ - `3`: `npu_core_num`, the number of NPU cores to use.
+
+ If the measured performance is not ideal, switch the CPU governor so the CPUs always run at their highest frequency, and pin the inference process to the big cores (`taskset -c 4-7 python ...`).
+
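+ For example, a minimal sketch assuming the standard cpufreq sysfs layout (policy paths can differ between kernels and boards):
+
+ ```bash
+ # Keep every CPU cluster at its highest frequency (requires root)
+ echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
+ # Pin the demo to the big cores (cores 4-7 on RK3588)
+ taskset -c 4-7 python ./run_rkllm.py ./test.jpg ./vision_encoder.rknn ./language_model_w8a8.rkllm 512 1024 3
+ ```
+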
+ test.jpg:
+ ![test.jpg](./test.jpg)
+
+ Example output:
+
+ ```
+ Initializing ONNX Runtime for vision encoder...
+ W rknn-toolkit-lite2 version: 2.3.2
+ W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
+ Vision encoder loaded successfully.
+ ONNX Input: pixel_values, ONNX Output: vision_features
+ Initializing RKLLM Runtime...
+ I rkllm: rkllm-runtime version: 1.2.1, rknpu driver version: 0.9.8, platform: RK3588
+ I rkllm: loading rkllm model from ./language_model_w8a8.rkllm
+ I rkllm: rkllm-toolkit version: 1.2.1, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: W8A8
+ I rkllm: Enabled cpus: [4, 5, 6, 7]
+ I rkllm: Enabled cpus num: 4
+ I rkllm: Using mrope
+ RKLLM initialized successfully.
+ Preprocessing image...
+ Running vision encoder...
+ W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
+ 视觉编码器推理耗时: 3.5427 秒
+ Image encoded successfully.
+ I rkllm: reset chat template:
+ I rkllm: system_prompt: <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
+ I rkllm: prompt_prefix: <|im_start|>user\n
+ I rkllm: prompt_postfix: <|im_end|>\n<|im_start|>assistant\n
+ W rkllm: Calling rkllm_set_chat_template will disable the internal automatic chat template parsing, including enable_thinking. Make sure your custom prompt is complete and valid.
+
+ **********************可输入以下问题对应序号获取回答/或自定义输入********************
+
+ [0] Picture 1: <image> What is in the image?
+ [1] Picture 1: <image> 这张图片中有什么?
+
+ *************************************************************************
+
+
+ user: 0
+ Picture 1: <image> What is in the image?
+ robot: n_image_tokens: 289
+ The image shows a cozy bedroom with several notable features:
+
+ - A large bed covered with a blue comforter.
+ - A wooden dresser next to the bed, topped with various items including a mirror and some decorative objects.
+ - A window allowing natural light into the room, offering a view of greenery outside.
+ - A bookshelf filled with numerous books on shelves.
+ - A basket placed near the foot of the bed.
+ - A lamp on a side table beside the bed.
+
+ The overall ambiance is warm and inviting.
+
+ I rkllm: --------------------------------------------------------------------------------------
+ I rkllm: Model init time (ms) 3361.48
+ I rkllm: --------------------------------------------------------------------------------------
+ I rkllm: Stage Total Time (ms) Tokens Time per Token (ms) Tokens per Second
+ I rkllm: --------------------------------------------------------------------------------------
+ I rkllm: Prefill 2201.45 321 6.86 145.81
+ I rkllm: Generate 12419.47 102 121.76 8.21
+ I rkllm: --------------------------------------------------------------------------------------
+ I rkllm: Peak Memory Usage (GB)
+ I rkllm: 6.19
+ I rkllm: --------------------------------------------------------------------------------------
+
+ user: 1
+ Picture 1: <image> 这张图片中有什么?
+ robot: n_image_tokens: 289
+ 这张照片展示了一个卧室的内部。房间有一扇大窗户,可以看到外面的绿色植物。房间里有各种物品:一个蓝色的大床单覆盖在一张床上;一盏灯放在梳妆台上;一面镜子挂在墙上;书架上摆满了书籍和一些装饰品;还有一些篮子、花盆和其他小物件散落在周围。
+
+ I rkllm: --------------------------------------------------------------------------------------
+ I rkllm: Stage Total Time (ms) Tokens Time per Token (ms) Tokens per Second
+ I rkllm: --------------------------------------------------------------------------------------
+ I rkllm: Prefill 184.35 13 14.18 70.52
+ I rkllm: Generate 8711.49 72 120.99 8.26
+ I rkllm: --------------------------------------------------------------------------------------
+ I rkllm: Peak Memory Usage (GB)
+ I rkllm: 6.19
+ I rkllm: --------------------------------------------------------------------------------------
+ ```
+
+ ## Model Conversion
+
+ #### Prerequisites
+
+ 1. Install rknn-toolkit2 and rkllm-toolkit:
+ ```bash
+ pip install -U rknn-toolkit2
+ ```
+ rkllm-toolkit has to be downloaded manually from here: https://github.com/airockchip/rknn-llm/tree/main/rkllm-toolkit
+
+ 2. Download this repository locally; the model files ending in `.rkllm` and `.rknn` are not needed (see the sketch after this list).
+ 3. Download the Qwen2.5-VL-3B-Instruct huggingface model repository locally. ( https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct )
+
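+ For step 2, a minimal sketch that clones the repository while skipping the large LFS blobs (same assumed URL as above; `GIT_LFS_SKIP_SMUDGE` leaves pointer files in place instead of downloading the binaries):
+
+ ```bash
+ # Assumed repository path -- adjust as needed
+ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/happyme531/Qwen2.5-VL-3B-Instruct-RKLLM
+ ```
+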
+ #### Convert the LLM
+
+ Copy `rkllm-convert.py` into the Qwen2.5-VL-3B-Instruct model folder and run:
+ ```bash
+ python rkllm-convert.py
+ ```
+ It uses w8a8 quantization by default; open the script to change the quantization type and other settings.
+
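+ For orientation, the conversion roughly follows the public rkllm-toolkit flow sketched below; the parameter names here are an assumption based on the toolkit's examples, so treat `rkllm-convert.py` itself as the authoritative version:
+
+ ```python
+ # Rough sketch of an rkllm-toolkit conversion (not the exact contents of rkllm-convert.py)
+ from rkllm.api import RKLLM
+
+ llm = RKLLM()
+ ret = llm.load_huggingface(model='.')    # path to the Qwen2.5-VL-3B-Instruct folder
+ assert ret == 0
+ ret = llm.build(do_quantization=True,
+                 quantized_dtype='w8a8',  # change the quantization type here
+                 target_platform='rk3588')
+ assert ret == 0
+ ret = llm.export_rkllm('./language_model_w8a8.rkllm')
+ assert ret == 0
+ ```
+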
+ #### Convert the Vision Encoder
+
+ 1. Export ONNX
+
+ Copy `export_vision_onnx.py` into the root of the Qwen2.5-VL-3B-Instruct model folder, then run the following **from that root directory**:
+ ```bash
+ mkdir vision
+ python ./export_vision_onnx.py . --savepath ./vision/vision_encoder.onnx
+ ```
+ The vision encoder is exported to `vision/vision_encoder.onnx`. The default height and width are 476; you can change them with the `--height` and `--width` parameters.
+
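+ For example, to export at 448x448 instead (an illustrative value; if you change the size, keep `IMAGE_HEIGHT`/`IMAGE_WIDTH` in `run_rkllm.py` in sync):
+
+ ```bash
+ python ./export_vision_onnx.py . --height 448 --width 448 --savepath ./vision/vision_encoder.onnx
+ ```
+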
+ 2. Model optimization (optional)
+
+ Download `split_matmul_onnx_profile.py` from https://github.com/happyme531/rknn-toolkit2-utils, then run:
+ ```bash
+ python ./split_matmul_onnx_profile.py --input vision/vision_encoder.onnx --output vision_encoder_opt.onnx --pattern "/visual/blocks\..*?/mlp/down_proj.*" --factor 5
+ ```
+ The optimized model is written to `vision_encoder_opt.onnx`.
+
+ 3. Convert to RKNN
+
+ ```bash
+ python ./convert_vision_encoder.py ./vision_encoder_opt.onnx
+ ```
+ (This step can take more than 20 minutes.)
+ The converted model is written to `vision_encoder_opt.rknn`.
+
+ To match the command in the "How to Use" section, you can rename it:
+ ```bash
+ mv vision_encoder_opt.rknn vision_encoder.rknn
+ ```
+
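+ Optionally, here is a minimal sketch for a quick on-board sanity check of the converted encoder with rknn-toolkit-lite2; the 476x476 size and the simple /255 normalization are assumptions for illustration only, and `run_rkllm.py` remains the authoritative preprocessing:
+
+ ```python
+ # Smoke test for vision_encoder.rknn on the RK3588 (not the full demo pipeline)
+ import cv2
+ import numpy as np
+ from rknnlite.api import RKNNLite
+
+ rknn = RKNNLite()
+ assert rknn.load_rknn('./vision_encoder.rknn') == 0
+ # Use all three NPU cores, matching npu_core_num = 3 in the demo command
+ assert rknn.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2) == 0
+
+ img = cv2.cvtColor(cv2.imread('./test.jpg'), cv2.COLOR_BGR2RGB)
+ img = cv2.resize(img, (476, 476)).astype(np.float32) / 255.0   # placeholder normalization
+ img = np.transpose(img, (2, 0, 1))[None, ...]                   # NCHW; the runtime converts to NHWC itself
+
+ features = rknn.inference(inputs=[img])[0]
+ print('vision_features shape:', features.shape)  # the demo log reports n_image_tokens: 289 for 476x476
+ rknn.release()
+ ```
+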
+ ## Known Issues
+
+ - Due to limitations of RKLLM's multimodal input, only one image can be loaded per conversation.
+ - Multi-turn conversation is not implemented.
+ - RKLLM's w8a8 quantization appears to introduce a noticeable accuracy loss.
+ - Possibly due to RKNPU2 memory-access patterns, the model is, oddly, noticeably faster when the input side lengths are not multiples of 64.
+
+ ## References
+
+ - [Qwen/Qwen2.5-VL-3B-Instruct-RKLLM](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-RKLLM)
+
convert_vision_encoder.py ADDED
@@ -0,0 +1,73 @@
+ #!/usr/bin/env python
+ # coding: utf-8
+
+ import datetime
+ import argparse
+ from rknn.api import RKNN
+ from sys import exit
+
+
+ parser = argparse.ArgumentParser(description='Convert ONNX to RKNN model.')
+ parser.add_argument('onnx_model', type=str, help='Path to the input ONNX model file.')
+ args = parser.parse_args()
+
+
+ ONNX_MODEL = args.onnx_model
+ RKNN_MODEL = ONNX_MODEL.replace(".onnx", ".rknn")
+ DATASET = "/home/zt/rk3588-nn/rknn_model_zoo/datasets/COCO/coco_subset_20.txt"  # only used when QUANTIZE is True
+ QUANTIZE = False
+ detailed_performance_log = True
+
+ timedate_iso = datetime.datetime.now().isoformat()
+
+ rknn = RKNN(verbose=True)
+ rknn.config(
+     # mean_values=[x * 255 for x in [0.485, 0.456, 0.406]],
+     # std_values=[x * 255 for x in [0.229, 0.224, 0.225]],
+     quantized_dtype="w8a8",
+     quantized_algorithm="normal",
+     quantized_method="channel",
+     quantized_hybrid_level=0,
+     target_platform="rk3588",
+     quant_img_RGB2BGR=False,
+     float_dtype="float16",
+     optimization_level=3,
+     custom_string=f"converted by: email: [email protected] at {timedate_iso}",
+     remove_weight=False,
+     compress_weight=False,
+     inputs_yuv_fmt=None,
+     single_core_mode=False,
+     # dynamic_input=[  # use either this or inputs + input_size_list, not both
+     #     [
+     #         [1, 3, 240, 320],
+     #         # ...
+     #     ],
+     #     [
+     #         [1, 3, 480, 640],
+     #         # ...
+     #     ],
+     #     [
+     #         [1, 3, 960, 1280],
+     #         # ...
+     #     ],
+     # ],
+     model_pruning=False,
+     op_target={'Gather':'cpu'},
+     quantize_weight=False,
+     remove_reshape=False,
+     sparse_infer=False,
+     enable_flash_attention=False,
+     # hidden / undocumented parameters
+     # disable_rules=[],
+     # sram_prefer=False,
+     # nbuf_prefer=False,
+     # check_data=[],
+ )
+
+ ret = rknn.load_onnx(model=ONNX_MODEL)
+ ret = rknn.build(do_quantization=QUANTIZE, dataset=DATASET, rknn_batch_size=None)
+ ret = rknn.export_rknn(RKNN_MODEL)
+
+ # ret = rknn.init_runtime(target='rk3588',core_mask=RKNN.NPU_CORE_0,perf_debug=detailed_performance_log)
+ # rknn.eval_perf()
+ # ret = rknn.accuracy_analysis(inputs=['processed_images_rknn.npy'], target='rk3588')
export_vision_onnx.py ADDED
@@ -0,0 +1,97 @@
+ import argparse
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer
+
+
+ def build_patches_and_grid(pixel_values, temporal_patch_size, patch_size, merge_size):
+     assert pixel_values.dim() == 4, "pixel_values must be (N, C, H, W)"
+     N, C, H, W = pixel_values.shape
+     if H % patch_size != 0 or W % patch_size != 0:
+         raise ValueError(f"H({H}) and W({W}) must be divisible by patch_size({patch_size})")
+     if (H // patch_size) % merge_size != 0 or (W // patch_size) % merge_size != 0:
+         raise ValueError(
+             f"(H/patch_size, W/patch_size)=({H//patch_size},{W//patch_size}) must be divisible by merge_size({merge_size})"
+         )
+     if N == 1:
+         pixel_values = pixel_values.repeat(temporal_patch_size, 1, 1, 1)
+     elif N % temporal_patch_size != 0:
+         repeat_time = temporal_patch_size - (N % temporal_patch_size)
+         repeat_image = pixel_values[-1:, ...].repeat(repeat_time, 1, 1, 1)
+         pixel_values = torch.cat((pixel_values, repeat_image), dim=0)
+
+     grid_t = pixel_values.shape[0] // temporal_patch_size
+     grid_h = H // patch_size
+     grid_w = W // patch_size
+
+     patches = pixel_values.reshape(
+         grid_t,
+         temporal_patch_size,
+         C,
+         grid_h // merge_size,
+         merge_size,
+         patch_size,
+         grid_w // merge_size,
+         merge_size,
+         patch_size,
+     )
+     patches = patches.permute(0, 3, 6, 4, 7, 2, 1, 5, 8)
+     flatten_patches = patches.reshape(
+         grid_t * grid_h * grid_w, C * temporal_patch_size * patch_size * patch_size
+     )
+     grid_thw = torch.tensor([[grid_t, grid_h, grid_w]], dtype=torch.int32, device=flatten_patches.device)
+     return flatten_patches, grid_thw
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('path', type=str, help='model path')
+     parser.add_argument('--batch', type=int, default=1, required=False, help='batch size')
+     parser.add_argument('--height', type=int, default=476, required=False, help='image height')
+     parser.add_argument('--width', type=int, default=476, required=False, help='image width')
+     parser.add_argument('--savepath', type=str, default='vision_encoder.onnx', required=False, help='output path')
+     args = parser.parse_args()
+
+     model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+         args.path,
+         torch_dtype=torch.float32,
+         low_cpu_mem_usage=True,
+         trust_remote_code=True,
+         attn_implementation="eager",
+     ).eval()
+     _ = AutoTokenizer.from_pretrained(args.path, trust_remote_code=True, use_fast=False)
+
+     vcfg = model.visual.config
+     merge_size = int(vcfg.spatial_merge_size)
+     patch_size = int(vcfg.patch_size)
+     temporal_patch_size = int(vcfg.temporal_patch_size)
+
+     # Build a dummy input
+     N, C, H, W = int(args.batch), 3, int(args.height), int(args.width)
+     pixel_values = torch.randn(N, C, H, W, dtype=torch.float32)
+
+     with torch.no_grad():
+         fp, gthw = build_patches_and_grid(pixel_values, temporal_patch_size, patch_size, merge_size)
+         vision_features = model.visual(fp, gthw)
+         print(f"vision feature shape: {vision_features.shape}")
+         print(f"number of vision tokens: {vision_features.shape[0]}")
+
+     def top_forward(pixel_values_in):
+         fp, gthw = build_patches_and_grid(pixel_values_in, temporal_patch_size, patch_size, merge_size)
+         return model.visual(fp, gthw)
+
+     model.forward = top_forward
+
+     torch.onnx.export(
+         model,
+         (pixel_values,),
+         args.savepath,
+         opset_version=17,
+         input_names=["pixel_values"],
+         output_names=["vision_features"],
+     )
+
+
+ if __name__ == '__main__':
+     main()
+
rkllm-convert.py CHANGED
@@ -17,7 +17,7 @@ if ret != 0:
      exit(ret)
  
  # Export rkllm model
- ret = llm.export_rkllm("./language_model.rkllm")
+ ret = llm.export_rkllm("./language_model_w8a8.rkllm")
  if ret != 0:
      print('Export model failed!')
      exit(ret)
run_rkllm.py CHANGED
@@ -20,8 +20,8 @@ from rkllm_binding import (
  )
  
  # Constants
- IMAGE_HEIGHT = 448
- IMAGE_WIDTH = 448
+ IMAGE_HEIGHT = 476
+ IMAGE_WIDTH = 476
  
  def expand2square(img, background_color):
      """
@@ -69,14 +69,16 @@ def main():
      # The rknn_core_num is not directly used by onnxruntime in the same way,
      # but we keep it for API consistency with the C++ example.
      # ONNX Runtime will manage its own threading and execution providers.
-     parser.add_argument("rknn_core_num", type=int, help="Core number for RKNN (informational for this script).")
+     parser.add_argument("rknn_core_num", type=int, help="Sets the number of npu cores used in vision encoder.")
  
      args = parser.parse_args()
  
      # --- 1. Initialize Image Encoder (ONNX Runtime) ---
      print("Initializing ONNX Runtime for vision encoder...")
      try:
-         ort_session = ort.InferenceSession(args.encoder_model_path)
+         sess_options = ort.SessionOptions()
+         sess_options.intra_op_num_threads = args.rknn_core_num
+         ort_session = ort.InferenceSession(args.encoder_model_path, sess_options=sess_options)
      except Exception as e:
          print(f"Failed to load ONNX model: {e}")
          sys.exit(1)
@@ -131,8 +133,12 @@ def main():
  
      # --- 4. Run Image Encoder ---
      print("Running vision encoder...")
+     import time
+     start_time = time.time()
      try:
          img_vec_output = ort_session.run([output_name], {input_name: input_tensor.astype(np.float32)})[0]
+         elapsed_time = time.time() - start_time
+         print(f"视觉编码器推理耗时: {elapsed_time:.4f} 秒")
          # The output from C++ is a flat float array. Let's flatten the ONNX output.
          img_vec = img_vec_output.flatten().astype(np.float32)
  
vision_encoder.rknn CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:165201488d5abaa6fd8c9d471b6d49ab18c508bf2a4f161a5e7164d18438a23c
- size 1424694394
+ oid sha256:401402b3cfa6ab292bb7ae51c208f51a14c36cf1a534ab5392b24efc315fb60f
+ size 1557737667