FlashVL-2B-Dynamic-ISS

[📜 FlashVL]


Introduction

We introduce FlashVL, an approach to optimizing Vision-Language Models (VLMs) for real-time applications that targets ultra-low latency and high throughput without sacrificing accuracy. Flash-VL 2B combines architectural enhancements with efficient computational strategies to reduce processing time while maintaining competitive performance across multiple vision-language benchmarks. The approach covers tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching (ISS), which balances computational load against model performance. In evaluations on 11 standard VLM benchmarks, Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising option for deployment in resource-constrained environments and large-scale real-time applications.

Environment Setup

pip install torch==2.1.2
pip install transformers==4.50.0.dev0
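
After installing, it can help to confirm that the environment actually matches the pinned versions. The snippet below is a minimal sketch of such a check; `check_pins` and `PINNED` are hypothetical names introduced here, not part of FlashVL. Note that a `.dev0` build of transformers may not be available from PyPI and may need to be installed from source, so a mismatch on that package is plausible.

```python
from importlib import metadata

# Versions pinned in the setup commands above.
PINNED = {"torch": "2.1.2", "transformers": "4.50.0.dev0"}


def check_pins(pinned=PINNED):
    """Return {package: (installed_or_None, pinned, matches)} for each package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # torch wheels may carry a local suffix such as "+cu121"; ignore it.
        base = installed.split("+")[0] if installed else None
        report[name] = (installed, wanted, base == wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_pins().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} pinned={wanted} {status}")
```

A `MISMATCH` line does not necessarily mean the model will fail to load; it only flags that the environment differs from the one listed above.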

How to use it?

import torch
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

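model_path = "FlashVL/FlashVL-2B-Dynamic-ISS"
# Load the weights in bfloat16 on the GPU; trust_remote_code runs the repo's custom model code.
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='cuda')
# Attach the tokenizer and image processor to the model as attributes.
model.tokenizer = AutoTokenizer.from_pretrained(model_path, device_map='cuda')
model.im_trans = CLIPImageProcessor.from_pretrained(model_path)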

# single-image single-round conversation
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/0516.png"
response = requests.get(image_url)
response.raise_for_status()  # stop early if the download failed
image_data = BytesIO(response.content)
pil_image = Image.open(image_data).convert('RGB')
# Prompt (Chinese): "Generate a recipe for the dish in the image."
messages = [{'role': 'user', 'content': "生成图中菜品的菜谱"}] # answer: EXTRA
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

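# single-image multi-round conversation
# Turn 1, user: "What is this?"
# Turn 1, assistant (gloss): it looks like a tremella and lotus seed sweet soup with
#   goji berries and walnuts; tasty and with some nutritional value.
# Turn 2, user: "Analyze the calories of the dish in the image."
# Adjacent string literals are joined, so no stray indentation ends up in the text.
messages = [
    {'role': 'user', 'content': '这是什么'},
    {'role': 'assistant', 'content': (
        '这是一道看起来像是银耳莲子汤的甜品。'
        '银耳是一种常见的食材,通常用于制作甜品和汤品,具有软糯的口感和清润的口感。'
        '莲子是莲子的干燥部分,常用于中医和食疗中,具有补脾止泻的功效。'
        '图片中还可以看到一些枸杞和核桃,枸杞富含维生素和抗氧化物质,核桃则提供丰富的蛋白质和健康脂肪。'
        '整体来看,这道甜品不仅美味,还具有一定的营养价值。'
    )},
    {'role': 'user', 'content': '对图中菜品卡路里分析'},
]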
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# pure-text single-round conversation
messages = [{'role': 'user', 'content': "who are you"}]
answer = model.chat(None, messages, do_sample=False, max_new_tokens=256)
print(answer)
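
The multi-round example above writes the assistant turn by hand. In practice, each reply would be appended to the history before the next question is asked. The following is a minimal sketch of that bookkeeping using the same `role`/`content` message format; `run_dialogue` and `echo_chat` are hypothetical helpers introduced for illustration, and the sketch assumes `model.chat` returns the answer as a string.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def run_dialogue(chat_fn: Callable[[List[Message]], str],
                 questions: List[str],
                 history: Optional[List[Message]] = None) -> List[Message]:
    """Ask each question in turn, appending every reply before the next turn."""
    messages = list(history or [])  # copy so the caller's list is not mutated
    for question in questions:
        messages.append({"role": "user", "content": question})
        answer = chat_fn(messages)
        messages.append({"role": "assistant", "content": answer})
    return messages


def echo_chat(messages: List[Message]) -> str:
    """Stand-in for the model so the sketch runs without downloading weights."""
    return f"(reply to: {messages[-1]['content']})"


history = run_dialogue(echo_chat, ["What is this?", "Estimate its calories."])
for turn in history:
    print(f"{turn['role']}: {turn['content']}")
```

With the model loaded as shown earlier, `echo_chat` could be swapped for something like `lambda msgs: model.chat(pil_image, msgs, do_sample=False, max_new_tokens=256)`, under the assumption stated above about the return type.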

Evaluation

| Benchmark | Qwen2-VL-2B | Aquila-VL-2B | InternVL2.5-2B | Flash-VL-2B (s) | Flash-VL-2B (d) | Flash-VL-2B (d-ISS) |
|---|---|---|---|---|---|---|
| MMMU (val) | 41.9 | 44.4 | 41.8 | 43.6 | 42.9 | 42.9 |
| MMBench (en) | 74.9 | 78.6 | 74.7 | 78.4 | 78.4 | 79.1 |
| MMBench (cn) | 73.5 | 76.3 | 71.6 | 74.7 | 74.9 | 76.7 |
| MMStar | 48.0 | 54.9 | 54.1 | 53.8 | 54.4 | 54.1 |
| MathVista (testmini) | 43.0 | 59.4 | 50.9 | 59.3 | 58.1 | 61.5 |
| AI2D (test) | 74.1 | 75.0 | 75.1 | 74.2 | 74.1 | 74.4 |
| MMVet | 49.5 | 40.9 | 61.7 | 47.3 | 52.7 | 50.7 |
| HallusionBench | 39.2 | 38.5 | 42.7 | 43.5 | 45.5 | 49.0 |
| OCRBench | 794 | 773 | 800 | 764 | 831 | 843 |
| MME | 1872 | 1813 | 2091 | 1715 | 1866 | 1850 |
| SEEDBench | 71.5 | 78.9 | 73.2 | 73.6 | 73.6 | 74.5 |
| Average | 60.2 | 62.6 | 63.6 | 62.4 | 64.0 | 64.8 |
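
The table does not say how the Average row is computed, and OCRBench and MME are reported on larger scales than the other benchmarks. The sketch below assumes OCRBench is divided by 10 (it is scored out of 1000) and MME by 28 (perception plus cognition is scored out of 2800) before taking a plain mean over the 11 benchmarks. This normalization is an assumption, not something the source states; under it, the computed values agree with the reported averages to within 0.1.

```python
# Scores copied from the table above, in row order:
# MMMU, MMBench-en, MMBench-cn, MMStar, MathVista, AI2D, MMVet,
# HallusionBench, OCRBench, MME, SEEDBench.
SCORES = {
    "Qwen2-VL-2B":         [41.9, 74.9, 73.5, 48.0, 43.0, 74.1, 49.5, 39.2, 794, 1872, 71.5],
    "Aquila-VL-2B":        [44.4, 78.6, 76.3, 54.9, 59.4, 75.0, 40.9, 38.5, 773, 1813, 78.9],
    "InternVL2.5-2B":      [41.8, 74.7, 71.6, 54.1, 50.9, 75.1, 61.7, 42.7, 800, 2091, 73.2],
    "Flash-VL-2B (s)":     [43.6, 78.4, 74.7, 53.8, 59.3, 74.2, 47.3, 43.5, 764, 1715, 73.6],
    "Flash-VL-2B (d)":     [42.9, 78.4, 74.9, 54.4, 58.1, 74.1, 52.7, 45.5, 831, 1866, 73.6],
    "Flash-VL-2B (d-ISS)": [42.9, 79.1, 76.7, 54.1, 61.5, 74.4, 50.7, 49.0, 843, 1850, 74.5],
}

# The Average row as reported in the table.
REPORTED_AVERAGE = {
    "Qwen2-VL-2B": 60.2,
    "Aquila-VL-2B": 62.6,
    "InternVL2.5-2B": 63.6,
    "Flash-VL-2B (s)": 62.4,
    "Flash-VL-2B (d)": 64.0,
    "Flash-VL-2B (d-ISS)": 64.8,
}

OCRBENCH_IDX, MME_IDX = 8, 9


def normalized_average(row):
    """Mean over all benchmarks after rescaling OCRBench and MME (assumed scheme)."""
    scaled = list(row)
    scaled[OCRBENCH_IDX] /= 10.0  # assumption: OCRBench 0-1000 mapped to 0-100
    scaled[MME_IDX] /= 28.0       # assumption: MME 0-2800 mapped to 0-100
    return sum(scaled) / len(scaled)


for name, row in SCORES.items():
    computed = normalized_average(row)
    print(f"{name}: computed {computed:.2f}, reported {REPORTED_AVERAGE[name]}")
```

Small differences between the computed and reported values may come from rounding of the per-benchmark scores shown in the table.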

We use VLMEvalKit to evaluate FlashVL-2B-Static.

Citation

If you find this project useful in your research, please consider citing:

@misc{zhang2025flashvl2boptimizingvisionlanguage,
      title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput}, 
      author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
      year={2025},
      eprint={2505.09498},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.09498}, 
}