---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- linear-attention
- gated-deltanet
- infinitevl
- multimodal
---

<div align="center">

<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/Logo.png" width="500" alt="InfiniteVL Logo">

<h3 align="center">
🚀 <b>All Training & Model Code is Open-Sourced!</b> <br>
We welcome your usage and feedback. Please support us with a <a href="https://github.com/hustvl/InfiniteVL">Star 🌟</a>!
</h3>

<hr>

### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao<sup>1</sup>,
[Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>,
[Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>,
Haoran Yin<sup>2</sup>,
[Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>,
[Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>,
[Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>

<sup>1</sup>Huazhong University of Science and Technology,
<sup>2</sup>Horizon Robotics

(✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>

<br>
<a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/hustvl/InfiniteVL/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>

</div>

## Introduction

**InfiniteVL** is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing **unlimited multimodal streams**.

By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception and **Gated DeltaNet** for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image1_new_01.png" width="800" alt="InfiniteVL Logo">
</div>

### ✨ Key Highlights

* 🚀 **High Efficiency:** Achieves a **>3.6×** inference speedup and a constant memory footprint compared with FlashAttention-2-accelerated Transformers.
* ⚡ **Real-Time Streaming:** Sustains a stable **24 FPS** prefill speed on a single **NVIDIA RTX 4090** for continuous video understanding.
* 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested on >500K tokens) without OOM errors.
* 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a broad range of benchmarks.

## News

* `Dec. 10th, 2025`: We release the **InfiniteVL** model weights and inference code! Please check the [Model Zoo](#model-zoo).
* `Dec. 10th, 2025`: We release our paper on [arXiv](https://arxiv.org/abs/2512.08829).

## Table of Contents

* [Introduction](#introduction)
* [Key Highlights](#key-highlights)
* [News](#news)
* [Architecture](#architecture)
* [Training Strategy](#training-strategy)
* [Performance](#performance)
* [Model Zoo](#model-zoo)
* [Getting Started](#getting-started)
* [Advanced Usage: CUDA Graph Acceleration](#advanced-usage-cuda-graph-acceleration)
* [Qualitative Analysis & Visualization](#qualitative-analysis--visualization)
* [Contact](#contact)
* [Citation](#citation)
* [Acknowledgement](#acknowledgement)

## Architecture

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/architecture.png" alt="InfiniteVL Architecture" width="50%">
</div>
<br>

**InfiniteVL** adopts a hybrid architecture that synergizes the efficiency of linear attention with the precision of window-based attention. The model comprises a **Vision Encoder** (adapted from Qwen2.5-VL), a **Projection MLP**, and a **Decoder-only LLM Backbone**.

### Key Design Highlights

* **Hybrid Block Design**: The LLM backbone consists of **9 Hybrid Blocks**. Within each block, we strategically interleave:
  * **1 Sliding Window Attention (SWA) Layer**: Responsible for capturing high-resolution local context and fine-grained visual details.
  * **3 Gated DeltaNet Layers**: Responsible for modeling long-range global dependencies with linear complexity.
* **Constant Memory Footprint**: Unlike traditional Transformers where the Key-Value (KV) cache grows linearly with sequence length ($O(N)$), the **Gated DeltaNet** layers compress history into a fixed-size memory state (e.g., $16 \times 128 \times 256$). This enables **constant memory usage** and constant inference latency, even when processing unlimited input streams.
* **Seamless Integration**: By combining SWA and Gated DeltaNet, InfiniteVL achieves the "best of both worlds":
  * Local attention ensures high performance on information-intensive tasks (e.g., OCR, Document Understanding).
  * Linear attention ensures efficiency and stability for long-context scenarios (e.g., Streaming Video Understanding).

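
The layer counts and state size quoted above reduce to simple arithmetic. The sketch below is illustrative only. It assumes that the SWA layer comes first in each block, that $16 \times 128 \times 256$ is the state of a single Gated DeltaNet layer, that elements are stored in bf16, and that the full-attention baseline has the hypothetical shape given by `FULL_ATTN_LAYERS`, `KV_HEADS`, and `HEAD_DIM`. None of these values are taken from the released code.

```python
# Figures quoted in this section.
NUM_HYBRID_BLOCKS = 9
SWA_PER_BLOCK = 1
GDN_PER_BLOCK = 3
STATE_SHAPE = (16, 128, 256)  # fixed Gated DeltaNet state (assumed to be per layer)

# Assumptions for illustration only (not stated in this README).
BYTES_PER_ELEMENT = 2         # bf16
FULL_ATTN_LAYERS = 36         # hypothetical full-attention baseline of equal depth
KV_HEADS, HEAD_DIM = 2, 128   # hypothetical KV shape of that baseline


def layer_layout():
    """Layer types in order, assuming SWA precedes the Gated DeltaNet layers."""
    block = ["swa"] * SWA_PER_BLOCK + ["gated_deltanet"] * GDN_PER_BLOCK
    return block * NUM_HYBRID_BLOCKS


def gdn_state_bytes():
    """Total recurrent state across all Gated DeltaNet layers (length-independent).

    The bounded KV cache of the SWA layers is not counted here.
    """
    elements = 1
    for dim in STATE_SHAPE:
        elements *= dim
    return elements * NUM_HYBRID_BLOCKS * GDN_PER_BLOCK * BYTES_PER_ELEMENT


def kv_cache_bytes(num_tokens):
    """KV cache of the hypothetical full-attention baseline; grows with num_tokens."""
    per_token = 2 * FULL_ATTN_LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return per_token * num_tokens


if __name__ == "__main__":
    print(len(layer_layout()), "layers in total")
    print(gdn_state_bytes() / 2**20, "MiB of Gated DeltaNet state at any length")
    for n in (8_192, 32_768, 500_000):
        print(n, "tokens:", round(kv_cache_bytes(n) / 2**20, 1), "MiB of baseline KV cache")
```

Under these assumptions the recurrent state stays the same size regardless of how many tokens have been processed, while the baseline cache scales in proportion to the token count.
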
## Training Strategy

To achieve strong multimodal performance with minimal training resources, InfiniteVL employs a **three-stage progressive training strategy**. This approach allows our linear-complexity model to inherit the vast knowledge of a Transformer teacher before adapting to long-context scenarios.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/training_strategy.png" alt="Training Pipeline" width="90%">
</div>

### Stage 1: Distillation Pretraining (Efficient Initialization)

* **Goal:** Rapidly transfer knowledge from the **Qwen2.5-VL** teacher to the InfiniteVL student.
* **Method:** We replace the teacher's attention layers with **Gated DeltaNet** while keeping other parameters frozen. We use **Layer-wise MSE Loss** (to align internal states) and **End-to-End KL Divergence** (to align output logits).
* **Significance:** This bypasses the difficulty of training linear attention from scratch, ensuring a robust initialization.

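
One way to write the Stage 1 objective described above is as a weighted sum of the two terms. This is a sketch, not the exact loss used in training: the weights $\lambda_{\text{mse}}$ and $\lambda_{\text{kl}}$, the averaging over the $L$ layers, and the direction of the KL term (teacher to student) are assumptions.

$$
\mathcal{L}_{\text{stage1}}
= \lambda_{\text{mse}} \cdot \frac{1}{L} \sum_{l=1}^{L} \mathrm{MSE}\!\left(h_l^{S},\, h_l^{T}\right)
+ \lambda_{\text{kl}} \cdot \mathrm{KL}\!\left(p^{T} \,\Vert\, p^{S}\right)
$$

Here $h_l^{S}$ and $h_l^{T}$ denote the student and teacher outputs at layer $l$, and $p^{S}$ and $p^{T}$ are the next-token distributions obtained from the student and teacher logits.
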
### Stage 2: Instruction SFT (General Capabilities)

* **Goal:** Unlock strong instruction-following and reasoning capabilities.
* **Data:** **~8M** diverse multimodal instruction pairs, covering General VQA, OCR, Mathematics, and Code.
* **Settings:** Image resolution increased to **1344×1344**; max context length set to 8,192.
* **Outcome:** Produces the **Stage 2 Model**, which offers the best performance on standard benchmarks.

### Stage 3: Long-Sequence SFT (Context Extension)

* **Goal:** Activate the architecture's potential for **unlimited-length processing** and streaming.
* **Data:** A mixture of Stage 2 data (800K) and **~200K long-sequence samples** (e.g., long videos, multi-page documents).
* **Method:** **LoRA** fine-tuning with context length extended to **32,768**.
* **Outcome:** Produces the **Stage 3 Model**, enabling length extrapolation and stable streaming inference.

## Performance

### 🚀 Efficiency & Streaming

**InfiniteVL** is engineered for unlimited-input scenarios. Unlike Transformer-based models, whose cost grows linearly with the history length, InfiniteVL maintains **constant** computational cost and memory usage.

> **Hardware Setup:** All efficiency results are measured on a single NVIDIA RTX 4090 GPU.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/plot_line.png" width="80%" alt="Efficiency Comparison">
<br>
<em>Figure 1: Comparison of streaming FPS and latency. InfiniteVL sustains real-time performance while Transformer baselines degrade rapidly.</em>
</div>

### 🏆 Multimodal Benchmarks

InfiniteVL achieves state-of-the-art performance among linear-complexity VLMs. Crucially, thanks to our **hybrid architecture** and **high-quality training strategy**, it overcomes the traditional weakness of linear models in information-intensive tasks (e.g., OCR, Document Understanding), achieving results comparable to top-tier Transformer VLMs.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance1.png" width="100%" alt="Performance Comparison">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance2.png" width="100%" alt="Performance Comparison">
<br>
<em>Figure 2: Comparison of InfiniteVL with existing VLMs on public multimodal understanding, real-world comprehension, text-rich, and reasoning-centric multimodal benchmarks.</em>
</div>
<br>

**Key Takeaways:**

* **Best-in-Class Linear Model:** Outperforms previous linear VLMs (Cobra, MaTVLM) by large margins (+40-60 points on DocVQA/OCRBench).
* **Transformer-Level Quality:** Matches the performance of Qwen2.5-VL-3B on complex reasoning and text-rich tasks while being significantly faster in long contexts.

## Model Zoo

We release two versions of InfiniteVL-4B to cater to different application scenarios.

| Model | Stage | Description | Training Context Length | Download |
| :--- | :---: | :--- | :---: | :---: |
| **InfiniteVL-4B** | **Stage 2** | **Best Generalist / Base.** The checkpoint directly after Instruction SFT. It delivers the **peak foundational performance** on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust knowledge. | 8K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL) |
| **InfiniteVL-4B-LongSFT** | **Stage 3** | **Long-Context Adapted.** Fine-tuned using only a **small amount** of long-sequence multimodal data. It successfully activates length generalization for streaming scenarios, though its full potential on extreme contexts is not yet fully exploited. | 32K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL-LongSFT) |

> **💡 Recommendations:**
>
> * **For Long-Context Inference:** Please use the **Stage 3** model. It enables stable streaming inference and avoids memory explosion.
> * **For Training / Fine-tuning:** We strongly recommend using the **Stage 2** model as your starting point. Since it maintains the strongest general capabilities and hasn't shifted towards the specific long-context distribution, it serves as the best foundation for adaptation to new tasks or domains.

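
The recommendations above can be captured in a small lookup. This is a convenience sketch: the repo IDs come from the download links in the table, while the use-case keys and the `pick_checkpoint` helper are hypothetical names introduced here.

```python
# Repo IDs are taken from the Model Zoo table; the keys are illustrative labels.
MODEL_ZOO = {
    "general": "hustvl/InfiniteVL",               # Stage 2: best on standard benchmarks
    "fine_tuning": "hustvl/InfiniteVL",           # Stage 2: recommended starting point
    "long_context": "hustvl/InfiniteVL-LongSFT",  # Stage 3: streaming / long inputs
}


def pick_checkpoint(use_case):
    """Return the Hugging Face repo ID recommended for a given use case."""
    try:
        return MODEL_ZOO[use_case]
    except KeyError:
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of {sorted(MODEL_ZOO)}"
        ) from None


if __name__ == "__main__":
    print(pick_checkpoint("long_context"))
```
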
## Getting Started

### 🛠️ Environment Setup

We recommend using **Anaconda** or **Miniconda** to manage the environment. The code is tested on **Python 3.11** + **PyTorch 2.6.0** + **CUDA 12.1**.

**1. Create and activate a virtual environment:**

```bash
conda create -n infinitevl python=3.11 -y
conda activate infinitevl
```

**2. Install dependencies:**

The core dependencies and their tested versions are listed below:

```text
# --- Core Deep Learning ---
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
transformers==4.57.0
accelerate==1.8.1
# --- Vision & Multimodal ---
qwen-vl-utils==0.0.11
decord==0.6.0
opencv-python==4.11.0.86
pillow==10.4.0
timm==1.0.22
einops==0.8.1
# --- Linear Attention & Kernels (Critical) ---
# Note: These often require specific CUDA environments to build
flash-attn==2.7.4.post1
flash-linear-attention==0.4.0
fla-core==0.4.0
causal-conv1d==1.5.0.post5
triton==3.2.0
```

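
Because the kernel packages are version-sensitive, it can help to compare the installed versions against the pins before loading the model. The script below is a minimal sketch using only the standard library. The `PINNED` subset and the helper names are chosen here for illustration, and a local build suffix such as `+cu124` is ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins listed above; extend as needed.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.57.0",
    "flash-attn": "2.7.4.post1",
    "flash-linear-attention": "0.4.0",
    "triton": "3.2.0",
}


def matches(installed, wanted):
    """True if the installed version equals the pin, ignoring a local '+...' suffix."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == wanted


def check_versions(pinned):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, matches(installed, wanted))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:28s} pinned={PINNED[name]:14s} installed={installed} [{status}]")
```
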
### Using 🤗 Transformers to Chat

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load Model
model_path = "hustvl/InfiniteVL"  # Replace with your HF repo ID
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare Inputs
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process Inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

<details>
<summary><strong>🖼️ Multi-Image Inference (Click to expand)</strong></summary>

InfiniteVL supports inputting multiple images in a single turn for comparison or storytelling.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What are the similarities between these two images?"},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

</details>

<details>
<summary><strong>🎥 Video Inference (Click to expand)</strong></summary>

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

</details>

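
The three examples above differ only in how `messages` is built. A small helper, sketched below, assembles the same structure for any mix of images and videos. The name `build_user_message` and its parameters are hypothetical and not part of the InfiniteVL or `qwen_vl_utils` API. The field names mirror the examples above.

```python
def build_user_message(prompt, images=(), videos=(), video_fps=1.0):
    """Build a single-turn `messages` list in the format used in the examples above.

    images / videos are URLs or file:// paths; the text prompt is placed last.
    """
    content = [{"type": "image", "image": src} for src in images]
    content += [{"type": "video", "video": src, "fps": video_fps} for src in videos]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_user_message(
        "What are the similarities between these two images?",
        images=["file:///path/to/a.jpg", "file:///path/to/b.jpg"],
    )
    print(messages)
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the snippets above.
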
## 🚀 Advanced Usage: CUDA Graph Acceleration

Unlike Transformer-based VLMs where the KV cache grows dynamically, **InfiniteVL maintains a constant-size memory state**. This unique property allows us to use **CUDA Graphs** to capture the entire computation graph for both streaming prefill and decoding, eliminating kernel launch overheads and maximizing GPU utilization.

This is the key technology behind our **24 FPS** real-time streaming performance.

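
To make the capture-and-replay idea concrete, the sketch below uses PyTorch's generic `torch.cuda.CUDAGraph` API on a toy layer. It is not the InfiniteVL streaming implementation: `make_step_runner` is a hypothetical helper, the `torch.nn.Linear` layer stands in for a fixed-shape model step, and the code falls back to eager execution when no GPU is available.

```python
import torch


def make_step_runner(step_fn, example_input):
    """Return a callable that runs step_fn on inputs shaped like example_input.

    On CUDA the step is captured once into a CUDA Graph and then replayed;
    on CPU the function is returned unchanged (eager fallback).
    """
    if not example_input.is_cuda:
        return step_fn

    static_in = example_input.clone()  # fixed buffer the graph reads from

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            step_fn(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_out = step_fn(static_in)  # recorded, with fixed shapes and addresses

    def run(x):
        static_in.copy_(x)         # write new data into the captured input buffer
        graph.replay()             # relaunch every captured kernel in one call
        return static_out.clone()  # copy out before the next replay overwrites it

    return run


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    layer = torch.nn.Linear(8, 4).to(device)
    runner = make_step_runner(layer, torch.zeros(2, 8, device=device))
    with torch.no_grad():
        print(runner(torch.randn(2, 8, device=device)).shape)
```

Capture requires every tensor in the step to keep the same shape and memory address between replays, which is why a fixed-size state is a natural fit and a growing KV cache is not.
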
### ⚡ Accelerated Streaming Inference

We provide a complete script in [`examples/demo_streaming_inference.py`](examples/demo_streaming_inference.py) to demonstrate CUDA Graph-accelerated streaming prefill.

> **🎥 Simulation Note:** This script **simulates a real-time streaming scenario** by reading a local video file frame-by-frame. It treats the video as a continuous data stream, updating the global linear memory state on-the-fly without retraining.
>
> **⚠️ Requirement:** This demo relies on the specialized model implementation (supporting `StaticCachePrealloc` and CUDA Graphs) located in the **[`infinitevl/infinitevl_streaming`](infinitevl/infinitevl_streaming)** directory. Please ensure your environment is set up correctly to import these modules.

#### 1. Run the Simulation Demo

```bash
# Make sure you are in the project root
python examples/demo_streaming_inference.py \
    --model_path /path/to/InfiniteVL-4B \
    --video_path assets/demo.mp4 \
    --fps 30
```

### ⚡ Accelerated Decode

In addition to streaming prefill, InfiniteVL natively supports **CUDA Graph-accelerated decoding**. By capturing the decoding step into a static graph, we can achieve extremely low-latency token generation, further enhancing the responsiveness of real-time interactions.

> 🚧 **Coming Soon:** The code for accelerated decoding is currently being refactored and cleaned up. We are working hard to release it as soon as possible. Please stay tuned!

## Qualitative Analysis & Visualization

We provide visualization cases to demonstrate InfiniteVL's robust performance across diverse scenarios, ranging from information-intensive static tasks to ultra-long streaming video understanding.

### 1. Fundamental Visual-Language Capabilities (OCR & Reasoning)

InfiniteVL effectively overcomes the traditional limitations of linear attention in detailed visual perception. By combining Sliding Window Attention with Gated DeltaNet, it excels at **Dense Text Recognition (OCR), Chart Interpretation, and Complex Scene Description**, delivering performance comparable to full-attention Transformers.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image_case1_01.png" width="80%" alt="Fundamental Capabilities">
</div>

### 2. Long-Term Streaming Understanding

The core strength of InfiniteVL lies in its ability to maintain coherent memory over **unlimited input streams**.

The examples below demonstrate a continuous street-view video stream. InfiniteVL maintains a constant memory state and accurately answers questions at various timestamps (e.g., Frame 3100, ~1M tokens processed), recalling specific details like "NBC Studios" text or the color of a pedestrian's bag without forgetting.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case1_01.png" width="80%" alt="Streaming Capabilities">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case2_01.png" width="80%" alt="Streaming Capabilities">
</div>

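
The "constant memory state" in the streaming cases can be illustrated with a toy version of the gated delta rule, the recurrence described in the Gated DeltaNet paper: $S_t = \alpha_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$. The pure-Python sketch below uses tiny dimensions and hand-picked gates. It is an assumption-laden illustration, not InfiniteVL's kernel, and `gated_delta_step` and `read` are hypothetical helper names.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One update S <- alpha * S (I - beta k k^T) + beta v k^T.

    S is a d_v x d_k matrix (list of rows); k has length d_k, v has length d_v.
    The returned matrix has the same shape as S.
    """
    Sk = [sum(row[j] * k[j] for j in range(len(k))) for row in S]  # S @ k
    return [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S, q):
    """Query the memory: o = S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


if __name__ == "__main__":
    d_k, d_v = 3, 2
    S = [[0.0] * d_k for _ in range(d_v)]
    for t in range(1000):  # a long "stream" of key/value pairs
        k = [0.0] * d_k
        k[t % d_k] = 1.0
        v = [float(t), float(-t)]
        S = gated_delta_step(S, k, v, alpha=0.99, beta=0.5)
    print("state shape after 1000 steps:", len(S), "x", len(S[0]))
```

However many steps are processed, the state keeps its $d_v \times d_k$ shape. With $\alpha = \beta = 1$ and a unit-norm key, reading back with the same key returns the value that was just written.
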
## Contact

If you have any questions, please contact Hongyuan Tao via email (hongyuantao@hust.edu.cn).

## Citation

If you find InfiniteVL useful for your research or applications, please consider citing our paper:

```bibtex
@article{tao2025infinitevl,
  title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
  author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint arXiv:2512.08829},
  year={2025}
}
```

## Acknowledgement

InfiniteVL stands on the shoulders of giants in the open-source community. We would like to express our gratitude to:

* **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
* **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
* **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.