Description:
VILA-HD-8B-PS3-4K-SigLIP is a Multi-modal Large Language Model that understands and answers questions about images of up to 4K resolution.
This model is for research and development only.
License/Terms of Use:
CC-BY-NC-SA-4.0
Deployment Geography:
Global
Use Case:
The model is used for extracting visual features from high-resolution images.
Release Date:
Hugging Face [05/30/2025] via [https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP]
Reference(s):
The model is from the paper Scaling Vision Pre-Training to 4K Resolution. Useful links:
Model Architecture:
Architecture Type: Neural Network
Network Architecture: Multi-modal Large Language Model designed for high-resolution images
This model was developed based on PS3-4K-SigLIP
Input:
Input Type(s): Image and text
Input Format: Red, Green, Blue (RGB) and strings
Input Parameters: 2D and 1D
Other Properties Related to Input: Image resolutions up to 3780*3780 and text input up to 12288 tokens
Output:
Output Type(s): Text
Output Format: Strings
Output Parameters: 1D
Other Properties Related to Output: Text output up to 12288 tokens
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
N/A
Supported Hardware Microarchitecture Compatibility:
NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
Preferred/Supported Operating System(s):
Linux
Linux 4 Tegra
QNX
Windows
Model Version(s):
v1.0 - Initial release
Pre-Trained Models
VILA-HD models
Vision Model | Max Resolution | Pre-Trained Weights |
---|---|---|
VILA-HD-8B-PS3-1.5K-SigLIP | 1512 * 1512 | nvidia/VILA-HD-8B-PS3-1.5K-SigLIP |
VILA-HD-8B-PS3-4K-SigLIP | 3780 * 3780 | nvidia/VILA-HD-8B-PS3-4K-SigLIP |
Training Datasets:
72 datasets. See Dataset Preparation for more details.
Dataset partition: Training 100%
Training Dataset:
Link: See Dataset Preparation for more details.
Data Collection Method by dataset:
[Hybrid: Automated, Human]
Labeling Method by dataset:
[Hybrid: Automated, Human]
Properties (Quantity, Dataset Descriptions, Sensor(s)):
72 datasets split into 5 stages (Projector Alignment, Vision Encoder Alignment, Pre-Training, Image Instruction-Tuning, and Patch Selection Tuning)
Performance
Inference:
Acceleration Engine: N/A
Test Hardware:
The model was tested on an NVIDIA A100 GPU.
Inference instructions:
First install VILA, following the instructions here.
Then install PS3, following the instructions in PS3 repo.
VILA-HD inference shares the same API as VILA (see here). Specifically, we provide `vila-infer` as a CLI tool for running inference with VILA-HD models. As an example:
vila-infer --model-path nvidia/VILA-HD-8B-PS3-4K-SigLIP --conv-mode auto --text "Where does the exit lead to?" --media assets/av_example_1.jpg
VILA-HD has several arguments controlling the total number of high-res patches to process, the number of high-res patches to process at each scale, the mode of patch selection, etc. These can be controlled by setting the following environment variables during inference:
- `NUM_LOOK_CLOSE`: How many times to run high-res encoding. Each time, PS3 encodes 2560 high-res patches. Can be set between 1 and 6 for the 1.5K model and between 1 and 35 for the 4K model. Default is 6 for both models.
- `NUM_TOKEN_LOOK_CLOSE`: How many high-res patches to encode. Provides more fine-grained control over the number of high-res patches than `NUM_LOOK_CLOSE`. Can be set between 1 and 14580 for the 1.5K model and between 1 and 87480 for the 4K model. Setting this overrides `NUM_LOOK_CLOSE`. Default is `None`.
- `SELECT_NUM_EACH_SCALE`: The number of high-res patches to encode at each high-res scale. For example, setting `SELECT_NUM_EACH_SCALE=512+2048` for the 1.5K model means that 512 and 2048 high-res patches are encoded at the 756 and 1512 scales, respectively. By default, the number of patches at each scale is proportional to the total number of patches at that scale, i.e., `512+2048` for the 1.5K model and `85+340+2125` for the 4K model.
- `LOOK_CLOSE_MODE`: The mode of patch selection. Can be set to `after_prompt` or `after_image`. `after_prompt` means the high-res patches are selected based on the text prompt. `after_image` means the high-res patches are selected based on image saliency. Default is `after_prompt`.
- `SMOOTH_SELECTION_PROB`: Whether to use smooth selection probability during high-res patch selection. Can be set to `true` or `false`. Default is `false`.
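The documented ranges can be checked before launching inference. The following is a minimal sketch and is not part of VILA or PS3: `build_vila_hd_env` and `LIMITS` are hypothetical names introduced here, the numeric limits are copied from the descriptions above, and the number of scales per model (2 for 1.5K, 3 for 4K) is inferred from the default values listed above.

```python
from typing import Dict, Optional, Sequence

# Limits copied from the variable descriptions above, keyed by model variant.
# "num_scales" is an assumption inferred from the defaults 512+2048 and 85+340+2125.
LIMITS = {
    "1.5K": {"max_look_close": 6, "max_tokens": 14580, "num_scales": 2},
    "4K": {"max_look_close": 35, "max_tokens": 87480, "num_scales": 3},
}


def build_vila_hd_env(
    variant: str,
    num_look_close: Optional[int] = None,
    num_token_look_close: Optional[int] = None,
    select_num_each_scale: Optional[Sequence[int]] = None,
    look_close_mode: Optional[str] = None,
    smooth_selection_prob: Optional[bool] = None,
) -> Dict[str, str]:
    """Hypothetical helper: validate settings and return them as env-var strings."""
    limits = LIMITS[variant]
    env: Dict[str, str] = {}

    if num_look_close is not None:
        if not 1 <= num_look_close <= limits["max_look_close"]:
            raise ValueError(f"NUM_LOOK_CLOSE out of range for the {variant} model")
        env["NUM_LOOK_CLOSE"] = str(num_look_close)

    if num_token_look_close is not None:
        if not 1 <= num_token_look_close <= limits["max_tokens"]:
            raise ValueError(f"NUM_TOKEN_LOOK_CLOSE out of range for the {variant} model")
        env["NUM_TOKEN_LOOK_CLOSE"] = str(num_token_look_close)

    if select_num_each_scale is not None:
        if len(select_num_each_scale) != limits["num_scales"]:
            raise ValueError(f"expected {limits['num_scales']} per-scale counts")
        # The variable uses '+' as the separator, e.g. 512+2048.
        env["SELECT_NUM_EACH_SCALE"] = "+".join(str(n) for n in select_num_each_scale)

    if look_close_mode is not None:
        if look_close_mode not in ("after_prompt", "after_image"):
            raise ValueError("LOOK_CLOSE_MODE must be after_prompt or after_image")
        env["LOOK_CLOSE_MODE"] = look_close_mode

    if smooth_selection_prob is not None:
        env["SMOOTH_SELECTION_PROB"] = "true" if smooth_selection_prob else "false"

    return env
```

The returned dictionary only contains the variables that were explicitly set, so unset options keep their documented defaults. The helper does not resolve the interaction between `NUM_LOOK_CLOSE` and `NUM_TOKEN_LOOK_CLOSE`; if both are passed, both are emitted and the override described above applies at inference time.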
For example, to make VILA-HD run high-res encoding 12 times for better accuracy, set `NUM_LOOK_CLOSE=12` when running inference:
NUM_LOOK_CLOSE=12 vila-infer --model-path nvidia/VILA-HD-8B-PS3-4K-SigLIP --conv-mode auto --text "Where does the exit lead to?" --media assets/av_example_1.jpg
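The same invocation can be driven from a Python script by passing the environment variables to a subprocess. This is a rough sketch under stated assumptions: `build_vila_infer_command` and `run_vila_infer` are hypothetical helper names, only the CLI flags shown in the commands above are used, and it is assumed (not confirmed by this card) that `vila-infer` writes the answer to standard output.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_vila_infer_command(model_path: str, text: str, media: str) -> List[str]:
    """Assemble the vila-infer argument list using only the flags shown above."""
    return [
        "vila-infer",
        "--model-path", model_path,
        "--conv-mode", "auto",
        "--text", text,
        "--media", media,
    ]


def run_vila_infer(
    model_path: str,
    text: str,
    media: str,
    extra_env: Optional[Dict[str, str]] = None,
) -> str:
    """Run vila-infer with extra environment variables layered on the current ones."""
    env = os.environ.copy()
    env.update(extra_env or {})
    result = subprocess.run(
        build_vila_infer_command(model_path, text, media),
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if vila-infer exits with an error
    )
    # Assumption: the generated answer is printed to stdout.
    return result.stdout


# Example call (requires VILA and PS3 to be installed):
# run_vila_infer(
#     "nvidia/VILA-HD-8B-PS3-4K-SigLIP",
#     "Where does the exit lead to?",
#     "assets/av_example_1.jpg",
#     extra_env={"NUM_LOOK_CLOSE": "12"},
# )
```

Passing the arguments as a list avoids shell quoting issues with prompts that contain spaces or question marks.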
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.
Citation
If you find this work useful in your research, please consider citing:
@article{shi2025scaling,
title={Scaling Vision Pre-Training to 4K Resolution},
author={Shi, Baifeng and Li, Boyi and Cai, Han and Lu, Yao and Liu, Sifei and Pavone, Marco and Kautz, Jan and Han, Song and Darrell, Trevor and Molchanov, Pavlo and others},
journal={arXiv preprint arXiv:2503.19903},
year={2025}
}