nvidia/VILA-HD-8B-PS3-4K-SigLIP

Description:

VILA-HD-8B-PS3-4K-SigLIP is a Multi-modal Large Language Model that understands and answers questions about images of up to 4K resolution.

This model is for research and development only.

License/Terms of Use:

CC-BY-NC-SA-4.0

Deployment Geography:

Global

Use Case:

The model is used for understanding and answering questions about high-resolution images.

Release Date:

Hugging Face 05/30/2025 via https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP

Reference(s):

The model is from the paper Scaling Vision Pre-Training to 4K Resolution (arXiv:2503.19903).

Model Architecture:

Architecture Type: Neural Network

Network Architecture: Multi-modal Large Language Model designed for high-resolution images

This model was developed based on PS3-4K-SigLIP.

Input:

Input Type(s): Image and text
Input Format: Red, Green, Blue (RGB) and strings
Input Parameters: 2D and 1D
Other Properties Related to Input: Image resolutions up to 3780*3780 and text input up to 12288 tokens

Output:

Output Type(s): Text
Output Format: Strings
Output Parameters: 1D
Other Properties Related to Output: Text output up to 12288 tokens

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s): N/A

Supported Hardware Microarchitecture Compatibility:
NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper

Preferred/Supported Operating System(s):
Linux
Linux 4 Tegra
QNX
Windows

Model Version(s):

v1.0 - Initial release

Pre-Trained Models

VILA-HD models

Vision Model	Max Resolution	Pre-Trained Weights
VILA-HD-8B-PS3-1.5K-SigLIP	1512 * 1512	nvidia/VILA-HD-8B-PS3-1.5K-SigLIP
VILA-HD-8B-PS3-4K-SigLIP	3780 * 3780	nvidia/VILA-HD-8B-PS3-4K-SigLIP

Training Datasets:

72 datasets. See Dataset Preparation for more details.

Dataset partition: Training 100%

Training Dataset:

Link: See Dataset Preparation for more details.

Data Collection Method by dataset:
[Hybrid: Automated, Human]

Labeling Method by dataset:
[Hybrid: Automated, Human]

Properties (Quantity, Dataset Descriptions, Sensor(s)):
72 datasets split into 5 stages (Projector Alignment, Vision Encoder Alignment, Pre-Training, Image Instruction-Tuning, and Patch Selection Tuning)

Performance

Inference:

Acceleration Engine: N/A
Test Hardware:
The model was tested on an NVIDIA A100 GPU.

Inference instructions:

First install VILA, following the instructions here.

Then install PS3, following the instructions in the PS3 repo.

VILA-HD inference shares the same API as VILA (see here). Specifically, we provide vila-infer as a CLI tool for running inference with VILA-HD models. For example:

vila-infer --model-path nvidia/VILA-HD-8B-PS3-4K-SigLIP --conv-mode auto --text "Where does the exit lead to?" --media assets/av_example_1.jpg

VILA-HD has several arguments controlling the total number of high-res patches to process, the number of high-res patches to process at each scale, the mode of patch selection, etc. These can be controlled by setting the following environment variables during inference:

NUM_LOOK_CLOSE: How many times to run high-res encoding. Each time, PS3 encodes 2560 high-res patches. Can be set between 1 and 6 for the 1.5K model and between 1 and 35 for the 4K model. Default is 6 for both models.
NUM_TOKEN_LOOK_CLOSE: How many high-res patches to encode. Provides finer-grained control over the number of high-res patches than NUM_LOOK_CLOSE. Can be set between 1 and 14580 for the 1.5K model and between 1 and 87480 for the 4K model. Setting this overrides NUM_LOOK_CLOSE. Default is None.
SELECT_NUM_EACH_SCALE: The number of high-res patches to encode at each high-res scale. For example, setting SELECT_NUM_EACH_SCALE=512+2048 for the 1.5K model means that 512 and 2048 high-res patches are encoded at the 756 and 1512 scales, respectively. By default, the number of patches at each scale is proportional to the total number of patches at that scale, i.e., 512+2048 for the 1.5K model and 85+340+2125 for the 4K model.
LOOK_CLOSE_MODE: The mode of patch selection. Can be set to after_prompt or after_image. after_prompt means the high-res patches are selected based on the text prompt; after_image means they are selected based on image saliency. Default is after_prompt.
SMOOTH_SELECTION_PROB: Whether to use smooth selection probability during high-res patch selection. Can be set to 'true' or 'false'. Default is false.
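
The patch counts quoted above can be sanity-checked with a little arithmetic. The sketch below assumes 14-pixel square patches and high-res scales of 756, 1512 and (for the 4K model only) 3780 pixels. Neither the patch size nor the third scale is stated in this card; they are chosen because they reproduce the documented NUM_TOKEN_LOOK_CLOSE limits of 14580 and 87480. The helper names are hypothetical and are not part of VILA or PS3.

```python
from functools import reduce
from math import gcd

# Assumption: 14-pixel square patches (not stated in this model card).
PATCH_SIZE = 14

# Assumption: high-res scales per model. 756 and 1512 appear in the text;
# 3780 is taken from the maximum resolution of the 4K model.
SCALES = {"1.5K": (756, 1512), "4K": (756, 1512, 3780)}


def patches_per_scale(scales, patch_size=PATCH_SIZE):
    """Number of patches in a (side x side) image, for each scale."""
    return [(side // patch_size) ** 2 for side in scales]


def parse_select_num(value):
    """Parse a SELECT_NUM_EACH_SCALE string such as '512+2048'."""
    return [int(part) for part in value.split("+")]


def reduce_ratio(counts):
    """Divide all counts by their greatest common divisor."""
    divisor = reduce(gcd, counts)
    return [count // divisor for count in counts]


for name, scales in SCALES.items():
    counts = patches_per_scale(scales)
    print(name, counts, "total:", sum(counts), "ratio:", reduce_ratio(counts))
# 1.5K [2916, 11664] total: 14580 ratio: [1, 4]
# 4K [2916, 11664, 72900] total: 87480 ratio: [1, 4, 25]
```

Under these assumptions, the documented defaults follow the same ratios: `reduce_ratio(parse_select_num("512+2048"))` gives `[1, 4]` and `reduce_ratio(parse_select_num("85+340+2125"))` gives `[1, 4, 25]`, which matches the statement that the per-scale budget is proportional to the total number of patches at each scale.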

For example, to have VILA-HD run high-res encoding 12 times for better accuracy, set NUM_LOOK_CLOSE=12 when running inference:

NUM_LOOK_CLOSE=12 vila-infer --model-path nvidia/VILA-HD-8B-PS3-4K-SigLIP --conv-mode auto --text "Where does the exit lead to?" --media assets/av_example_1.jpg
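
When inference is driven from a script rather than a shell, the same settings can be passed through the process environment. The following is a minimal sketch, assuming vila-infer is installed and on the PATH; it uses only the flags and environment variables shown above. The function names (`build_env`, `build_command`, `run_vila_infer`) are hypothetical wrappers for illustration, not part of the VILA API.

```python
import os
import subprocess

# Upper bounds for NUM_LOOK_CLOSE as documented in this card.
MAX_LOOK_CLOSE = {"1.5K": 6, "4K": 35}
LOOK_CLOSE_MODES = ("after_prompt", "after_image")


def build_env(num_look_close, model="4K", mode="after_prompt", base_env=None):
    """Return a copy of the environment with the VILA-HD variables set."""
    upper = MAX_LOOK_CLOSE[model]
    if not 1 <= num_look_close <= upper:
        raise ValueError(f"NUM_LOOK_CLOSE must be in [1, {upper}] for the {model} model")
    if mode not in LOOK_CLOSE_MODES:
        raise ValueError(f"LOOK_CLOSE_MODE must be one of {LOOK_CLOSE_MODES}")
    env = dict(os.environ if base_env is None else base_env)
    env["NUM_LOOK_CLOSE"] = str(num_look_close)  # environment values are strings
    env["LOOK_CLOSE_MODE"] = mode
    return env


def build_command(model_path, text, media):
    """Assemble the vila-infer argument list used in the examples above."""
    return [
        "vila-infer",
        "--model-path", model_path,
        "--conv-mode", "auto",
        "--text", text,
        "--media", media,
    ]


def run_vila_infer(model_path, text, media, **env_options):
    """Run vila-infer with the given settings; raises if the command fails."""
    return subprocess.run(
        build_command(model_path, text, media),
        env=build_env(**env_options),
        check=True,
    )
```

A call such as `run_vila_infer("nvidia/VILA-HD-8B-PS3-4K-SigLIP", "Where does the exit lead to?", "assets/av_example_1.jpg", num_look_close=12)` would then be equivalent to the shell command above. Validating the range before launching avoids loading the model only to find that the setting is out of bounds.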

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.

Citation

If you find this work useful in your research, please consider citing:

@article{shi2025scaling,
  title={Scaling Vision Pre-Training to 4K Resolution},
  author={Shi, Baifeng and Li, Boyi and Cai, Han and Lu, Yao and Liu, Sifei and Pavone, Marco and Kautz, Jan and Han, Song and Darrell, Trevor and Molchanov, Pavlo and others},
  journal={arXiv preprint arXiv:2503.19903},
  year={2025}
}
