This model is for research and development only.

### License/Terms of Use: <br>
CC-BY-NC-SA-4.0

### Deployment Geography:
The model is from the paper [Scaling Vision Pre-Training to 4K Resolution](https://arxiv.org/abs/2503.19903).

**This model was developed based on [PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP)** <br>

### Input: <br>
**Input Type(s):** Image and Text <br>
**Input Format:** Red, Green, Blue (RGB) and strings <br>
**Input Parameters:** Two-Dimensional (2D) and One-Dimensional (1D) <br>
**Other Properties Related to Input:** Image resolutions up to 3780 * 3780 and text input up to 12288 tokens <br>
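As a sketch of the input constraint above, here is a small helper (hypothetical, not part of the VILA-HD preprocessing pipeline) that fits an arbitrary image size within the 3780 * 3780 limit while preserving aspect ratio:

```python
# Hypothetical helper illustrating the stated input limit only; the actual
# VILA-HD preprocessing may differ.
MAX_SIDE = 3780  # maximum supported side length, from the model card

def fit_resolution(width, height, max_side=MAX_SIDE):
    """Return (width, height) scaled down so neither side exceeds max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the supported resolution
    # Integer math avoids float rounding; floor keeps the result within the limit.
    return width * max_side // longest, height * max_side // longest

print(fit_resolution(8000, 6000))  # a 4:3 image larger than the limit
```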
### Output: <br>
**Output Type(s):** Text <br>
**Output Format:** Strings <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Text output up to 12288 tokens <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:
**Runtime Engine(s):**
Not Applicable (N/A) <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Hopper <br>
NVIDIA Jetson <br>

**Preferred/Supported Operating System(s):** <br>
Linux <br>
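The microarchitectures listed above correspond roughly to CUDA compute-capability majors. The mapping below is an assumption for illustration (Jetson modules span several architectures, e.g. Orin is Ampere-class), not part of this model card:

```python
# Assumed mapping from the listed microarchitectures to CUDA compute
# capability majors -- an illustration, not an official support matrix.
SUPPORTED_CC_MAJORS = {
    8: "Ampere",      # e.g. A100 (8.0), RTX 30-series (8.6), Jetson Orin (8.7)
    9: "Hopper",      # e.g. H100 (9.0)
    10: "Blackwell",  # e.g. B200 (10.0)
}

def is_supported(compute_capability):
    """Check a 'major.minor' compute-capability string against the list."""
    major = int(compute_capability.split(".")[0])
    return major in SUPPORTED_CC_MAJORS

print(is_supported("9.0"))  # Hopper
print(is_supported("7.5"))  # Turing, not in the supported list
```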

v1.0 - Initial release

| Model | Max Resolution | Link |
|-----------------|----------------|-------------------------------------------------------------------------|
| VILA-HD-8B-PS3-1.5K-SigLIP | 1512 * 1512 | [nvidia/VILA-HD-8B-PS3-1.5K-SigLIP](https://huggingface.co/nvidia/VILA-HD-8B-PS3-1.5K-SigLIP) |
| VILA-HD-8B-PS3-4K-SigLIP | 3780 * 3780 | [nvidia/VILA-HD-8B-PS3-4K-SigLIP](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP) |
| VILA-HD-8B-PS3-1.5K-C-RADIOv2 | 1536 * 1536 | [nvidia/VILA-HD-8B-PS3-1.5K-C-RADIOv2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-1.5K-C-RADIOv2) |
| VILA-HD-8B-PS3-4K-C-RADIOv2 | 3840 * 3840 | [nvidia/VILA-HD-8B-PS3-4K-C-RADIOv2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-C-RADIOv2) |
| VILA-HD-8B-PS3-1.5K-SigLIP2 | 1512 * 1512 | [nvidia/VILA-HD-8B-PS3-1.5K-SigLIP2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-1.5K-SigLIP2) |
| VILA-HD-8B-PS3-4K-SigLIP2 | 3780 * 3780 | [nvidia/VILA-HD-8B-PS3-4K-SigLIP2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP2) |

## Training Datasets: <br>

See [Dataset Preparation](https://arxiv.org/abs/2412.04468) for more details.

## Performance

![VILA-HD paper teaser](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP/resolve/main/asset/paper_teaser.png)

![Accuracy vs. inference cost](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP/resolve/main/asset/vila-hd/accuracy_vs_cost.png)
  journal={arXiv preprint arXiv:2503.19903},
  year={2025}
}
```