This model is for research and development only.

### License/Terms of Use: <br>
CC-BY-NC-SA-4.0

### Deployment Geography:
The model is from the paper [Scaling Vision Pre-Training to 4K Resolution](https://arxiv.org/abs/2503.19903).

**This model was developed based on [PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP)** <br>

### Input: <br>
**Input Type(s):** Image and Text <br>
**Input Format:** Red, Green, Blue (RGB) and strings <br>
**Input Parameters:** Two-Dimensional (2D) and One-Dimensional (1D) <br>
**Other Properties Related to Input:** Image resolutions up to 3780 * 3780 and text input up to 12288 tokens <br>
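As a sketch of the input constraint above, here is a small helper (hypothetical, not part of the VILA-HD preprocessing pipeline) that fits an arbitrary image size within the 3780 * 3780 limit while preserving aspect ratio:

```python
# Hypothetical helper illustrating the stated input limit only; the actual
# VILA-HD preprocessing may differ.
MAX_SIDE = 3780  # maximum supported side length, from the model card

def fit_resolution(width, height, max_side=MAX_SIDE):
    """Return (width, height) scaled down so neither side exceeds max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the supported resolution
    # Integer math avoids float rounding; floor keeps the result within the limit.
    return width * max_side // longest, height * max_side // longest

print(fit_resolution(8000, 6000))  # a 4:3 image larger than the limit
```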
### Output: <br>
**Output Type(s):** Text <br>
**Output Format:** Strings <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Text output up to 12288 tokens <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:
**Runtime Engine(s):**
Not Applicable (N/A) <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Hopper <br>
NVIDIA Jetson <br>

**Preferred/Supported Operating System(s):** <br>
Linux <br>
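The microarchitectures listed above correspond roughly to CUDA compute-capability majors. The mapping below is an assumption for illustration (Jetson modules span several architectures, e.g. Orin is Ampere-class), not part of this model card:

```python
# Assumed mapping from the listed microarchitectures to CUDA compute
# capability majors -- an illustration, not an official support matrix.
SUPPORTED_CC_MAJORS = {
    8: "Ampere",      # e.g. A100 (8.0), RTX 30-series (8.6), Jetson Orin (8.7)
    9: "Hopper",      # e.g. H100 (9.0)
    10: "Blackwell",  # e.g. B200 (10.0)
}

def is_supported(compute_capability):
    """Check a 'major.minor' compute-capability string against the list."""
    major = int(compute_capability.split(".")[0])
    return major in SUPPORTED_CC_MAJORS

print(is_supported("9.0"))  # Hopper
print(is_supported("7.5"))  # Turing, not in the supported list
```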

v1.0 - Initial release

| Model | Max Resolution | Link |
|-----------------|----------------|-------------------------------------------------------------------------|
| VILA-HD-8B-PS3-1.5K-SigLIP | 1512 * 1512 | [nvidia/VILA-HD-8B-PS3-1.5K-SigLIP](https://huggingface.co/nvidia/VILA-HD-8B-PS3-1.5K-SigLIP) |
| VILA-HD-8B-PS3-4K-SigLIP | 3780 * 3780 | [nvidia/VILA-HD-8B-PS3-4K-SigLIP](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP) |
| VILA-HD-8B-PS3-1.5K-C-RADIOv2 | 1536 * 1536 | [nvidia/VILA-HD-8B-PS3-1.5K-C-RADIOv2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-1.5K-C-RADIOv2) |
| VILA-HD-8B-PS3-4K-C-RADIOv2 | 3840 * 3840 | [nvidia/VILA-HD-8B-PS3-4K-C-RADIOv2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-C-RADIOv2) |
| VILA-HD-8B-PS3-1.5K-SigLIP2 | 1512 * 1512 | [nvidia/VILA-HD-8B-PS3-1.5K-SigLIP2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-1.5K-SigLIP2) |
| VILA-HD-8B-PS3-4K-SigLIP2 | 3780 * 3780 | [nvidia/VILA-HD-8B-PS3-4K-SigLIP2](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP2) |

## Training Datasets: <br>

See [Dataset Preparation](https://arxiv.org/abs/2412.04468) for more details.

## Performance

![VILA-HD paper teaser](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP/resolve/main/asset/paper_teaser.png)

![Accuracy vs. inference cost](https://huggingface.co/nvidia/VILA-HD-8B-PS3-4K-SigLIP/resolve/main/asset/vila-hd/accuracy_vs_cost.png)
  journal={arXiv preprint arXiv:2503.19903},
  year={2025}
}
```