Image-Text-to-Text
Transformers
Safetensors
nvidia
VLM
conversational
alejandrar committed · verified
Commit a0033df · 1 Parent(s): 9952e1c

Update README.md

Files changed (1)
  1. README.md +43 -284
README.md CHANGED
@@ -31,7 +31,9 @@ Nemotron Nano 12B V2 VL is a model for multi-modal document intelligence. It wou
31
 
32
  ### Release Date: <br>
33
  - Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
34
- - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16)
 
 
35
 
36
 
37
 
@@ -389,297 +391,54 @@ Additional processing for several datasets included rule-based QA generation (e.
389
  ** Image-based datasets were all scanned against known CSAM to ensure that no such content was included in training.<br>
390
 
391
  # Public Datasets <br>
392
- | Dataset Name | Type | Modalities | Number of Samples | Size |
393
- |--------------|------|------------|-------------------|------|
394
- | Captioning on Open Images (subset, relabeled) | VQA | image, text | 1'278'221 | 378.34 GB |
395
- | Localized Narratives (subset, relabeled) | VQA | image, text | 503'275 | 147.67 GB |
396
- | TextCaps (subset) | Image Captioning | image, text | 21'953 | 5.76 GB |
397
- | TextCaps (subset) | Image Captioning | image, text | 109'765 | 28.81 GB |
398
- | TextVQA (subset) | Image Captioning | image, text | 34'602 | 9.08 GB |
399
- | RefCoco | Referring Expression Grounding | image, text | 14'694 | 2.39 GB |
400
- | VQAv2 | VQA | image, text | 28'555 | 4.41 GB |
401
- | AOKVQA | VQA | image, text | 20'832 | 3.39 GB |
402
- | GQA | VQA | image, text | 21'433 | 2.94 GB |
403
- | AOKVQA | VQA | image, text | 16'131 | 2.62 GB |
404
- | synthdog-en | OCR | image, text | 29'672 | 2.31 GB |
405
- | WIT | Image Captioning | image, text | 538'916 | 745.24 GB |
406
- | CLEVR | Image Reasoning | image, text | 70'000 | 12.57 GB |
407
- | CLEVR-Math | Image Reasoning | image, text | 70'000 | 12.47 GB |
408
- | OpenAssistant (oasst1, oasst2) | Text Instruction Tuning | text | 47'118 | 0.09 GB |
409
- | VATEX | Video Captioning | video, text | 2'880 | 5.50 GB |
410
- | YouCook2 | Video Captioning | video, text | 36 | 0.17 GB |
411
- | VCG+ 112K | VideoQA | video, text | 164 | 2.82 GB |
412
- | Video Localized Narratives | Video Captioning | video, text | 373 | 0.64 GB |
413
- | CLEVRER | VQA | video, text | 40'000 | 46.05 GB |
414
- | NExT-QA | VideoQA | video, text | 10'368 | 57.06 GB |
415
- | CLEVRER | Video Reasoning | video, text | 42'620 | 49.10 GB |
416
- | ScreenQA | VQA | image, text | 302'004 | 30.52 GB |
417
- | WikiSQL | Image Reasoning | image, text | N/A | N/A |
418
- | WikiTableQuestions | TextQA | text | N/A | N/A |
419
- | RenderedText | OCR | image, text | N/A | N/A |
420
- | FinQA | Text Reasoning | text | N/A | N/A |
421
- | TAT-QA | Text Reasoning | text | N/A | N/A |
422
- | Databricks Dolly 15K | Text Instruction Tuning | text | N/A | N/A |
423
- | WebSight | Image Classification | image, text | N/A | N/A |
424
- | RAVEN | Image Reasoning | image, text | N/A | N/A |
425
- | VizWiz | VQA | image, text | N/A | N/A |
426
- | Inter-GPS | Image Reasoning | image, text | N/A | N/A |
427
- | OCR dataset from arXiv data | OCR | image, text | 120'000 | 49.99 GB |
428
- | OCR dataset from arXiv data | OCR | image, text | 599'927 | 249.93 GB |
429
- | OCR dataset from arXiv data | OCR | image, text | 1'565'011 | 1637.79 GB |
430
- | OCR dataset from arXiv data | OCR | image, text | 418'059 | 422.04 GB |
431
- | OCR dataset from arXiv data | OCR | image, text | 200'001 | 200.89 GB |
432
- | OCR dataset from arXiv data | OCR | image, text | 200'000 | 198.94 GB |
433
- | OCR dataset from arXiv data | OCR | image, text | 200'001 | 196.08 GB |
434
- | OCR dataset from arXiv data | OCR | image, text | 400'000 | 382.95 GB |
435
- | OCR dataset from arXiv data | OCR | image, text | 400'000 | 388.16 GB |
436
- | OCR dataset from arXiv data | OCR | image, text | 18'280 | 20.98 GB |
437
- | DocLayNet (curated) | OCR | image, text | 48'369 | 18.59 GB |
438
- | DocLayNet (curated & augmented) | OCR | image, text | 48'249 | 9.12 GB |
439
- | DocLayNet (curated & augmented) | OCR | image, text | 48'267 | 9.09 GB |
440
- | SynthTabNet | OCR | image, text | 200'000 | 9.70 GB |
441
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'309 | 17.00 GB |
442
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'461 | 7.77 GB |
443
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'462 | 7.99 GB |
444
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'236 | 5.84 GB |
445
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'232 | 5.92 GB |
446
- | SynthTables | OCR | image, text | 4'887 | 0.38 GB |
447
- | TabRecSet | OCR | image, text | 25'281 | 2.46 GB |
448
- | TabRecSet | OCR | image, text | 25'281 | 1.61 GB |
449
- | FinTabNet | OCR | image, text | 57'137 | 9.22 GB |
450
- | FinTabNet | OCR | image, text | 57'131 | 21.76 GB |
451
- | FinTabNet | OCR | image, text | 57'129 | 21.68 GB |
452
- | PubTables-1M | OCR | image, text | 224'170 | 29.55 GB |
453
- | PubTables-1M | OCR | image, text | 224'169 | 36.32 GB |
454
- | PubTables-1M | OCR | image, text | 225'108 | 36.45 GB |
455
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 37.13 GB |
456
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 33.38 GB |
457
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 32.85 GB |
458
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 31.15 GB |
459
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.30 GB |
460
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 38.40 GB |
461
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 27.09 GB |
462
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 29.52 GB |
463
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.49 GB |
464
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.14 GB |
465
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 100.14 GB |
466
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.82 GB |
467
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.96 GB |
468
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.61 GB |
469
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 89.89 GB |
470
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 95.75 GB |
471
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 85.65 GB |
472
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 91.01 GB |
473
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.29 GB |
474
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 84.66 GB |
475
- | TextOCR | OCR | image, text | 21'727 | 5.83 GB |
476
- | TextOCR | OCR | image, text | 21'138 | 2.83 GB |
477
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'359 | 12.92 GB |
478
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'351 | 14.57 GB |
479
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'350 | 14.44 GB |
480
- | HierText | OCR | image, text | 8'278 | 2.60 GB |
481
- | FUNSD | OCR | image, text | 149 | 0.01 GB |
482
- | Gretel Synthetic Safety Alignment | Safety | Text | 19'779 | 0.03 GB |
483
- | Internal safety alignment multimodal dataset | Safety | image, text | 22'559 | 8.27 GB |
484
- | ALFRED Action | Safety | video, text | 6'524 | 5.92 GB |
485
- | ALFRED Goal | Safety | video, text | 6'464 | 5.86 GB |
486
- | VQA-RAD | Safety | image, text | 1'793 | 0.09 GB |
487
- | SLAKE | Safety | image, text | 9'835 | 0.85 GB |
488
- | STEM MMLU-aux (subset) | Safety | text | 37'444 | 0.49 GB |
489
- | Glaive & Xlam | Function call | text | 8'000 | 0.02 GB |
490
- | Textbooks VQA | VQA | image, text | 46'745 | 10.85 GB |
491
- | ai2d | VQA | image, text | 12'413 | 2.23 GB |
492
- | ScienceQA | VQA | image, text | 12'716 | 0.39 GB |
493
- | ScienceQA from LlaVA-OneVision | VQA | image, text | 19'196 | 0.65 GB |
494
- | ChartQA | VQA | image, text | 15'121 | 0.68 GB |
495
- | ChartQA (augmented) | VQA | image, text | 15'050 | 0.65 GB |
496
- | ChartQA (CoT) | VQA | image, text | 23'571 | 1.04 GB |
497
- | ChartQA | VQA | image, text | 60'438 | 2.69 GB |
498
- | Geo170K | VQA | image, text | 13'263 | 0.07 GB |
499
- | InfographicVQA | VQA | image, text | 23'946 | 8.21 GB |
500
- | DocVQA | VQA | image, text | 39'463 | 26.29 GB |
501
- | DocVQA (CoT) | Image Reasoning | image, text | 16'881 | 10.65 GB |
502
- | ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 524'892 | 96.99 GB |
503
- | ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 227'776 | 42.52 GB |
504
- | TabMWP | Image Reasoning | image, text | 23'058 | 0.30 GB |
505
- | PMC-VQA | VQA | image, text | 2'266 | 0.04 GB |
506
- | OCR-VQA from The Cauldron | VQA | image, text | 165'746 | 5.79 GB |
507
- | ST-VQA from The Cauldron | VQA | image, text | 17'232 | 0.68 GB |
508
- | WebSight from The Cauldron | OCR | image, text | 9'809 | 1.84 GB |
509
- | EST-VQA | VQA | image, text | 17'043 | 4.25 GB |
510
- | TAL Handwritten English OCR | OCR | image, text | 9'998 | 0.22 GB |
511
- | TAL Handwritten Math writing | OCR | image, text | 22'244 | 0.33 GB |
512
- | SlideVQA | VQA | image, text | 5'773 | 0.42 GB |
513
- | pixmo-docs | VQA | image, text | 251'165 | 34.88 GB |
514
- | pixmo-cap | Image Captioning | image, text | 706'897 | 261.63 GB |
515
- | pixmo-cap-qa | VQA | image, text | 214'978 | 56.72 GB |
516
- | pixmo-ask-model-anything | Visual Instruction Tuning | image, text | 153'592 | 20.50 GB |
517
- | TallyQA | VQA | image, text | 68'775 | 10.64 GB |
518
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 490.37 GB |
519
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 488.17 GB |
520
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'128'326 | 324.46 GB |
521
- | TabMWP (CoT) | Image Reasoning | image, text | 20'305 | 0.28 GB |
522
- | VisualWebInstruct | Visual Instruction Tuning | image, text | 260'419 | 7.41 GB |
523
- | Internal collection of public text SFT datasets | Text Instruction Tuning | text | 197'938 | 1.04 GB |
524
- | ReCTS from ICDAR2019 | OCR | image, text | 20'000 | 1.77 GB |
525
- | RCTW from ICDAR2017 | OCR | image, text | 8'034 | 7.85 GB |
526
- | OCR equation heavy dataset from arXiv data | OCR | image, text | 2'000 | 0.03 GB |
527
- | Mulberry-SFT (CoT) | Image Reasoning | image, text | 191'332 | 30.80 GB |
528
- | LLaVA-CoT-100k (CoT) | Image Reasoning | image, text | 63'013 | 8.18 GB |
529
- | GeomVerse (CoT) | Image Reasoning | image, text | 9'298 | 0.90 GB |
530
- | MapQA (CoT) | Image Reasoning | image, text | 16'832 | 1.77 GB |
531
- | MetaMathQA (CoT) | Text Reasoning | text | 225'408 | 4.55 GB |
532
- | MetaMathQA (CoT) | Image Reasoning | image, text | 220'544 | 4.48 GB |
533
- | PlotQA (CoT) | Image Reasoning | image, text | 16'256 | 0.76 GB |
534
- | Visual7W Telling (CoT) | Image Reasoning | image, text | 62'592 | 3.21 GB |
535
- | Visual7W Pointing | VQA | image, text | 25'733 | 0.93 GB |
536
- | VisText | Image Captioning | image, text | 9'969 | 0.52 GB |
537
- | ScreenQA | VQA | image, text | 32'724 | 3.51 GB |
538
- | wave-ui-25k | OCR | image, text | 24'978 | 11.44 GB |
539
- | Charts2500 | VQA | image, text | 2'486 | 0.09 GB |
540
- | Cyrillic | OCR | image, text | 72'284 | 1.49 GB |
541
- | CMM-Math | Image Reasoning | image, text | 13'148 | 0.05 GB |
542
- | SimChart9K | Image Reasoning | image, text | 9'536 | 0.69 GB |
543
- | UniChart | Image Reasoning | image, text | 504'885 | 17.04 GB |
544
- | CASIA-HWDB2-line | OCR | image, text | 2'193 | 0.09 GB |
545
- | MMTab | VQA | image, text | 232'746 | 59.23 GB |
546
- | ArxivQA | VQA | image, text | 99'995 | 17.32 GB |
547
- | docmatix-single | VQA | image, text | 19'992 | 3.94 GB |
548
- | DocReason525K | Image Reasoning | image, text | 25'863 | 33.80 GB |
549
- | FigureQA | VQA | image, text | 100'000 | 2.37 GB |
550
- | LRV-Instruction | Visual Instruction Tuning | image, text | 7'198 | 0.37 GB |
551
- | VisualWebInstruct (CoT) | Image Reasoning | image, text | 48'929 | 4.37 GB |
552
- | DocMatix (multi-page) | Image Reasoning | image, text | 19'969 | 8.66 GB |
553
- | spot-the-diff | Image Reasoning | image, text | 8'007 | 1.45 GB |
554
- | DocVQA (CoT) | Image Reasoning | image, text | 36'333 | 24.32 GB |
555
- | DocVQA (CoT) | Image Reasoning | image, text | 45'710 | 2.10 GB |
556
- | DocVQA (CoT) | Image Reasoning | image, text | 19'548 | 6.70 GB |
557
- | Mulberry-SFT (subset, CoT) | Image Reasoning | image, text | 103'763 | 18.45 GB |
558
- | UniGeo (CoT) | Image Reasoning | image, text | 9'728 | 0.05 GB |
559
- | NIGHTS | Image Reasoning | image, text | 12'906 | 37.01 GB |
560
- | Mantis-Instruct (CoT) | Image Reasoning | image, text | 67'723 | 13.86 GB |
561
- | OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 2'858 | 1.23 GB |
562
- | OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 586 | 0.46 GB |
563
- | FinTabNet (relabeled) | Image Reasoning | image, text | 8'356 | 3.17 GB |
564
- | Table OCR on pdfs from CommonCrawl | Image Reasoning | image, text | 4'846 | 3.65 GB |
565
- | HierText (relabeled for QA) | Image Reasoning | image, text | 514 | 0.07 GB |
566
- | ECD-10k-Images | Image Reasoning | image, text | 132'613 | 15.38 GB |
567
- | ActivityNet (open-ended QA) | VideoQA | video, text | 6'490 | 162.22 GB |
568
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 5'496 | 11.07 GB |
569
- | NExT-QA (open-ended QA) | VideoQA | video, text | 5'492 | 10.99 GB |
570
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 52 | 0.74 GB |
571
- | NExT-QA (open-ended QA) | VideoQA | video, text | 61 | 0.85 GB |
572
- | NExT-QA (open-ended QA) | VideoQA | video, text | 6'843 | 27.83 GB |
573
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 6'843 | 27.85 GB |
574
- | ActivityNet (open-ended QA) | VideoQA | video, text | 7'420 | 102.81 GB |
575
- | ActivityNet (open-ended QA) | VideoQA | video, text | 3'840 | 25.84 GB |
576
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 4'633 | 35.38 GB |
577
- | NExT-QA (open-ended QA) | VideoQA | video, text | 4'694 | 35.84 GB |
578
- | ActivityNet (open-ended QA) | VideoQA | video, text | 2'580 | 7.46 GB |
579
- | Perception Test (multi-choice QA) | VideoQA | video, text | 1'785 | 18.67 GB |
580
- | Perception Test (multi-choice QA) | VideoQA | video, text | 618 | 11.52 GB |
581
- | NExT-QA | VideoQA | video, text | 34'132 | 150.86 GB |
582
- | CLEVRER | VideoQA | video, text | 40'000 | 46.03 GB |
583
- | Video dataset based on Kinetics | VideoQA | video, text | 39'452 | 26.15 GB |
584
- | EGO4D | VideoQA | video, text | 7'797 | 3.38 GB |
585
- | TVQA | VideoQA | video, text | 34'868 | 100.05 GB |
586
- | EgoExoLearn | VideoQA | video, text | 36'373 | 8558.27 GB |
587
- | Video dataset based on Kinetics | VideoQA | video, text | 647'883 | 890.56 GB |
588
- | Mementos | VideoQA | video, text | 4'060 | 14.07 GB |
589
- | Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
590
- | ActivityNet | VideoQA | video, text | 10'021 | 191.49 GB |
591
- | EGO4D | VideoQA | video, text | 1'506 | 137.00 GB |
592
- | FineAction | VideoQA | video, text | 7'504 | 169.76 GB |
593
- | HACS | VideoQA | video, text | 31'223 | 829.25 GB |
594
- | HiREST | VideoQA | video, text | 822 | 42.50 GB |
595
- | Perception Test | VideoQA | video, text | 2'135 | 25.98 GB |
596
- | ActivityNet | VideoQA | video, text | 9'064 | 181.24 GB |
597
- | HiREST | VideoQA | video, text | 525 | 27.54 GB |
598
- | YouCook2 | VideoQA | video, text | 1'180 | 77.65 GB |
599
- | DiDeMo | VideoQA | video, text | 7'452 | 33.90 GB |
600
- | EGO4D | VideoQA | video, text | 2'665 | 194.01 GB |
601
- | MedVidQA | VideoQA | video, text | 933 | 40.35 GB |
602
- | QuerYD | VideoQA | video, text | 1'562 | 50.69 GB |
603
- | YouCook2 | VideoQA | video, text | 2'270 | 158.77 GB |
604
- | EgoExoLearn (open-ended QA) | VideoQA | video, text | 9'998 | 1751.69 GB |
605
- | Breakfast Actions | VideoQA | video, text | 1'204 | 3.45 GB |
606
- | EgoExoLearn (multi-choice QA) | VideoQA | video, text | 6'832 | 1196.41 GB |
607
- | CrossTask (multi-choice QA) | VideoQA | video, text | 75'686 | 417.50 GB |
608
- | CrossTask (open-ended QA) | VideoQA | video, text | 20'399 | 112.02 GB |
609
- | EgoProceL (multi-choice QA) | VideoQA | video, text | 4'789 | 42.74 GB |
610
- | EgoProceL (open-ended QA) | VideoQA | video, text | 5'667 | 50.58 GB |
611
- | HC-STVG (multi-choice QA) | VideoQA | video, text | 147'799 | 796.18 GB |
612
- | HC-STVG (open-ended QA) | VideoQA | video, text | 41'050 | 221.82 GB |
613
- | TAPOS (multi-choice QA) | VideoQA | video, text | 33'941 | 218.50 GB |
614
- | TAPOS (open-ended QA) | VideoQA | video, text | 13'991 | 88.00 GB |
615
- | Multi-page OCR based on CommonCrawl pdf data | VQA | image, text | 7'262 | 48.19 GB |
616
- | Multi-page QA based on CommonCrawl pdf data | VQA | image, text | 455 | 31.88 GB |
617
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'281 | 0.68 GB |
618
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'285 | 0.67 GB |
619
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'282 | 0.67 GB |
620
- | Selection of public datasets (relabeled) | Image Reasoning | image, text | 13'843 | 4.18 GB |
621
- | Selection of public datasets (relabeled) | Image Reasoning | image, text | 18'442 | 3.89 GB |
622
- | Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
623
- | Perception Test (CoT) | VideoQA | video, text | 4'977 | 64.55 GB |
624
 
625
 
626
- <br>
627
-
628
  # Private Datasets <br>
629
- | Dataset Name | Type | Modalities | Number of Samples | Size |
630
- |--------------|------|------------|-------------------|------|
631
- | Internal safety alignment text dataset | Safety | Text | N/A | N/A |
632
- | Internal safety alignment text dataset | Safety | Text | N/A | N/A |
633
- | Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 445'958 | 9.01 GB |
634
- | Internal QA dataset on invoices | Image Reasoning | image, text | 6'471 | 5.22 GB |
635
- | Internal QA dataset on invoices | Image Reasoning | image, text | 11'258 | 10.19 GB |
636
- <br>
637
 
638
  # Data Crawling and Scraping <br>
639
- | Dataset Name | Type | Modalities | Number of Samples | Size |
640
- |--------------|------|------------|-------------------|------|
641
- | Internal video dataset | VideoQA | video, text | 274'472 | 348.84 GB |
642
- | Internal video dataset | VideoQA | video, text | 14'256 | 44.46 GB |
643
- | Internal VQA and captioning dataset | Image Captioning | image, text | 14'872 | 3.27 GB |
644
- | Internal VQA dataset | VQA | image, text | 20'250 | 1.87 GB |
645
- | Internal VQA dataset | VQA | image, text | 20'098 | 2.07 GB |
646
- | Internal Captioning dataset | Image Captioning | image, text | 24'998 | 6.97 GB |
647
- <br>
648
 
649
  # User-Sourced Data (Collected by Provider including Prompts) <br>
650
  <br>
651
 
652
  # Self-Sourced Synthetic Data <br>
653
- | Dataset Name | Type | Modalities | Number of Samples | Size |
654
- |--------------|------|------------|-------------------|------|
655
- | Random ASCII characters for OCR | OCR | image, text | 14'533 | 5.76 GB |
656
- | Random ASCII characters for OCR | OCR | image, text | 14'533 | 9.26 GB |
657
- | Random Chinese characters for OCR | OCR | image, text | 29'108 | 15.00 GB |
658
- | Random Chinese characters for OCR | OCR | image, text | 29'108 | 24.11 GB |
659
- | Random English characters for OCR | OCR | image, text | 14'525 | 5.65 GB |
660
- | Random English characters for OCR | OCR | image, text | 14'525 | 9.39 GB |
661
- | Synthetic sparse table dataset | OCR | image, text | 100'000 | 14.36 GB |
662
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 1'165'591 | 54.15 GB |
663
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 175'000 | 0.95 GB |
664
- | Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 1'922'012 | 28.00 GB |
665
- | Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 288'000 | 0.57 GB |
666
- | Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 67'000 | 0.22 GB |
667
- | Synthetic tool-calling data with seed tools from ToolBench, Glaive, xLAM and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 403'619 | 6.55 GB |
668
- | Synthetic safety data with responses from DeepSeek-R1-0528 | Text Reasoning | text | 30'710 | 0.12 GB |
669
- | Dummy conversation dataset | Text Reasoning | text | 2'262 | 0.00 GB |
670
- | Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 32'752 | 0.26 GB |
671
- | Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 3'636 | 0.01 GB |
672
- | Synthetic chat dataset with responses from DeepSeek-R1 | Text Reasoning | text | 389'350 | 3.30 GB |
673
- | Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 353'526 | 2.61 GB |
674
- | Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 361'733 | 1.12 GB |
675
- | Synthetic multilingual STEM from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct | Text Reasoning | text | 4'999'794 | 86.68 GB |
676
- | Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 545'844 | 5.25 GB |
677
- | Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 81'876 | 0.43 GB |
678
- | Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 1'591'641 | 58.63 GB |
679
- | Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 239'467 | 0.52 GB |
680
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Code | text | 1'165'591 | 54.15 GB |
681
- | Synthetic tool calling dataset from DeepSeek-R1-0528 | Text Reasoning | text | 74'044 | 46.43 GB |
682
- <br>
683
 
684
 
685
 
@@ -737,8 +496,8 @@ Evaluation benchmarks scores: <br>
737
 
738
 
739
  # Inference: <br>
740
- **Acceleration Engine:** [vLLM] <br>
741
- **Acceleration Engine:** [TRT-LLM] <br>
742
 
743
  **Test Hardware:** <br>
744
  * NVIDIA L40S <br>
 
31
 
32
  ### Release Date: <br>
33
  - Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
34
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16)
35
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8)
36
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD)
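
For readers who want to try one of the checkpoints listed above, the sketch below shows one way to load the BF16 variant through the generic Transformers image-text-to-text pipeline (matching this card's tags). It is a hedged example, not the card's official snippet: the pipeline task name, the `trust_remote_code` requirement, and the chat-message format are assumptions based on how comparable VLM checkpoints are typically used, so defer to the repository's own usage section.

```python
# Minimal sketch, assuming the checkpoint works with the generic
# image-text-to-text pipeline and ships its modeling code via trust_remote_code.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat-style multimodal prompt; the image URL is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/sample_invoice.png"},
        {"type": "text", "text": "Summarize this document."},
    ],
}]

result = pipe(text=messages, max_new_tokens=256)
print(result[0]["generated_text"])
```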
37
 
38
 
39
 
 
391
  ** Image-based datasets were all scanned against known CSAM to ensure that no such content was included in training.<br>
392
 
393
  # Public Datasets <br>
394
+ | Type | Modalities | Total Samples | Total Size (GB) |
395
+ |------|-----------|---------------|------------------|
396
+ | Function call | text | 8,000 | 0.02 |
397
+ | Image Captioning | image, text | 1,422,102 | 1,051.04 |
398
+ | Image Reasoning | image, text | 1,888,217 | 286.95 |
399
+ | OCR | image, text | 9,830,570 | 5,317.60 |
400
+ | Referring Expression Grounding | image, text | 14,694 | 2.39 |
401
+ | Safety | image, text | 34,187 | 9.21 |
402
+ | Safety | text | 57,223 | 0.52 |
403
+ | Safety | video, text | 12,988 | 11.78 |
404
+ | Text Instruction Tuning | text | 245,056 | 1.13 |
405
+ | Text Reasoning | text | 225,408 | 4.55 |
406
+ | VQA | image, text | 8,174,136 | 2,207.52 |
407
+ | VQA | video, text | 40,000 | 46.05 |
408
+ | Video Captioning | video, text | 3,289 | 6.31 |
409
+ | Video Reasoning | video, text | 42,620 | 49.10 |
410
+ | VideoQA | video, text | 1,371,923 | 17,641.79 |
411
+ | Visual Instruction Tuning | image, text | 1,173,877 | 167.79 |
412
+ | **TOTAL** | | **24,544,290** | **26,803.75** |
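
The per-type rows above summarize the long per-dataset listing that this commit removes. As a rough illustration of the aggregation (not the actual tooling used), the sketch below groups per-dataset rows by type and modality and sums their sample counts and sizes; the column names are hypothetical, and the two example rows are taken from the removed table.

```python
# Illustrative aggregation sketch; only two rows from the removed
# per-dataset table are shown.
import pandas as pd

rows = [
    # (type, modalities, samples, size_gb)
    ("VQA", "image, text", 1_278_221, 378.34),  # Captioning on Open Images (subset, relabeled)
    ("OCR", "image, text", 29_672, 2.31),       # synthdog-en
]
df = pd.DataFrame(rows, columns=["type", "modalities", "samples", "size_gb"])

summary = (
    df.groupby(["type", "modalities"], as_index=False)
      .agg(total_samples=("samples", "sum"), total_size_gb=("size_gb", "sum"))
)
summary.loc[len(summary)] = ["TOTAL", "", df["samples"].sum(), df["size_gb"].sum()]
print(summary.to_string(index=False))
```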
413
 
414
 
 
 
415
  # Private Datasets <br>
416
+ | Type | Modalities | Total Samples | Total Size (GB) |
417
+ |------|------------|---------------|------------------|
418
+ | Image Reasoning | image, text | 17,729 | 15.41 |
419
+ | Text Reasoning | text | 445,958 | 9.01 |
420
+ | **TOTAL** | | **463,687** | **24.42** |
421
+
 
 
422
 
423
  # Data Crawling and Scraping <br>
424
+ | Type | Modalities | Total Samples | Total Size (GB) |
425
+ |------|------------|---------------|------------------|
426
+ | Image Captioning | image, text | 39,870 | 10.24 |
427
+ | VQA | image, text | 40,348 | 3.94 |
428
+ | VideoQA | video, text | 288,728 | 393.30 |
429
+ | **TOTAL** | | **368,946** | **407.48** |
430
+
 
 
431
 
432
  # User-Sourced Data (Collected by Provider including Prompts) <br>
433
  <br>
434
 
435
  # Self-Sourced Synthetic Data <br>
436
+ | Type | Modalities | Total Samples | Total Size (GB) |
437
+ |------|-----------|---------------|------------------|
438
+ | Code | text | 1,165,591 | 54.15 |
439
+ | OCR | image, text | 216,332 | 83.53 |
440
+ | Text Reasoning | text | 12,727,857 | 295.80 |
441
+ | **TOTAL** | | **14,109,780** | **433.48** |
 
442
 
443
 
444
 
 
496
 
497
 
498
  # Inference: <br>
499
+ **Acceleration Engine:** vLLM <br>
500
+ **Acceleration Engine:** TRT-LLM <br>
501
 
502
  **Test Hardware:** <br>
503
  * NVIDIA L40S <br>
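
Since vLLM is listed above as an acceleration engine, a minimal offline-inference sketch follows. It is illustrative only: engine options such as `trust_remote_code` and the multimodal input format depend on the installed vLLM version, and a text-only prompt is used to keep the example short; see the vLLM deployment docs for passing image inputs.

```python
# Minimal sketch, assuming the installed vLLM build supports this checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    trust_remote_code=True,  # the checkpoint ships custom modeling code
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["List the key fields usually found on an invoice."], params)
print(outputs[0].outputs[0].text)
```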