Image-Text-to-Text
Transformers
Safetensors
nvidia
VLM
conversational
alejandrar committed · verified
Commit a0033df · 1 Parent(s): 9952e1c

Update README.md

Files changed (1)
  1. README.md +43 -284
README.md CHANGED
@@ -31,7 +31,9 @@ Nemotron Nano 12B V2 VL is a model for multi-modal document intelligence. It wou
31
 
32
  ### Release Date: <br>
33
  - Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
34
- - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16)
 
 
35
 
36
 
37
 
@@ -389,297 +391,54 @@ Additional processing for several datasets included rule-based QA generation (e.
389
  ** Image-based datasets were all scanned against known CSAM to ensure that no such content was included in training.<br>
390
 
391
  # Public Datasets <br>
392
- | Dataset Name | Type | Modalities | Number of Samples | Size |
393
- |--------------|------|------------|-------------------|------|
394
- | Captioning on Open Images (subset, relabeled) | VQA | image, text | 1'278'221 | 378.34 GB |
395
- | Localized Narratives (subset, relabeled) | VQA | image, text | 503'275 | 147.67 GB |
396
- | TextCaps (subset) | Image Captioning | image, text | 21'953 | 5.76 GB |
397
- | TextCaps (subset) | Image Captioning | image, text | 109'765 | 28.81 GB |
398
- | TextVQA (subset) | Image Captioning | image, text | 34'602 | 9.08 GB |
399
- | RefCoco | Referring Expression Grounding | image, text | 14'694 | 2.39 GB |
400
- | VQAv2 | VQA | image, text | 28'555 | 4.41 GB |
401
- | AOKVQA | VQA | image, text | 20'832 | 3.39 GB |
402
- | GQA | VQA | image, text | 21'433 | 2.94 GB |
403
- | AOKVQA | VQA | image, text | 16'131 | 2.62 GB |
404
- | synthdog-en | OCR | image, text | 29'672 | 2.31 GB |
405
- | WIT | Image Captioning | image, text | 538'916 | 745.24 GB |
406
- | CLEVR | Image Reasoning | image, text | 70'000 | 12.57 GB |
407
- | CLEVR-Math | Image Reasoning | image, text | 70'000 | 12.47 GB |
408
- | OpenAssistant (oasst1, oasst2) | Text Instruction Tuning | text | 47'118 | 0.09 GB |
409
- | VATEX | Video Captioning | video, text | 2'880 | 5.50 GB |
410
- | YouCook2 | Video Captioning | video, text | 36 | 0.17 GB |
411
- | VCG+ 112K | VideoQA | video, text | 164 | 2.82 GB |
412
- | Video Localized Narratives | Video Captioning | video, text | 373 | 0.64 GB |
413
- | CLEVRER | VQA | video, text | 40'000 | 46.05 GB |
414
- | NExT-QA | VideoQA | video, text | 10'368 | 57.06 GB |
415
- | CLEVRER | Video Reasoning | video, text | 42'620 | 49.10 GB |
416
- | ScreenQA | VQA | image, text | 302'004 | 30.52 GB |
417
- | WikiSQL | Image Reasoning | image, text | N/A | N/A |
418
- | WikiTableQuestions | TextQA | text | N/A | N/A |
419
- | RenderedText | OCR | image, text | N/A | N/A |
420
- | FinQA | Text Reasoning | text | N/A | N/A |
421
- | TAT-QA | Text Reasoning | text | N/A | N/A |
422
- | Databricks Dolly 15K | Text Instruction Tuning | text | N/A | N/A |
423
- | WebSight | Image Classification | image, text | N/A | N/A |
424
- | RAVEN | Image Reasoning | image, text | N/A | N/A |
425
- | VizWiz | VQA | image, text | N/A | N/A |
426
- | Inter-GPS | Image Reasoning | image, text | N/A | N/A |
427
- | OCR dataset from arXiv data | OCR | image, text | 120'000 | 49.99 GB |
428
- | OCR dataset from arXiv data | OCR | image, text | 599'927 | 249.93 GB |
429
- | OCR dataset from arXiv data | OCR | image, text | 1'565'011 | 1637.79 GB |
430
- | OCR dataset from arXiv data | OCR | image, text | 418'059 | 422.04 GB |
431
- | OCR dataset from arXiv data | OCR | image, text | 200'001 | 200.89 GB |
432
- | OCR dataset from arXiv data | OCR | image, text | 200'000 | 198.94 GB |
433
- | OCR dataset from arXiv data | OCR | image, text | 200'001 | 196.08 GB |
434
- | OCR dataset from arXiv data | OCR | image, text | 400'000 | 382.95 GB |
435
- | OCR dataset from arXiv data | OCR | image, text | 400'000 | 388.16 GB |
436
- | OCR dataset from arXiv data | OCR | image, text | 18'280 | 20.98 GB |
437
- | DocLayNet (curated) | OCR | image, text | 48'369 | 18.59 GB |
438
- | DocLayNet (curated & augmented) | OCR | image, text | 48'249 | 9.12 GB |
439
- | DocLayNet (curated & augmented) | OCR | image, text | 48'267 | 9.09 GB |
440
- | SynthTabNet | OCR | image, text | 200'000 | 9.70 GB |
441
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'309 | 17.00 GB |
442
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'461 | 7.77 GB |
443
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'462 | 7.99 GB |
444
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'236 | 5.84 GB |
445
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'232 | 5.92 GB |
446
- | SynthTables | OCR | image, text | 4'887 | 0.38 GB |
447
- | TabRecSet | OCR | image, text | 25'281 | 2.46 GB |
448
- | TabRecSet | OCR | image, text | 25'281 | 1.61 GB |
449
- | FinTabNet | OCR | image, text | 57'137 | 9.22 GB |
450
- | FinTabNet | OCR | image, text | 57'131 | 21.76 GB |
451
- | FinTabNet | OCR | image, text | 57'129 | 21.68 GB |
452
- | PubTables-1M | OCR | image, text | 224'170 | 29.55 GB |
453
- | PubTables-1M | OCR | image, text | 224'169 | 36.32 GB |
454
- | PubTables-1M | OCR | image, text | 225'108 | 36.45 GB |
455
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 37.13 GB |
456
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 33.38 GB |
457
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 32.85 GB |
458
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 31.15 GB |
459
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.30 GB |
460
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 38.40 GB |
461
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 27.09 GB |
462
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 29.52 GB |
463
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.49 GB |
464
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.14 GB |
465
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 100.14 GB |
466
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.82 GB |
467
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.96 GB |
468
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.61 GB |
469
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 89.89 GB |
470
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 95.75 GB |
471
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 85.65 GB |
472
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 91.01 GB |
473
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.29 GB |
474
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 84.66 GB |
475
- | TextOCR | OCR | image, text | 21'727 | 5.83 GB |
476
- | TextOCR | OCR | image, text | 21'138 | 2.83 GB |
477
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'359 | 12.92 GB |
478
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'351 | 14.57 GB |
479
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'350 | 14.44 GB |
480
- | HierText | OCR | image, text | 8'278 | 2.60 GB |
481
- | FUNSD | OCR | image, text | 149 | 0.01 GB |
482
- | Gretel Synthetic Safety Alignment | Safety | Text | 19'779 | 0.03 GB |
483
- | Internal safety alignment multimodal dataset | Safety | image, text | 22'559 | 8.27 GB |
484
- | ALFRED Action | Safety | video, text | 6'524 | 5.92 GB |
485
- | ALFRED Goal | Safety | video, text | 6'464 | 5.86 GB |
486
- | VQA-RAD | Safety | image, text | 1'793 | 0.09 GB |
487
- | SLAKE | Safety | image, text | 9'835 | 0.85 GB |
488
- | STEM MMLU-aux (subset) | Safety | text | 37'444 | 0.49 GB |
489
- | Glaive & Xlam | Function call | text | 8'000 | 0.02 GB |
490
- | Textbooks VQA | VQA | image, text | 46'745 | 10.85 GB |
491
- | ai2d | VQA | image, text | 12'413 | 2.23 GB |
492
- | ScienceQA | VQA | image, text | 12'716 | 0.39 GB |
493
- | ScienceQA from LlaVA-OneVision | VQA | image, text | 19'196 | 0.65 GB |
494
- | ChartQA | VQA | image, text | 15'121 | 0.68 GB |
495
- | ChartQA (augmented) | VQA | image, text | 15'050 | 0.65 GB |
496
- | ChartQA (CoT) | VQA | image, text | 23'571 | 1.04 GB |
497
- | ChartQA | VQA | image, text | 60'438 | 2.69 GB |
498
- | Geo170K | VQA | image, text | 13'263 | 0.07 GB |
499
- | InfographicVQA | VQA | image, text | 23'946 | 8.21 GB |
500
- | DocVQA | VQA | image, text | 39'463 | 26.29 GB |
501
- | DocVQA (CoT) | Image Reasoning | image, text | 16'881 | 10.65 GB |
502
- | ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 524'892 | 96.99 GB |
503
- | ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 227'776 | 42.52 GB |
504
- | TabMWP | Image Reasoning | image, text | 23'058 | 0.30 GB |
505
- | PMC-VQA | VQA | image, text | 2'266 | 0.04 GB |
506
- | OCR-VQA from The Cauldron | VQA | image, text | 165'746 | 5.79 GB |
507
- | ST-VQA from The Cauldron | VQA | image, text | 17'232 | 0.68 GB |
508
- | WebSight from The Cauldron | OCR | image, text | 9'809 | 1.84 GB |
509
- | EST-VQA | VQA | image, text | 17'043 | 4.25 GB |
510
- | TAL Handwritten English OCR | OCR | image, text | 9'998 | 0.22 GB |
511
- | TAL Handwritten Math writing | OCR | image, text | 22'244 | 0.33 GB |
512
- | SlideVQA | VQA | image, text | 5'773 | 0.42 GB |
513
- | pixmo-docs | VQA | image, text | 251'165 | 34.88 GB |
514
- | pixmo-cap | Image Captioning | image, text | 706'897 | 261.63 GB |
515
- | pixmo-cap-qa | VQA | image, text | 214'978 | 56.72 GB |
516
- | pixmo-ask-model-anything | Visual Instruction Tuning | image, text | 153'592 | 20.50 GB |
517
- | TallyQA | VQA | image, text | 68'775 | 10.64 GB |
518
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 490.37 GB |
519
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 488.17 GB |
520
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'128'326 | 324.46 GB |
521
- | TabMWP (CoT) | Image Reasoning | image, text | 20'305 | 0.28 GB |
522
- | VisualWebInstruct | Visual Instruction Tuning | image, text | 260'419 | 7.41 GB |
523
- | Internal collection of public text SFT datasets | Text Instruction Tuning | text | 197'938 | 1.04 GB |
524
- | ReCTS from ICDAR2019 | OCR | image, text | 20'000 | 1.77 GB |
525
- | RCTW from ICDAR2017 | OCR | image, text | 8'034 | 7.85 GB |
526
- | OCR equation heavy dataset from arXiv data | OCR | image, text | 2'000 | 0.03 GB |
527
- | Mulberry-SFT (CoT) | Image Reasoning | image, text | 191'332 | 30.80 GB |
528
- | LLaVA-CoT-100k (CoT) | Image Reasoning | image, text | 63'013 | 8.18 GB |
529
- | GeomVerse (CoT) | Image Reasoning | image, text | 9'298 | 0.90 GB |
530
- | MapQA (CoT) | Image Reasoning | image, text | 16'832 | 1.77 GB |
531
- | MetaMathQA (CoT) | Text Reasoning | text | 225'408 | 4.55 GB |
532
- | MetaMathQA (CoT) | Image Reasoning | image, text | 220'544 | 4.48 GB |
533
- | PlotQA (CoT) | Image Reasoning | image, text | 16'256 | 0.76 GB |
534
- | Visual7W Telling (CoT) | Image Reasoning | image, text | 62'592 | 3.21 GB |
535
- | Visual7W Pointing | VQA | image, text | 25'733 | 0.93 GB |
536
- | VisText | Image Captioning | image, text | 9'969 | 0.52 GB |
537
- | ScreenQA | VQA | image, text | 32'724 | 3.51 GB |
538
- | wave-ui-25k | OCR | image, text | 24'978 | 11.44 GB |
539
- | Charts2500 | VQA | image, text | 2'486 | 0.09 GB |
540
- | Cyrillic | OCR | image, text | 72'284 | 1.49 GB |
541
- | CMM-Math | Image Reasoning | image, text | 13'148 | 0.05 GB |
542
- | SimChart9K | Image Reasoning | image, text | 9'536 | 0.69 GB |
543
- | UniChart | Image Reasoning | image, text | 504'885 | 17.04 GB |
544
- | CASIA-HWDB2-line | OCR | image, text | 2'193 | 0.09 GB |
545
- | MMTab | VQA | image, text | 232'746 | 59.23 GB |
546
- | ArxivQA | VQA | image, text | 99'995 | 17.32 GB |
547
- | docmatix-single | VQA | image, text | 19'992 | 3.94 GB |
548
- | DocReason525K | Image Reasoning | image, text | 25'863 | 33.80 GB |
549
- | FigureQA | VQA | image, text | 100'000 | 2.37 GB |
550
- | LRV-Instruction | Visual Instruction Tuning | image, text | 7'198 | 0.37 GB |
551
- | VisualWebInstruct (CoT) | Image Reasoning | image, text | 48'929 | 4.37 GB |
552
- | DocMatix (multi-page) | Image Reasoning | image, text | 19'969 | 8.66 GB |
553
- | spot-the-diff | Image Reasoning | image, text | 8'007 | 1.45 GB |
554
- | DocVQA (CoT) | Image Reasoning | image, text | 36'333 | 24.32 GB |
555
- | DocVQA (CoT) | Image Reasoning | image, text | 45'710 | 2.10 GB |
556
- | DocVQA (CoT) | Image Reasoning | image, text | 19'548 | 6.70 GB |
557
- | Mulberry-SFT (subset, CoT) | Image Reasoning | image, text | 103'763 | 18.45 GB |
558
- | UniGeo (CoT) | Image Reasoning | image, text | 9'728 | 0.05 GB |
559
- | NIGHTS | Image Reasoning | image, text | 12'906 | 37.01 GB |
560
- | Mantis-Instruct (CoT) | Image Reasoning | image, text | 67'723 | 13.86 GB |
561
- | OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 2'858 | 1.23 GB |
562
- | OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 586 | 0.46 GB |
563
- | FinTabNet (relabeled) | Image Reasoning | image, text | 8'356 | 3.17 GB |
564
- | Table OCR on pdfs from CommonCrawl | Image Reasoning | image, text | 4'846 | 3.65 GB |
565
- | HierText (relabeled for QA) | Image Reasoning | image, text | 514 | 0.07 GB |
566
- | ECD-10k-Images | Image Reasoning | image, text | 132'613 | 15.38 GB |
567
- | ActivityNet (open-ended QA) | VideoQA | video, text | 6'490 | 162.22 GB |
568
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 5'496 | 11.07 GB |
569
- | NExT-QA (open-ended QA) | VideoQA | video, text | 5'492 | 10.99 GB |
570
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 52 | 0.74 GB |
571
- | NExT-QA (open-ended QA) | VideoQA | video, text | 61 | 0.85 GB |
572
- | NExT-QA (open-ended QA) | VideoQA | video, text | 6'843 | 27.83 GB |
573
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 6'843 | 27.85 GB |
574
- | ActivityNet (open-ended QA) | VideoQA | video, text | 7'420 | 102.81 GB |
575
- | ActivityNet (open-ended QA) | VideoQA | video, text | 3'840 | 25.84 GB |
576
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 4'633 | 35.38 GB |
577
- | NExT-QA (open-ended QA) | VideoQA | video, text | 4'694 | 35.84 GB |
578
- | ActivityNet (open-ended QA) | VideoQA | video, text | 2'580 | 7.46 GB |
579
- | Perception Test (multi-choice QA) | VideoQA | video, text | 1'785 | 18.67 GB |
580
- | Perception Test (multi-choice QA) | VideoQA | video, text | 618 | 11.52 GB |
581
- | NExT-QA | VideoQA | video, text | 34'132 | 150.86 GB |
582
- | CLEVRER | VideoQA | video, text | 40'000 | 46.03 GB |
583
- | Video dataset based on Kinetics | VideoQA | video, text | 39'452 | 26.15 GB |
584
- | EGO4D | VideoQA | video, text | 7'797 | 3.38 GB |
585
- | TVQA | VideoQA | video, text | 34'868 | 100.05 GB |
586
- | EgoExoLearn | VideoQA | video, text | 36'373 | 8558.27 GB |
587
- | Video dataset based on Kinetics | VideoQA | video, text | 647'883 | 890.56 GB |
588
- | Mementos | VideoQA | video, text | 4'060 | 14.07 GB |
589
- | Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
590
- | ActivityNet | VideoQA | video, text | 10'021 | 191.49 GB |
591
- | EGO4D | VideoQA | video, text | 1'506 | 137.00 GB |
592
- | FineAction | VideoQA | video, text | 7'504 | 169.76 GB |
593
- | HACS | VideoQA | video, text | 31'223 | 829.25 GB |
594
- | HiREST | VideoQA | video, text | 822 | 42.50 GB |
595
- | Perception Test | VideoQA | video, text | 2'135 | 25.98 GB |
596
- | ActivityNet | VideoQA | video, text | 9'064 | 181.24 GB |
597
- | HiREST | VideoQA | video, text | 525 | 27.54 GB |
598
- | YouCook2 | VideoQA | video, text | 1'180 | 77.65 GB |
599
- | DiDeMo | VideoQA | video, text | 7'452 | 33.90 GB |
600
- | EGO4D | VideoQA | video, text | 2'665 | 194.01 GB |
601
- | MedVidQA | VideoQA | video, text | 933 | 40.35 GB |
602
- | QuerYD | VideoQA | video, text | 1'562 | 50.69 GB |
603
- | YouCook2 | VideoQA | video, text | 2'270 | 158.77 GB |
604
- | EgoExoLearn (open-ended QA) | VideoQA | video, text | 9'998 | 1751.69 GB |
605
- | Breakfast Actions | VideoQA | video, text | 1'204 | 3.45 GB |
606
- | EgoExoLearn (multi-choice QA) | VideoQA | video, text | 6'832 | 1196.41 GB |
607
- | CrossTask (multi-choice QA) | VideoQA | video, text | 75'686 | 417.50 GB |
608
- | CrossTask (open-ended QA) | VideoQA | video, text | 20'399 | 112.02 GB |
609
- | EgoProceL (multi-choice QA) | VideoQA | video, text | 4'789 | 42.74 GB |
610
- | EgoProceL (open-ended QA) | VideoQA | video, text | 5'667 | 50.58 GB |
611
- | HC-STVG (multi-choice QA) | VideoQA | video, text | 147'799 | 796.18 GB |
612
- | HC-STVG (open-ended QA) | VideoQA | video, text | 41'050 | 221.82 GB |
613
- | TAPOS (multi-choice QA) | VideoQA | video, text | 33'941 | 218.50 GB |
614
- | TAPOS (open-ended QA) | VideoQA | video, text | 13'991 | 88.00 GB |
615
- | Multi-page OCR based on CommonCrawl pdf data | VQA | image, text | 7'262 | 48.19 GB |
616
- | Multi-page QA based on CommonCrawl pdf data | VQA | image, text | 455 | 31.88 GB |
617
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'281 | 0.68 GB |
618
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'285 | 0.67 GB |
619
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'282 | 0.67 GB |
620
- | Selection of public datasets (relabeled) | Image Reasoning | image, text | 13'843 | 4.18 GB |
621
- | Selection of public datasets (relabeled) | Image Reasoning | image, text | 18'442 | 3.89 GB |
622
- | Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
623
- | Perception Test (CoT) | VideoQA | video, text | 4'977 | 64.55 GB |
624
 
625
 
626
- <br>
627
-
628
  # Private Datasets <br>
629
- | Dataset Name | Type | Modalities | Number of Samples | Size |
630
- |--------------|------|------------|-------------------|------|
631
- | Internal safety alignment text dataset | Safety | Text | N/A | N/A |
632
- | Internal safety alignment text dataset | Safety | Text | N/A | N/A |
633
- | Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 445'958 | 9.01 GB |
634
- | Internal QA dataset on invoices | Image Reasoning | image, text | 6'471 | 5.22 GB |
635
- | Internal QA dataset on invoices | Image Reasoning | image, text | 11'258 | 10.19 GB |
636
- <br>
637
 
638
  # Data Crawling and Scraping <br>
639
- | Dataset Name | Type | Modalities | Number of Samples | Size |
640
- |--------------|------|------------|-------------------|------|
641
- | Internal video dataset | VideoQA | video, text | 274'472 | 348.84 GB |
642
- | Internal video dataset | VideoQA | video, text | 14'256 | 44.46 GB |
643
- | Internal VQA and captioning dataset | Image Captioning | image, text | 14'872 | 3.27 GB |
644
- | Internal VQA dataset | VQA | image, text | 20'250 | 1.87 GB |
645
- | Internal VQA dataset | VQA | image, text | 20'098 | 2.07 GB |
646
- | Internal Captioning dataset | Image Captioning | image, text | 24'998 | 6.97 GB |
647
- <br>
648
 
649
  # User-Sourced Data (Collected by Provider including Prompts) <br>
650
  <br>
651
 
652
  # Self-Sourced Synthetic Data <br>
653
- | Dataset Name | Type | Modalities | Number of Samples | Size |
654
- |--------------|------|------------|-------------------|------|
655
- | Random ASCII characters for OCR | OCR | image, text | 14'533 | 5.76 GB |
656
- | Random ASCII characters for OCR | OCR | image, text | 14'533 | 9.26 GB |
657
- | Random Chinese characters for OCR | OCR | image, text | 29'108 | 15.00 GB |
658
- | Random Chinese characters for OCR | OCR | image, text | 29'108 | 24.11 GB |
659
- | Random English characters for OCR | OCR | image, text | 14'525 | 5.65 GB |
660
- | Random English characters for OCR | OCR | image, text | 14'525 | 9.39 GB |
661
- | Synthetic sparse table dataset | OCR | image, text | 100'000 | 14.36 GB |
662
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 1'165'591 | 54.15 GB |
663
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 175'000 | 0.95 GB |
664
- | Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 1'922'012 | 28.00 GB |
665
- | Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 288'000 | 0.57 GB |
666
- | Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 67'000 | 0.22 GB |
667
- | Synthetic tool-calling data with seed tools from ToolBench, Glaive, xLAM and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 403'619 | 6.55 GB |
668
- | Synthetic safety data with responses from DeepSeek-R1-0528 | Text Reasoning | text | 30'710 | 0.12 GB |
669
- | Dummy conversation dataset | Text Reasoning | text | 2'262 | 0.00 GB |
670
- | Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 32'752 | 0.26 GB |
671
- | Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 3'636 | 0.01 GB |
672
- | Synthetic chat dataset with responses from DeepSeek-R1 | Text Reasoning | text | 389'350 | 3.30 GB |
673
- | Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 353'526 | 2.61 GB |
674
- | Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 361'733 | 1.12 GB |
675
- | Synthetic multilingual STEM from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct | Text Reasoning | text | 4'999'794 | 86.68 GB |
676
- | Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 545'844 | 5.25 GB |
677
- | Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 81'876 | 0.43 GB |
678
- | Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 1'591'641 | 58.63 GB |
679
- | Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 239'467 | 0.52 GB |
680
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Code | text | 1'165'591 | 54.15 GB |
681
- | Synthetic tool calling dataset from DeepSeek-R1-0528 | Text Reasoning | text | 74'044 | 46.43 GB |
682
- <br>
683
 
684
 
685
 
@@ -737,8 +496,8 @@ Evaluation benchmarks scores: <br>
737
 
738
 
739
  # Inference: <br>
740
- **Acceleration Engine:** [vLLM] <br>
741
- **Acceleration Engine:** [TRT-LLM] <br>
742
 
743
  **Test Hardware:** <br>
744
  * NVIDIA L40S <br>
 
31
 
32
  ### Release Date: <br>
33
  - Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
34
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16)
35
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8)
36
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD)
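
For readers who want to try one of the checkpoints listed above, the sketch below shows one way to load the BF16 variant through the generic Transformers image-text-to-text pipeline (matching this card's tags). It is a hedged example, not the card's official snippet: the pipeline task name, the `trust_remote_code` requirement, and the chat-message format are assumptions based on how comparable VLM checkpoints are typically used, so defer to the repository's own usage section.

```python
# Minimal sketch, assuming the checkpoint works with the generic
# image-text-to-text pipeline and ships its modeling code via trust_remote_code.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat-style multimodal prompt; the image URL is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/sample_invoice.png"},
        {"type": "text", "text": "Summarize this document."},
    ],
}]

result = pipe(text=messages, max_new_tokens=256)
print(result[0]["generated_text"])
```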
37
 
38
 
39
 
 
391
  ** Image-based datasets were all scanned against known CSAM to ensure that no such content was included in training.<br>
392
 
393
  # Public Datasets <br>
394
+ | Type | Modalities | Total Samples | Total Size (GB) |
395
+ |------|-----------|---------------|------------------|
396
+ | Function call | text | 8,000 | 0.02 |
397
+ | Image Captioning | image, text | 1,422,102 | 1,051.04 |
398
+ | Image Reasoning | image, text | 1,888,217 | 286.95 |
399
+ | OCR | image, text | 9,830,570 | 5,317.60 |
400
+ | Referring Expression Grounding | image, text | 14,694 | 2.39 |
401
+ | Safety | image, text | 34,187 | 9.21 |
402
+ | Safety | text | 57,223 | 0.52 |
403
+ | Safety | video, text | 12,988 | 11.78 |
404
+ | Text Instruction Tuning | text | 245,056 | 1.13 |
405
+ | Text Reasoning | text | 225,408 | 4.55 |
406
+ | VQA | image, text | 8,174,136 | 2,207.52 |
407
+ | VQA | video, text | 40,000 | 46.05 |
408
+ | Video Captioning | video, text | 3,289 | 6.31 |
409
+ | Video Reasoning | video, text | 42,620 | 49.10 |
410
+ | VideoQA | video, text | 1,371,923 | 17,641.79 |
411
+ | Visual Instruction Tuning | image, text | 1,173,877 | 167.79 |
412
+ | **TOTAL** | | **24,544,290** | **26,803.75** |
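
The per-type rows above summarize the long per-dataset listing that this commit removes. As a rough illustration of the aggregation (not the actual tooling used), the sketch below groups per-dataset rows by type and modality and sums their sample counts and sizes; the column names are hypothetical, and the two example rows are taken from the removed table.

```python
# Illustrative aggregation sketch; only two rows from the removed
# per-dataset table are shown.
import pandas as pd

rows = [
    # (type, modalities, samples, size_gb)
    ("VQA", "image, text", 1_278_221, 378.34),  # Captioning on Open Images (subset, relabeled)
    ("OCR", "image, text", 29_672, 2.31),       # synthdog-en
]
df = pd.DataFrame(rows, columns=["type", "modalities", "samples", "size_gb"])

summary = (
    df.groupby(["type", "modalities"], as_index=False)
      .agg(total_samples=("samples", "sum"), total_size_gb=("size_gb", "sum"))
)
summary.loc[len(summary)] = ["TOTAL", "", df["samples"].sum(), df["size_gb"].sum()]
print(summary.to_string(index=False))
```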
413
 
414
 
 
 
415
  # Private Datasets <br>
416
+ | Type | Modalities | Total Samples | Total Size (GB) |
417
+ |------|------------|---------------|------------------|
418
+ | Image Reasoning | image, text | 17,729 | 15.41 |
419
+ | Text Reasoning | text | 445,958 | 9.01 |
420
+ | **TOTAL** | | **463,687** | **24.42** |
421
+
 
 
422
 
423
  # Data Crawling and Scraping <br>
424
+ | Type | Modalities | Total Samples | Total Size (GB) |
425
+ |------|------------|---------------|------------------|
426
+ | Image Captioning | image, text | 39,870 | 10.24 |
427
+ | VQA | image, text | 40,348 | 3.94 |
428
+ | VideoQA | video, text | 288,728 | 393.30 |
429
+ | **TOTAL** | | **368,946** | **407.48** |
430
+
 
 
431
 
432
  # User-Sourced Data (Collected by Provider including Prompts) <br>
433
  <br>
434
 
435
  # Self-Sourced Synthetic Data <br>
436
+ | Type | Modalities | Total Samples | Total Size (GB) |
437
+ |------|-----------|---------------|------------------|
438
+ | Code | text | 1,165,591 | 54.15 |
439
+ | OCR | image, text | 216,332 | 83.53 |
440
+ | Text Reasoning | text | 12,727,857 | 295.80 |
441
+ | **TOTAL** | | **14,109,780** | **433.48** |
 
442
 
443
 
444
 
 
496
 
497
 
498
  # Inference: <br>
499
+ **Acceleration Engine:** vLLM <br>
500
+ **Acceleration Engine:** TRT-LLM <br>
501
 
502
  **Test Hardware:** <br>
503
  * NVIDIA L40S <br>
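
Since vLLM is listed above as an acceleration engine, a minimal offline-inference sketch follows. It is illustrative only: engine options such as `trust_remote_code` and the multimodal input format depend on the installed vLLM version, and a text-only prompt is used to keep the example short; see the vLLM deployment docs for passing image inputs.

```python
# Minimal sketch, assuming the installed vLLM build supports this checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    trust_remote_code=True,  # the checkpoint ships custom modeling code
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["List the key fields usually found on an invoice."], params)
print(outputs[0].outputs[0].text)
```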