Update README.md
README.md
CHANGED
@@ -31,7 +31,9 @@ Nemotron Nano 12B V2 VL is a model for multi-modal document intelligence. It wou

### Release Date: <br>
- Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
-- Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://huggingface.co/nvidia/
+- Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-BF16](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16)
+- Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8)
+- Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD)

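The BF16, FP8, and NVFP4 checkpoints added above can be pulled like any other Hugging Face repository. A minimal sketch, assuming the repo id from the BF16 link and a standard `trust_remote_code` loading path; the exact processor classes and prompt format come from the model card, not from this diff:

```python
# Sketch only: repo id taken from the BF16 release link above; loading details are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "nvidia/Nemotron-Nano-12B-v2-VL-BF16"  # from the release list; canonical id may differ

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # BF16 variant
    trust_remote_code=True,      # the repo ships custom vision-language code
    device_map="auto",           # requires accelerate; included for convenience
)
model.eval()
```
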
@@ -389,297 +391,54 @@ Additional processing for several datasets included rule-based QA generation (e.

** Image based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>

# Public Datasets <br>
-| VCG+ 112K | VideoQA | video, text | 164 | 2.82 GB |
-| Video Localized Narratives | Video Captioning | video, text | 373 | 0.64 GB |
-| CLEVRER | VQA | video, text | 40'000 | 46.05 GB |
-| NExT-QA | VideoQA | video, text | 10'368 | 57.06 GB |
-| CLEVRER | Video Reasoning | video, text | 42'620 | 49.10 GB |
-| ScreenQA | VQA | image, text | 302'004 | 30.52 GB |
-| WikiSQL | Image Reasoning | image, text | N/A | N/A |
-| WikiTableQuestions | TextQA | text | N/A | N/A |
-| RenderedText | OCR | image, text | N/A | N/A |
-| FinQA | Text Reasoning | text | N/A | N/A |
-| TAT-QA | Text Reasoning | text | N/A | N/A |
-| Databricks Dolly 15K | Text Instruction Tuning | text | N/A | N/A |
-| WebSight | Image Classification | image, text | N/A | N/A |
-| RAVEN | Image Reasoning | image, text | N/A | N/A |
-| VizWiz | VQA | image, text | N/A | N/A |
-| Inter-GPS | Image Reasoning | image, text | N/A | N/A |
-| OCR dataset from arXiv data | OCR | image, text | 120'000 | 49.99 GB |
-| OCR dataset from arXiv data | OCR | image, text | 599'927 | 249.93 GB |
-| OCR dataset from arXiv data | OCR | image, text | 1'565'011 | 1637.79 GB |
-| OCR dataset from arXiv data | OCR | image, text | 418'059 | 422.04 GB |
-| OCR dataset from arXiv data | OCR | image, text | 200'001 | 200.89 GB |
-| OCR dataset from arXiv data | OCR | image, text | 200'000 | 198.94 GB |
-| OCR dataset from arXiv data | OCR | image, text | 200'001 | 196.08 GB |
-| OCR dataset from arXiv data | OCR | image, text | 400'000 | 382.95 GB |
-| OCR dataset from arXiv data | OCR | image, text | 400'000 | 388.16 GB |
-| OCR dataset from arXiv data | OCR | image, text | 18'280 | 20.98 GB |
-| DocLayNet (curated) | OCR | image, text | 48'369 | 18.59 GB |
-| DocLayNet (curated & augmented) | OCR | image, text | 48'249 | 9.12 GB |
-| DocLayNet (curated & augmented) | OCR | image, text | 48'267 | 9.09 GB |
-| SynthTabNet | OCR | image, text | 200'000 | 9.70 GB |
-| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'309 | 17.00 GB |
-| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'461 | 7.77 GB |
-| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'462 | 7.99 GB |
-| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'236 | 5.84 GB |
-| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'232 | 5.92 GB |
-| SynthTables | OCR | image, text | 4'887 | 0.38 GB |
-| TabRecSet | OCR | image, text | 25'281 | 2.46 GB |
-| TabRecSet | OCR | image, text | 25'281 | 1.61 GB |
-| FinTabNet | OCR | image, text | 57'137 | 9.22 GB |
-| FinTabNet | OCR | image, text | 57'131 | 21.76 GB |
-| FinTabNet | OCR | image, text | 57'129 | 21.68 GB |
-| PubTables-1M | OCR | image, text | 224'170 | 29.55 GB |
-| PubTables-1M | OCR | image, text | 224'169 | 36.32 GB |
-| PubTables-1M | OCR | image, text | 225'108 | 36.45 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 37.13 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 33.38 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 32.85 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 31.15 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.30 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 38.40 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 27.09 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 29.52 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.49 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.14 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 100.14 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.82 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.96 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.61 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 89.89 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 95.75 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 85.65 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 91.01 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.29 GB |
-| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 84.66 GB |
-| TextOCR | OCR | image, text | 21'727 | 5.83 GB |
-| TextOCR | OCR | image, text | 21'138 | 2.83 GB |
-| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'359 | 12.92 GB |
-| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'351 | 14.57 GB |
-| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'350 | 14.44 GB |
-| HierText | OCR | image, text | 8'278 | 2.60 GB |
-| FUNSD | OCR | image, text | 149 | 0.01 GB |
-| Gretel Synthetic Safety Alignment | Safety | Text | 19'779 | 0.03 GB |
-| Internal safety alignment multimodal dataset | Safety | image, text | 22'559 | 8.27 GB |
-| ALFRED Action | Safety | video, text | 6'524 | 5.92 GB |
-| ALFRED Goal | Safety | video, text | 6'464 | 5.86 GB |
-| VQA-RAD | Safety | image, text | 1'793 | 0.09 GB |
-| SLAKE | Safety | image, text | 9'835 | 0.85 GB |
-| STEM MMLU-aux (subset) | Safety | text | 37'444 | 0.49 GB |
-| Glaive & Xlam | Function call | text | 8'000 | 0.02 GB |
-| Textbooks VQA | VQA | image, text | 46'745 | 10.85 GB |
-| ai2d | VQA | image, text | 12'413 | 2.23 GB |
-| ScienceQA | VQA | image, text | 12'716 | 0.39 GB |
-| ScienceQA from LlaVA-OneVision | VQA | image, text | 19'196 | 0.65 GB |
-| ChartQA | VQA | image, text | 15'121 | 0.68 GB |
-| ChartQA (augmented) | VQA | image, text | 15'050 | 0.65 GB |
-| ChartQA (CoT) | VQA | image, text | 23'571 | 1.04 GB |
-| ChartQA | VQA | image, text | 60'438 | 2.69 GB |
-| Geo170K | VQA | image, text | 13'263 | 0.07 GB |
-| InfographicVQA | VQA | image, text | 23'946 | 8.21 GB |
-| DocVQA | VQA | image, text | 39'463 | 26.29 GB |
-| DocVQA (CoT) | Image Reasoning | image, text | 16'881 | 10.65 GB |
-| ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 524'892 | 96.99 GB |
-| ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 227'776 | 42.52 GB |
-| TabMWP | Image Reasoning | image, text | 23'058 | 0.30 GB |
-| PMC-VQA | VQA | image, text | 2'266 | 0.04 GB |
-| OCR-VQA from The Cauldron | VQA | image, text | 165'746 | 5.79 GB |
-| ST-VQA from The Cauldron | VQA | image, text | 17'232 | 0.68 GB |
-| WebSight from The Cauldron | OCR | image, text | 9'809 | 1.84 GB |
-| EST-VQA | VQA | image, text | 17'043 | 4.25 GB |
-| TAL Handwritten English OCR | OCR | image, text | 9'998 | 0.22 GB |
-| TAL Handwritten Math writing | OCR | image, text | 22'244 | 0.33 GB |
-| SlideVQA | VQA | image, text | 5'773 | 0.42 GB |
-| pixmo-docs | VQA | image, text | 251'165 | 34.88 GB |
-| pixmo-cap | Image Captioning | image, text | 706'897 | 261.63 GB |
-| pixmo-cap-qa | VQA | image, text | 214'978 | 56.72 GB |
-| pixmo-ask-model-anything | Visual Instruction Tuning | image, text | 153'592 | 20.50 GB |
-| TallyQA | VQA | image, text | 68'775 | 10.64 GB |
-| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 490.37 GB |
-| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 488.17 GB |
-| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'128'326 | 324.46 GB |
-| TabMWP (CoT) | Image Reasoning | image, text | 20'305 | 0.28 GB |
-| VisualWebInstruct | Visual Instruction Tuning | image, text | 260'419 | 7.41 GB |
-| Internal collection of public text SFT datasets | Text Instruction Tuning | text | 197'938 | 1.04 GB |
-| ReCTS from ICDAR2019 | OCR | image, text | 20'000 | 1.77 GB |
-| RCTW from ICDAR2017 | OCR | image, text | 8'034 | 7.85 GB |
-| OCR equation heavy dataset from arXiv data | OCR | image, text | 2'000 | 0.03 GB |
-| Mulberry-SFT (CoT) | Image Reasoning | image, text | 191'332 | 30.80 GB |
-| LLaVA-CoT-100k (CoT) | Image Reasoning | image, text | 63'013 | 8.18 GB |
-| GeomVerse (CoT) | Image Reasoning | image, text | 9'298 | 0.90 GB |
-| MapQA (CoT) | Image Reasoning | image, text | 16'832 | 1.77 GB |
-| MetaMathQA (CoT) | Text Reasoning | text | 225'408 | 4.55 GB |
-| MetaMathQA (CoT) | Image Reasoning | image, text | 220'544 | 4.48 GB |
-| PlotQA (CoT) | Image Reasoning | image, text | 16'256 | 0.76 GB |
-| Visual7W Telling (CoT) | Image Reasoning | image, text | 62'592 | 3.21 GB |
-| Visual7W Pointing | VQA | image, text | 25'733 | 0.93 GB |
-| VisText | Image Captioning | image, text | 9'969 | 0.52 GB |
-| ScreenQA | VQA | image, text | 32'724 | 3.51 GB |
-| wave-ui-25k | OCR | image, text | 24'978 | 11.44 GB |
-| Charts2500 | VQA | image, text | 2'486 | 0.09 GB |
-| Cyrillic | OCR | image, text | 72'284 | 1.49 GB |
-| CMM-Math | Image Reasoning | image, text | 13'148 | 0.05 GB |
-| SimChart9K | Image Reasoning | image, text | 9'536 | 0.69 GB |
-| UniChart | Image Reasoning | image, text | 504'885 | 17.04 GB |
-| CASIA-HWDB2-line | OCR | image, text | 2'193 | 0.09 GB |
-| MMTab | VQA | image, text | 232'746 | 59.23 GB |
-| ArxivQA | VQA | image, text | 99'995 | 17.32 GB |
-| docmatix-single | VQA | image, text | 19'992 | 3.94 GB |
-| DocReason525K | Image Reasoning | image, text | 25'863 | 33.80 GB |
-| FigureQA | VQA | image, text | 100'000 | 2.37 GB |
-| LRV-Instruction | Visual Instruction Tuning | image, text | 7'198 | 0.37 GB |
-| VisualWebInstruct (CoT) | Image Reasoning | image, text | 48'929 | 4.37 GB |
-| DocMatix (multi-page) | Image Reasoning | image, text | 19'969 | 8.66 GB |
-| spot-the-diff | Image Reasoning | image, text | 8'007 | 1.45 GB |
-| DocVQA (CoT) | Image Reasoning | image, text | 36'333 | 24.32 GB |
-| DocVQA (CoT) | Image Reasoning | image, text | 45'710 | 2.10 GB |
-| DocVQA (CoT) | Image Reasoning | image, text | 19'548 | 6.70 GB |
-| Mulberry-SFT (subset, CoT) | Image Reasoning | image, text | 103'763 | 18.45 GB |
-| UniGeo (CoT) | Image Reasoning | image, text | 9'728 | 0.05 GB |
-| NIGHTS | Image Reasoning | image, text | 12'906 | 37.01 GB |
-| Mantis-Instruct (CoT) | Image Reasoning | image, text | 67'723 | 13.86 GB |
-| OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 2'858 | 1.23 GB |
-| OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 586 | 0.46 GB |
-| FinTabNet (relabeled) | Image Reasoning | image, text | 8'356 | 3.17 GB |
-| Table OCR on pdfs from CommonCrawl | Image Reasoning | image, text | 4'846 | 3.65 GB |
-| HierText (relabeled for QA) | Image Reasoning | image, text | 514 | 0.07 GB |
-| ECD-10k-Images | Image Reasoning | image, text | 132'613 | 15.38 GB |
-| ActivityNet (open-ended QA) | VideoQA | video, text | 6'490 | 162.22 GB |
-| NExT-QA (multi-choice QA) | VideoQA | video, text | 5'496 | 11.07 GB |
-| NExT-QA (open-ended QA) | VideoQA | video, text | 5'492 | 10.99 GB |
-| NExT-QA (multi-choice QA) | VideoQA | video, text | 52 | 0.74 GB |
-| NExT-QA (open-ended QA) | VideoQA | video, text | 61 | 0.85 GB |
-| NExT-QA (open-ended QA) | VideoQA | video, text | 6'843 | 27.83 GB |
-| NExT-QA (multi-choice QA) | VideoQA | video, text | 6'843 | 27.85 GB |
-| ActivityNet (open-ended QA) | VideoQA | video, text | 7'420 | 102.81 GB |
-| ActivityNet (open-ended QA) | VideoQA | video, text | 3'840 | 25.84 GB |
-| NExT-QA (multi-choice QA) | VideoQA | video, text | 4'633 | 35.38 GB |
-| NExT-QA (open-ended QA) | VideoQA | video, text | 4'694 | 35.84 GB |
-| ActivityNet (open-ended QA) | VideoQA | video, text | 2'580 | 7.46 GB |
-| Perception Test (multi-choice QA) | VideoQA | video, text | 1'785 | 18.67 GB |
-| Perception Test (multi-choice QA) | VideoQA | video, text | 618 | 11.52 GB |
-| NExT-QA | VideoQA | video, text | 34'132 | 150.86 GB |
-| CLEVRER | VideoQA | video, text | 40'000 | 46.03 GB |
-| Video dataset based on Kinetics | VideoQA | video, text | 39'452 | 26.15 GB |
-| EGO4D | VideoQA | video, text | 7'797 | 3.38 GB |
-| TVQA | VideoQA | video, text | 34'868 | 100.05 GB |
-| EgoExoLearn | VideoQA | video, text | 36'373 | 8558.27 GB |
-| Video dataset based on Kinetics | VideoQA | video, text | 647'883 | 890.56 GB |
-| Mementos | VideoQA | video, text | 4'060 | 14.07 GB |
-| Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
-| ActivityNet | VideoQA | video, text | 10'021 | 191.49 GB |
-| EGO4D | VideoQA | video, text | 1'506 | 137.00 GB |
-| FineAction | VideoQA | video, text | 7'504 | 169.76 GB |
-| HACS | VideoQA | video, text | 31'223 | 829.25 GB |
-| HiREST | VideoQA | video, text | 822 | 42.50 GB |
-| Perception Test | VideoQA | video, text | 2'135 | 25.98 GB |
-| ActivityNet | VideoQA | video, text | 9'064 | 181.24 GB |
-| HiREST | VideoQA | video, text | 525 | 27.54 GB |
-| YouCook2 | VideoQA | video, text | 1'180 | 77.65 GB |
-| DiDeMo | VideoQA | video, text | 7'452 | 33.90 GB |
-| EGO4D | VideoQA | video, text | 2'665 | 194.01 GB |
-| MedVidQA | VideoQA | video, text | 933 | 40.35 GB |
-| QuerYD | VideoQA | video, text | 1'562 | 50.69 GB |
-| YouCook2 | VideoQA | video, text | 2'270 | 158.77 GB |
-| EgoExoLearn (open-ended QA) | VideoQA | video, text | 9'998 | 1751.69 GB |
-| Breakfast Actions | VideoQA | video, text | 1'204 | 3.45 GB |
-| EgoExoLearn (multi-choice QA) | VideoQA | video, text | 6'832 | 1196.41 GB |
-| CrossTask (multi-choice QA) | VideoQA | video, text | 75'686 | 417.50 GB |
-| CrossTask (open-ended QA) | VideoQA | video, text | 20'399 | 112.02 GB |
-| EgoProceL (multi-choice QA) | VideoQA | video, text | 4'789 | 42.74 GB |
-| EgoProceL (open-ended QA) | VideoQA | video, text | 5'667 | 50.58 GB |
-| HC-STVG (multi-choice QA) | VideoQA | video, text | 147'799 | 796.18 GB |
-| HC-STVG (open-ended QA) | VideoQA | video, text | 41'050 | 221.82 GB |
-| TAPOS (multi-choice QA) | VideoQA | video, text | 33'941 | 218.50 GB |
-| TAPOS (open-ended QA) | VideoQA | video, text | 13'991 | 88.00 GB |
-| Multi-page OCR based on CommonCrawl pdf data | VQA | image, text | 7'262 | 48.19 GB |
-| Multi-page QA based on CommonCrawl pdf data | VQA | image, text | 455 | 31.88 GB |
-| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'281 | 0.68 GB |
-| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'285 | 0.67 GB |
-| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'282 | 0.67 GB |
-| Selection of public datasets (relabeled) | Image Reasoning | image, text | 13'843 | 4.18 GB |
-| Selection of public datasets (relabeled) | Image Reasoning | image, text | 18'442 | 3.89 GB |
-| Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
-| Perception Test (CoT) | VideoQA | video, text | 4'977 | 64.55 GB |
-<br>
+| Type | Data Type | Total Samples | Total Size (GB) |
+|------|-----------|---------------|------------------|
+| Function call | text | 8,000 | 0.02 |
+| Image Captioning | image, text | 1,422,102 | 1,051.04 |
+| Image Reasoning | image, text | 1,888,217 | 286.95 |
+| OCR | image, text | 9,830,570 | 5,317.60 |
+| Referring Expression Grounding | image, text | 14,694 | 2.39 |
+| Safety | image, text | 34,187 | 9.21 |
+| Safety | text | 57,223 | 0.52 |
+| Safety | video, text | 12,988 | 11.78 |
+| Text Instruction Tuning | text | 245,056 | 1.13 |
+| Text Reasoning | text | 225,408 | 4.55 |
+| VQA | image, text | 8,174,136 | 2,207.52 |
+| VQA | video, text | 40,000 | 46.05 |
+| Video Captioning | video, text | 3,289 | 6.31 |
+| Video Reasoning | video, text | 42,620 | 49.10 |
+| VideoQA | video, text | 1,371,923 | 17,641.79 |
+| Visual Instruction Tuning | image, text | 1,173,877 | 167.79 |
+| **TOTAL** | | **24,544,290** | **26,803.75** |

# Private Datasets <br>
-| Internal QA dataset on invoices | Image Reasoning | image, text | 11'258 | 10.19 GB |
-<br>
+| Type | Modalities | Total Samples | Total Size (GB) |
+|------|------------|---------------|------------------|
+| Image Reasoning | image, text | 17,729 | 15.41 |
+| Text Reasoning | text | 445,958 | 9.01 |
+| **TOTAL** | | **463,687** | **24.42** |

# Data Crawling and Scraping <br>
-| Internal Captioning dataset | Image Captioning | image, text | 24'998 | 6.97 GB |
-<br>
+| Type | Modalities | Total Samples | Total Size (GB) |
+|------|------------|---------------|------------------|
+| Image Captioning | image, text | 39,870 | 10.24 |
+| VQA | image, text | 40,348 | 3.94 |
+| VideoQA | video, text | 288,728 | 393.30 |
+| **TOTAL** | | **368,946** | **407.48** |

# User-Sourced Data (Collected by Provider including Prompts) <br>
<br>

# Self-Sourced Synthetic Data <br>
-| Random English characters for OCR | OCR | image, text | 14'525 | 5.65 GB |
-| Random English characters for OCR | OCR | image, text | 14'525 | 9.39 GB |
-| Synthetic sparse table dataset | OCR | image, text | 100'000 | 14.36 GB |
-| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 1'165'591 | 54.15 GB |
-| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 175'000 | 0.95 GB |
-| Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 1'922'012 | 28.00 GB |
-| Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 288'000 | 0.57 GB |
-| Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 67'000 | 0.22 GB |
-| Synthetic tool-calling data with seed tools from ToolBench, Glaive, xLAM and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 403'619 | 6.55 GB |
-| Synthetic safety data with responses from DeepSeek-R1-0528 | Text Reasoning | text | 30'710 | 0.12 GB |
-| Dummy conversation dataset | Text Reasoning | text | 2'262 | 0.00 GB |
-| Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 32'752 | 0.26 GB |
-| Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 3'636 | 0.01 GB |
-| Synthetic chat dataset with responses from DeepSeek-R1 | Text Reasoning | text | 389'350 | 3.30 GB |
-| Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 353'526 | 2.61 GB |
-| Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 361'733 | 1.12 GB |
-| Synthetic multilingual STEM from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct | Text Reasoning | text | 4'999'794 | 86.68 GB |
-| Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 545'844 | 5.25 GB |
-| Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 81'876 | 0.43 GB |
-| Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 1'591'641 | 58.63 GB |
-| Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 239'467 | 0.52 GB |
-| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Code | text | 1'165'591 | 54.15 GB |
-| Synthetic tool calling dataset from DeepSeek-R1-0528 | Text Reasoning | text | 74'044 | 46.43 GB |
-<br>
+| Type | Data Type | Total Samples | Total Size (GB) |
+|------|-----------|---------------|------------------|
+| Code | text | 1,165,591 | 54.15 |
+| OCR | image, text | 216,332 | 83.53 |
+| Text Reasoning | text | 12,727,857 | 295.80 |
+| **TOTAL** | | **14,109,780** | **433.48** |

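The aggregated tables added above replace the long per-dataset listings that were removed; each per-type total is just a group-by over the old rows. A minimal sketch of that roll-up, with a few rows copied from the removed table and column names chosen for illustration:

```python
# Illustrative only: shows how per-dataset rows roll up into per-type totals.
import pandas as pd

# A few rows copied from the removed per-dataset table (name, type, modalities, samples, size in GB).
rows = [
    ("VCG+ 112K", "VideoQA", "video, text", 164, 2.82),
    ("CLEVRER", "VQA", "video, text", 40_000, 46.05),
    ("ScreenQA", "VQA", "image, text", 302_004, 30.52),
    ("FUNSD", "OCR", "image, text", 149, 0.01),
]
df = pd.DataFrame(rows, columns=["dataset", "type", "modalities", "samples", "size_gb"])

# Sum samples and sizes per (type, modalities), mirroring the aggregated tables above.
summary = df.groupby(["type", "modalities"], as_index=False)[["samples", "size_gb"]].sum()
print(summary.to_string(index=False))
```
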
@@ -737,8 +496,8 @@ Evaluation benchmarks scores: <br>

# Inference: <br>
-**Acceleration Engine:**
-**Acceleration Engine:**
+**Acceleration Engine:** vLLM <br>
+**Acceleration Engine:** TRT-LLM <br>

**Test Hardware:** <br>
* NVIDIA L40S <br>