AI & ML interests

vision, multimedia, gradio, accessibility & cool demos

prithivMLmods posted an update about 11 hours ago
Explore OCR, Captioning, and Visual Understanding with Cutting-Edge Models on Hugging Face. 🤗🧪

I've put together a collection of Google Colab notebooks for experimenting with some of the most exciting models on the Hugging Face Hub, focused on OCR, image captioning, and visual understanding tasks. [Image-to-Text] / [Image-Text-to-Text]

> 📖 OCR-ReportLab-Notebooks : prithivMLmods/OCR-ReportLab-Notebooks

These notebooks are built for quick prototyping and run on free T4 GPUs, making them perfect for experimentation, testing ideas, or just exploring what's possible with modern vision-language models.

Note: The experimental notebooks are compiled with models that fit within the T4 GPU (free-tier) limits. More models along with their notebooks will be added over time.
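As a rough sketch of the kind of setup these notebooks rely on (the model ID and file name below are illustrative assumptions, not taken from the notebooks), an image-to-text model can be loaded in half precision so it fits comfortably within the free-tier T4's memory:

import torch
from transformers import pipeline

# Minimal example: run an image-to-text model on a free Colab T4 (~16 GB VRAM).
captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",  # example model; swap in any captioning/OCR model from the Hub
    torch_dtype=torch.float16,  # half precision keeps the weights within T4 memory
    device=0,  # run on the GPU
)

print(captioner("document.png"))  # path to a local image
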
prithivMLmods posted an update 3 days ago
Excited to introduce the new experimental model "Qwen2.5-VL-7B-Abliterated-Caption-it", which is performing exceptionally well on image captioning tasks. This variant is specifically tailored for Abliterated Captioning and Uncensored Image Captioning. It is designed to generate highly detailed and descriptive captions across a broad range of visual categories, including images with complex, sensitive, or nuanced content, while handling varying aspect ratios and resolutions. 🧪🤗

✨ Try the demo here : prithivMLmods/Qwen2.5-VL
✨ Qwen2.5-VL-7B-Abliterated-Caption-it : prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
✨ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
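A minimal captioning sketch for this variant, assuming the standard Qwen2.5-VL chat interface in recent versions of transformers (the prompt, file name, and generation settings are illustrative):

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Write a highly detailed caption for this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))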

.
.
.
To know more, visit the model card of the respective model.
hesamation posted an update 4 days ago
longer context doesn't produce better responses; it can even hurt your llm/agent. a 1M context window doesn't automatically make models smarter. it's not about the size; it's how you use it.

here are 4 types of context failure and why each one happens:

1. context poisoning: if a hallucination finds its way into your context, the agent will rely on that false information to make its future moves. for example, if the agent hallucinates the "task description", all of its planning to solve the task will also be corrupted.

2. context distraction: when the context becomes too bloated, the model focuses too much on it rather than coming up with novel ideas or following what it learned during training. as the Gemini 2.5 Pro technical report points out, once context grows significantly beyond 100K tokens, "the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans".

3. context confusion: everyone lost it when MCPs became popular; it seemed like AGI had been achieved. I suspected something was wrong, and there was: it's not just about providing tools. bloating the context with tool definitions derails the model from selecting the right one! even if you can fit all your tool metadata in the context, as the number of tools grows, the model gets confused over which one to pick.

4. context clash: if you exchange messages with a model step by step and provide information as you go, chances are you'll get worse performance than if you had provided all the useful information at once. once the model's context fills with wrong information, it's more difficult to guide it to embrace the right info. agents pull information from tools, documents, user queries, etc., and there is a chance that some of this information contradicts the rest, which is not good news for agentic applications.

check this article by Drew Breunig for a deeper read: https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html?ref=blog.langchain.com
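to make point 3 concrete, here's a tiny illustrative sketch (the tool names and scoring are made up, not from the article): instead of dumping every tool definition into the context, pre-select the few tools most relevant to the current query, e.g. via keyword or embedding similarity.

# naive keyword-overlap scoring, purely illustrative; embedding retrieval over
# tool descriptions is the more common real-world choice.
def select_tools(query: str, tools: dict, k: int = 3) -> dict:
    """return the k tools whose descriptions share the most words with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        tools.items(),
        key=lambda item: len(query_words & set(item[1].lower().split())),
        reverse=True,
    )
    return dict(scored[:k])

tools = {
    "web_search": "search the web for recent news and articles",
    "calculator": "evaluate arithmetic and math expressions",
    "image_ocr": "extract text from an image or scanned document",
    "calendar": "create, list and update calendar events",
}

# only the selected subset goes into the system prompt / tool schema
print(select_tools("extract the text from this scanned receipt image", tools, k=2))
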
prithivMLmods posted an update 4 days ago
olmOCR [Allen AI] just got an upgrade! 📈🧑‍🍳

allenai/olmOCR-7B-0725 is fine-tuned with allenai/olmOCR-mix-0225 on top of Qwen/Qwen2.5-VL-7B-Instruct, pushing the boundaries of OCR technology. It takes a single document image as input, with the longest side resized to 1288 pixels. It is a high-quality, openly available approach to optical character recognition for parsing PDFs and other complex documents.
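A quick sketch of the preprocessing described above (the file name is illustrative): resize the document image so its longest side is 1288 pixels while preserving the aspect ratio.

from PIL import Image

def resize_longest_side(path: str, target: int = 1288) -> Image.Image:
    image = Image.open(path)
    scale = target / max(image.size)  # scale so the longest side becomes `target`
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.LANCZOS)

page = resize_longest_side("invoice_page.png")
print(page.size)  # the longest side is now 1288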

Try the demo here: prithivMLmods/Multimodal-OCR

✨ Model: allenai/olmOCR-7B-0725
✨ Model [fp8]: allenai/olmOCR-7B-0725-FP8
✨ Multimodal Implementations Space Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To know more, visit the model card of the respective model.
AtAndDev posted an update 5 days ago
Qwen 3 Coder is a personal attack on K2, and I love it.
It achieves near-SOTA on LCB without reasoning.
Finally people are understanding that reasoning isn't necessary for high benches...

Qwen ftw!

DECENTRALIZE DECENTRALIZE DECENTRALIZE
Tonic posted an update 7 days ago
👋 Hey there folks,

just submitted my plugin idea to the G-Assist Plugin Hackathon by @nvidia. Check it out; it's a great way to use a local SLM (small language model) on a Windows machine to easily get things done locally! https://github.com/NVIDIA/G-Assist
prithivMLmods posted an update 8 days ago
Upgraded the step-by-step notebook for fine-tuning SigLIP2 on domain-specific image classification tasks. The notebook supports both datasets with predefined train/test splits and those with only a train split, making it suitable for low-resource, custom, and real-world classification scenarios. 📒👉

➺ FineTuning-SigLIP2-Notebook : prithivMLmods/FineTuning-SigLIP2-Notebook

➺ GitHub : https://github.com/PRITHIVSAKTHIUR/FineTuning-SigLIP-2

➺ In the first scenario, datasets include predefined train and test splits, enabling conventional supervised learning and generalization evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

➺ In the second scenario, only a training split is available; in such cases, the training set is either partially reserved for validation or reused entirely for evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

This flexibility supports experimentation in constrained or domain-specific settings, where standard test annotations may not exist.
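For the second scenario, a minimal sketch of holding out part of a train-only dataset for validation (the dataset ID and split ratio are placeholders, not the notebook's defaults):

from datasets import load_dataset

dataset = load_dataset("your-username/your-image-dataset", split="train")  # replace with your dataset
splits = dataset.train_test_split(test_size=0.1, seed=42)  # reserve 10% of the train split for evaluation
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))
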
Tonic posted an update 9 days ago
🙋🏻‍♂️ Hey there folks,

Yesterday, Nvidia released a reasoning model that beats o3 on science, math, and coding!

Today you can try it out here: Tonic/Nvidia-OpenReasoning

hope you like it!
prithivMLmods posted an update 9 days ago
Dropping the general-purpose reasoning dataset Poseidon-Reasoning-5M, which supports general thought processes, math, and science, featuring a diverse mixture of domains 🌊 : prithivMLmods/Poseidon-Reasoning-5M

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-5M", split="data")

The compact version, Poseidon-Reasoning-Mini-300K, is as follows : prithivMLmods/Poseidon-Reasoning-Mini-300K


from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-Mini-300K", split="train")


Collection : prithivMLmods/poseidon-reasoning-6879ca98e118b307c781a9ba
prithivMLmods posted an update 13 days ago
Open Omega Ω (Forge, Atom, Explora):
A Fusion of Math, Science, and Coding 🧪🤗

Datasets :
⌯⌲ Open-Omega-Forge-1M [Mathematics, Coding, and Science]: prithivMLmods/Open-Omega-Forge-1M
⌯⌲ Open-Omega-Atom-1.5M [Mathematics and Science]: prithivMLmods/Open-Omega-Atom-1.5M
⌯⌲ Open-Omega-Explora-2.5M [Forge + Atom]: prithivMLmods/Open-Omega-Explora-2.5M
⌯⌲ Others [subordinate portion]: a curated and blended modular dataset.

Models :
> Omega-Qwen3-Atom-8B : prithivMLmods/Omega-Qwen3-Atom-8B
> Omega-Qwen2.5-Coder-3B : prithivMLmods/Omega-Qwen2.5-Coder-3B

Dataset Collection: prithivMLmods/open-omega-a-fusion-of-math-science-and-coding-68756c37769fa39c4055cc0e

.
.
.
For more information, refer to the dataset card(s).

prithivMLmods posted an update 15 days ago
Excited to introduce new models that are performing exceptionally well in document OCR, image captioning, and visual understanding tasks. Megalodon-OCR and Perseus-Doc-VL have both demonstrated significant improvements across key areas. You can explore live demos on Hugging Face Spaces to compare their performance with other top-tier models available on the Hub. 🤗📄

Models & Spaces :
> Megalodon-OCR (3B) : prithivMLmods/Megalodon-OCR-Sync-0713
> Perseus-Doc-vl (7B): prithivMLmods/Perseus-Doc-vl-0712
> Doc-VLMs-OCR : prithivMLmods/Doc-VLMs-OCR
> core-OCR : prithivMLmods/core-OCR


Datasets Caption Mix :
> Corvus-OCR-Caption-Mix : prithivMLmods/Corvus-OCR-Caption-Mix
> Corvus-OCR-Caption-Mini-Mix : prithivMLmods/Corvus-OCR-Caption-Mini-Mix

Collections :
> Corvus OCR Caption Mix: prithivMLmods/corvus-ocr-caption-mix-687349bfaceffbd10976f0cc
> Captioning / OCR / DocTable : prithivMLmods/captioning-ocr-doctable-687382e1da822008bb5c06f2

GitHub :
> OCR-ReportLab : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab/blob/main/Megalodon-OCR-Sync-0713-ColabNotebook/Megalodon_OCR_Sync_0713_ReportLab.ipynb

Others Spaces :
> Multimodal-OCR : prithivMLmods/Multimodal-OCR
> Multimodal-VLMs : https://huggingface.co/spaces/prithivMLmods/Multimodal-VLMs
> Multimodal-OCR2 : prithivMLmods/Multimodal-OCR2
> Florence-2-Image-Caption : prithivMLmods/Florence-2-Image-Caption
> VisionScope-R2 : prithivMLmods/VisionScope-R2
> DocScope-R1 : prithivMLmods/DocScope-R1

.
.
.
To know more, visit the model card of the respective model.
Tonic posted an update 15 days ago
🙋🏻‍♂️ Normalize adding compute & runtime traces to your model cards
hesamation posted an update 16 days ago
in case you didn't know, Claude now has a developer training course with certificates,

this is better than anything you can find on Coursera.

it covers Claude Code, MCP and its advanced topics, and even more:

https://www.anthropic.com/learn/build-with-claude
prithivMLmods posted an update 18 days ago
Demo of OCR & Math QA using multi-capable VLMs like MonkeyOCR-pro-1.2B, R1-One-Vision, VisionaryR1, Vision Matters-7B, and ViGaL-7B, all running together with support for both image and video inference.

✦ Demo Spaces :
β€· Multimodal VLMs : https://huggingface.co/spaces/prithivMLmods/Multimodal-VLMs

✦ Models :
β€· Visionary R1 : maifoundations/Visionary-R1
β€· MonkeyOCR [1.2B] : echo840/MonkeyOCR-pro-1.2B
β€· ViGaL 7B : yunfeixie/ViGaL-7B
β€· Lh41-1042-Magellanic-7B-0711 : prithivMLmods/Lh41-1042-Magellanic-7B-0711
β€· Vision Matters 7B : Yuting6/Vision-Matters-7B
β€· WR30a-Deep-7B-0711 : prithivMLmods/WR30a-Deep-7B-0711

✦ MonkeyOCR-pro-1.2B Colab T4 Demo [ notebook ]
β€· MonkeyOCR-pro-1.2B-ReportLab : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab/blob/main/MonkeyOCR-0709/MonkeyOCR-pro-1.2B-ReportLab.ipynb

✦ GitHub : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab
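Video inference in demos like this is typically handled by sampling a handful of evenly spaced frames and passing them to the model as images; here is a rough sketch of that idea (an assumption about the general approach, not the exact Space code, with an illustrative file name):

import cv2

def sample_frames(video_path: str, num_frames: int = 10):
    """Grab num_frames evenly spaced RGB frames from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        return []
    indices = [round(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB for PIL/VLM input
    cap.release()
    return frames

frames = sample_frames("clip.mp4")
print(len(frames))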

The community GPU grant was given by Hugging Face; special thanks to them. 🤗🚀

.
.
.
To know more, visit the model card of the respective model.
Tonic posted an update 21 days ago
Who's going to the Raise Summit in Paris tomorrow?

If you're around, I would love to meet you :-)
prithivMLmods posted an update 25 days ago
Multimodal OCR with ReportLab? On Colab T4? (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B?) ... Yeah, it's possible. I've made a dedicated Colab notebook to experiment with these models (all built on top of Qwen2.5 VL). 🤗🚀

Download notebooks here :

✦︎ NanonetsOCR : https://colab.research.google.com/drive/1VvA-amvSVxGdWgIsh4_by6KWOtEs_Iqp
✦︎ MonkeyOCR : https://colab.research.google.com/drive/1vPCojbmlXjDFUt06FJ1tjgnj_zWK4mUo
✦︎ OCRFluxOCR : https://colab.research.google.com/drive/1TDoCXzWdF2hxVLbISqW6DjXAzOyI7pzf
✦︎ TyphoonOCR : https://colab.research.google.com/drive/1_59zvLNnn1kvbiSFxzA1WiqhpbW8RKbz

🜲 GitHub : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab-Notebooks

What does it do?

1. Performs OCR on the input image
2. Generates a DOCX or PDF file with the input image and the extracted text
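A rough sketch of step 2, assuming the OCR text has already been extracted (file names and page sizing are illustrative): place the input image and the recognized text into a single PDF with ReportLab.

from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import cm
from reportlab.platypus import Image, Paragraph, SimpleDocTemplate, Spacer

def build_report(image_path: str, ocr_text: str, out_path: str = "ocr_report.pdf") -> None:
    styles = getSampleStyleSheet()
    story = [
        Image(image_path, width=16 * cm, height=10 * cm),  # the original document image
        Spacer(1, 0.5 * cm),
        Paragraph(ocr_text.replace("\n", "<br/>"), styles["Normal"]),  # the extracted text
    ]
    SimpleDocTemplate(out_path, pagesize=A4).build(story)

build_report("scanned_page.png", "Extracted text goes here...")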

.
.
.
To know more, visit the model card of the respective model.
prithivMLmods posted an update 26 days ago
A bunch of comparable demos for multimodal VLMs (excelling in OCR, cinematography understanding, spatial reasoning, etc.) are now up on the Hub 🤗, covering models released up to June '25.

✦ Demo Spaces :

> [Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B, SmolDocling] : prithivMLmods/Multimodal-OCR2
> [GLM-4.1v, docscopeOCR-7B, MonkeyOCR, coreOCR-7B] : prithivMLmods/core-OCR
> [Camel-Doc-OCR, ViLaSR-7B, OCRFlux-3B, ShotVL-7B] : prithivMLmods/Doc-VLMs-v2-Localization
> [SkyCaptioner-V1, SpaceThinker-3B, coreOCR-7B, SpaceOm-3B] : prithivMLmods/VisionScope-R2
> [RolmOCR-7B, Qwen2-VL-OCR-2B, Aya-Vision-8B, Nanonets-OCR-s] : prithivMLmods/Multimodal-OCR
> [DREX-062225-7B, Typhoon-OCR-3B, olmOCR-7B-0225, VIREX-062225-7B] : prithivMLmods/Doc-VLMs-OCR
> [Cosmos-Reason1-7B, docscopeOCR-7B, Captioner-7B, visionOCR-3B] : prithivMLmods/DocScope-R1

✦ Space Collection : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To know more, visit the model card of the respective model.
Nymbo posted an update 28 days ago
Anyone know how to reset Claude web's MCP config? I connected mine when the HF MCP first released, with just the default example spaces added. I've since added lots of other MCP spaces, but Claude.ai doesn't update the available tools... "Disconnecting" the HF integration does nothing; deleting it and adding it again does nothing.

Refreshing tools works fine in VS Code because I can manually restart it in mcp.json, but claude.ai has no such option. Anyone got any ideas?
prithivMLmods posted an update 28 days ago
The demo for Camel-Doc-OCR-062825 (exp) is optimized for document retrieval and direct Markdown (.md) generation from images and PDFs. Additional demos include OCRFlux-3B (document OCR), ViLaSR (spatial reasoning with visual drawing), and ShotVL (cinematic language understanding).

✦ Space : prithivMLmods/Doc-VLMs-v2-Localization

Models :
β€· camel-doc-ocr-062825 : prithivMLmods/Camel-Doc-OCR-062825
β€· ocrflux-3b : ChatDOC/OCRFlux-3B
β€· vilasr : AntResearchNLP/ViLaSR
β€· shotvl : Vchitect/ShotVL-7B

β€· Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

The community GPU grant was given by Hugging Face; special thanks to them. This space supports image and video inference, with a result markdown canvas and object detection/localization. 🤗🚀

.
.
.
To know more, visit the model card of the respective model.
prithivMLmods posted an update about 1 month ago
A demo featuring DREX-062225-exp (Document Retrieval and Extraction eXpert ~ experimental), typhoon-ocr-3b (a bilingual document parsing model built specifically for real-world documents), VIREX-062225-exp (Video Information Retrieval and Extraction eXpert ~ experimental), and olmOCR-7B-0225-preview (a document parsing model based on Qwen2-VL). 🤗

✦ Demo : prithivMLmods/Doc-VLMs-OCR (with .md canvas)

β€· DREX-062225-exp : prithivMLmods/DREX-062225-exp
β€· typhoon-ocr-3b : scb10x/typhoon-ocr-3b
β€· VIREX-062225-exp : prithivMLmods/VIREX-062225-exp
β€· olmOCR-7B-0225-preview : allenai/olmOCR-7B-0225-preview

β€· Collection : prithivMLmods/doc-vl-685839064a863e1cd23be3f1
β€· Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
.
.
.

To know more, visit the model card of the respective model.