💬🔥Releasing idefics2-8b-chatty, the chat-optimized version of Idefics2!
It is a very efficient (8B-parameter) state-of-the-art VLM, has been red-teamed, and comes with a few surprises:
- 📖 A paper dissecting many of the experimental insights we learned while building Idefics2
- 🏎️ TGI integration for blazing-fast inference (you can already run it locally with < 24GB of GPU memory)
- 🏆 Ranked 2nd in its category (< 10B parameters, open weights) on the awesome Open VLM Leaderboard, and now appearing in the incredible Vision Arena
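The "< 24GB of GPU memory" figure follows from the parameter count: a back-of-the-envelope sketch (illustrative only; real usage also needs room for activations and the KV cache) of the weight footprint at different precisions:

```python
# Back-of-the-envelope GPU memory estimate for an 8B-parameter model.
# Illustrative only: real inference adds activations, KV cache, and overhead.

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just for the weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9

n_params = 8e9  # Idefics2-8b

for dtype, nbytes in [("float32", 4), ("bfloat16", 2), ("int4", 0.5)]:
    print(f"{dtype}: ~{weight_memory_gb(n_params, nbytes):.0f} GB")
```

In bfloat16 the weights alone take ~16 GB, which is why the model fits on a single 24GB GPU with headroom for inference.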
We released a resource that might come in handy: The Cauldron 🍯
The Cauldron is a massive, manually curated collection of 50 vision-language datasets for instruction fine-tuning: 3.6M images and 30.3M query/answer pairs.
It covers a wide variety of downstream uses: visual question answering on natural images, OCR, document/chart/figure/table understanding, textbook/academic questions, reasoning, captioning, spotting differences between two images, and screenshot-to-code. HuggingFaceM4/the_cauldron
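To make the "query/answer pairs" concrete, here is a minimal sketch of a Cauldron-style record and how it might be flattened into chat turns. The field names (`images`, `texts`, `user`, `assistant`, `source`) are assumptions based on the dataset card; the image and Q/A contents below are hypothetical placeholders, so check the card on the Hub before relying on them.

```python
# Minimal sketch of a Cauldron-style instruction fine-tuning record.
# Field names are assumptions from the dataset card; contents are placeholders.

def to_chat_messages(record: dict) -> list[dict]:
    """Flatten one record's Q/A turns into a chat-style message list."""
    messages = []
    for turn in record["texts"]:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    return messages

record = {
    "images": ["<PIL.Image of a bar chart>"],  # placeholder for the actual image
    "texts": [
        {"user": "What does the tallest bar show?",       # hypothetical question
         "assistant": "It shows 2019 revenue.",           # hypothetical answer
         "source": "chart_qa"},                           # hypothetical source tag
    ],
}
print(to_chat_messages(record))
```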
💪 Strong 8B-parameter model: often on par with open 30B counterparts.
🔓 Open license: Apache 2.0.
🚀 Strong improvement over Idefics1: +12 points on VQAv2 and +30 points on TextVQA, with 10x fewer parameters.
📚 Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
🕵️‍♀️ Transparent training data: inspect and build upon all the data (tens of TB) we trained on.
🔲 More natural image processing: incorporating strategies to treat images at their native resolution and native aspect ratio.
📸 High-resolution images: resolutions up to 980 x 980, with strategies that trade computational efficiency for performance.
😎 2 checkpoints: releasing both the base checkpoint and the instruction fine-tuned checkpoint. Chat version to come.
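The "native resolution and native aspect ratio" point can be sketched as a simple resizing rule: scale an image so its longest side fits within the 980-pixel cap without distorting its proportions. This is a simplified stand-in for the model's actual image preprocessing, not the exact implementation:

```python
# Sketch of aspect-ratio-preserving resizing with a 980px cap on the longest
# side. A simplified stand-in for the real preprocessing, not the exact code.

def fit_within(width: int, height: int, max_side: int = 980) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side,
    keeping the native aspect ratio; smaller images are left untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

print(fit_within(1960, 980))  # wide image scaled down, aspect ratio preserved
print(fit_within(640, 480))   # small image kept at its native resolution
```

The key design point is that the aspect ratio is preserved rather than squashing every image into a fixed square, which is what lets the model trade compute (more image tokens for larger inputs) for performance.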