💬🔥Releasing idefics2-8b-chatty, the chat-optimized version of Idefics2!
It is a very efficient (8B-parameter) state-of-the-art VLM, has been red-teamed, and comes with a few surprises:
- 📖 A paper dissecting many of the experimental insights we learned while building Idefics2
- 🏎️ TGI integration for blazing-fast inference (you can already run it locally with < 24GB of GPU memory)
- 🏆 Ranked 2nd in its category (< 10B parameters, open weights) on the awesome Open VLM Leaderboard, and now appearing in the incredible Vision Arena
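The "< 24GB of GPU memory" figure follows from the parameter count: a back-of-the-envelope sketch (illustrative only; real usage also needs room for activations and the KV cache) of the weight footprint at different precisions:

```python
# Back-of-the-envelope GPU memory estimate for an 8B-parameter model.
# Illustrative only: real inference adds activations, KV cache, and overhead.

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just for the weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9

n_params = 8e9  # Idefics2-8b

for dtype, nbytes in [("float32", 4), ("bfloat16", 2), ("int4", 0.5)]:
    print(f"{dtype}: ~{weight_memory_gb(n_params, nbytes):.0f} GB")
```

In bfloat16 the weights alone take ~16 GB, which is why the model fits on a single 24GB GPU with headroom for inference.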
We released a resource that might come in handy: The Cauldron 🍯
The Cauldron is a massive, manually curated collection of 50 vision-language datasets for instruction fine-tuning: 3.6M images and 30.3M query/answer pairs.
It covers a wide variety of downstream uses: visual question answering on natural images, OCR, document/chart/figure/table understanding, textbook/academic questions, reasoning, captioning, spotting differences between two images, and screenshot-to-code. HuggingFaceM4/the_cauldron
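To make the "query/answer pairs" concrete, here is a minimal sketch of a Cauldron-style record and how it might be flattened into chat turns. The field names (`images`, `texts`, `user`, `assistant`, `source`) are assumptions based on the dataset card; the image and Q/A contents below are hypothetical placeholders, so check the card on the Hub before relying on them.

```python
# Minimal sketch of a Cauldron-style instruction fine-tuning record.
# Field names are assumptions from the dataset card; contents are placeholders.

def to_chat_messages(record: dict) -> list[dict]:
    """Flatten one record's Q/A turns into a chat-style message list."""
    messages = []
    for turn in record["texts"]:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    return messages

record = {
    "images": ["<PIL.Image of a bar chart>"],  # placeholder for the actual image
    "texts": [
        {"user": "What does the tallest bar show?",       # hypothetical question
         "assistant": "It shows 2019 revenue.",           # hypothetical answer
         "source": "chart_qa"},                           # hypothetical source tag
    ],
}
print(to_chat_messages(record))
```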
💪 Strong 8B-parameter model: often on par with open 30B counterparts.
🔓 Open license: Apache 2.0.
🚀 Strong improvement over Idefics1: +12 points on VQAv2 and +30 points on TextVQA, with 10x fewer parameters.
📚 Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
🕵️‍♀️ Transparent training data: inspect and build upon all the data (tens of TB) we trained on.
🔲 More natural image processing: incorporating strategies to treat images at their native resolution and native aspect ratio.
📸 High-resolution images: resolutions up to 980 x 980, with strategies that trade computational efficiency for performance.
😎 2 checkpoints: releasing both the base checkpoint and the instruction fine-tuned checkpoint. Chat version to come.
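The "native resolution and native aspect ratio" point can be sketched as a simple resizing rule: scale an image so its longest side fits within the 980-pixel cap without distorting its proportions. This is a simplified stand-in for the model's actual image preprocessing, not the exact implementation:

```python
# Sketch of aspect-ratio-preserving resizing with a 980px cap on the longest
# side. A simplified stand-in for the real preprocessing, not the exact code.

def fit_within(width: int, height: int, max_side: int = 980) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side,
    keeping the native aspect ratio; smaller images are left untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

print(fit_within(1960, 980))  # wide image scaled down, aspect ratio preserved
print(fit_within(640, 480))   # small image kept at its native resolution
```

The key design point is that the aspect ratio is preserved rather than squashing every image into a fixed square, which is what lets the model trade compute (more image tokens for larger inputs) for performance.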