AI & ML interests

The Fellowship is a network of exceptional people from different backgrounds who contribute to open-source machine learning 🧙‍♂️🦸‍♀️🦹🧝‍♂️

merve posted an update 4 days ago
so many open LLMs and image LoRAs dropped this past week; here are some picks for you 🫑 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a set of 7B translation models, Seed-X, including ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the much larger Qwen3-235B, under a cc-by-4.0 license 👏 nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is an image+text in, image+text out model (MIT)

LoRAs
> There are a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠
merve posted an update 11 days ago
Fine-tune Gemma3n on videos with audio, on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫑 we use LoRA, audio resampling & video downsampling to be able to train within <40GB VRAM (a minimal sketch follows below)

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
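For orientation, here's a minimal sketch of the three tricks mentioned above (LoRA, audio resampling, video frame downsampling). The auto class, target modules and hyperparameters are assumptions; refer to the notebook for the exact setup.

```python
# Minimal sketch (not the notebook itself): resample audio, subsample video frames,
# and wrap the model in a LoRA adapter so training fits in <40GB VRAM.
import torch
import torchaudio
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "google/gemma-3n-E4B-it"  # assumption: the notebook may use a different checkpoint/auto class
processor = AutoProcessor.from_pretrained(MODEL_ID)  # handles image, audio and text inputs
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def resample_audio(waveform: torch.Tensor, orig_sr: int, target_sr: int = 16_000) -> torch.Tensor:
    """Bring every clip to the sampling rate the audio feature extractor expects."""
    return torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=target_sr)

def downsample_frames(frames: list, stride: int = 8) -> list:
    """Keep every `stride`-th video frame to shorten the visual token sequence."""
    return frames[::stride]

# LoRA: train a small set of adapter weights instead of the full model.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```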
merve posted an update 13 days ago
past week had huuuge releases 💗
here are our picks 🔥 find more models, datasets, and demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new SotA LLM with 1T total / 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size; it offers a thinking mode 💭 and ships with the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA
merve posted an update 19 days ago
GitHub has been refusing to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision
merve posted an update 20 days ago
ByteDance released Tar 1.5B and 7B: image+text in, image+text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They use an image tokenizer unified with the text tokenizer, and de-tokenize with either of two decoders (an LLM or a diffusion model).
The model itself is a full LLM (Qwen2); the tokenizer converts images into tokens 🤯
chansung posted an update 20 days ago
YAML engineering is becoming more important than ever, from infra provisioning to model training (recipes).

Here, I built a simple editor, first for @dstackai, and I will share the live endpoint this week. Let me know what you think about this approach.

Based on this approach, if people find it useful, I am going to do the same thing for LLM training recipes for popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me hear your thoughts.
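To make the "YAML engineering" point concrete, here is a small hedged sketch of how such recipes can be sanity-checked programmatically; the key names are hypothetical, not dstack's or any framework's actual schema.

```python
# Illustrative only: load a training-recipe YAML and check a few required keys
# before handing it to an infra tool. The key names below are hypothetical.
import sys
import yaml  # pip install pyyaml

REQUIRED_KEYS = {"model", "dataset", "learning_rate"}  # hypothetical recipe fields

def validate_recipe(path: str) -> dict:
    with open(path) as f:
        recipe = yaml.safe_load(f)
    missing = REQUIRED_KEYS - recipe.keys()
    if missing:
        raise ValueError(f"{path}: missing keys {sorted(missing)}")
    return recipe

if __name__ == "__main__":
    print(validate_recipe(sys.argv[1]))
```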
merve posted an update 20 days ago
Huge drops in open AI this past week!
Find more models, datasets, demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫑
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and French
👀 aharley/alltracker is a new pixel tracking model by Stanford, demo is here aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset
merve posted an update 25 days ago
SOOOO MANY MODEL RELEASES 😍
Here are some picks from the past week 🤗

> ByteDance/XVerse is a new identity-preserving image generation model 🖼️
> google/gemma-3n-E4B-it is an any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1, one of two new state-of-the-art visual document retrievers 📑
> New version of Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c
tomaarsen posted an update 26 days ago
‼️ Sentence Transformers v5.0 is out! The biggest update yet introduces sparse embedding models, encode method improvements, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:

1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:

- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co , @opensearch-project , @NAVER LABS Europe, @qdrant , @IBM , etc.
- Decode interpretable embeddings to understand token importance
- Hybrid search integration to get the best of both worlds

2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods automatically apply predefined prompts (see the sketch at the end of this post)
- No more manual pool management: just pass a device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach

3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures

4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models

Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0

What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!
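As a rough usage sketch of the new encode_query / encode_document split combining a dense and a sparse model (the two checkpoints named below are assumptions; any dense Sentence Transformer and any SPLADE-style checkpoint should behave the same way):

```python
# Sketch of the v5 query/document encode split for dense + sparse ("hybrid") retrieval.
from sentence_transformers import SentenceTransformer, SparseEncoder

dense = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed dense checkpoint
sparse = SparseEncoder("naver/splade-v3")                              # assumed SPLADE checkpoint

queries = ["what is hybrid search?"]
docs = ["Hybrid search combines sparse lexical matching with dense semantic similarity."]

# encode_query / encode_document apply each model's predefined prompts automatically.
dense_scores = dense.similarity(dense.encode_query(queries), dense.encode_document(docs))
sparse_scores = sparse.similarity(sparse.encode_query(queries), sparse.encode_document(docs))

# Naive hybrid score: a weighted sum of the two similarity matrices.
print(0.5 * dense_scores + 0.5 * sparse_scores)
```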
merve posted an update about 1 month ago
Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview PDFs more easily than before!

on top of this, there's a PdfFolder format to load PDF datasets more quickly 💨 (loading sketch at the end of this post)
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc., you can keep them in a metadata.csv file in the same folder 🤝

read the document dataset docs: https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here: https://huggingface.co/datasets?modality=modality:document&sort=trending 📖
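By analogy with the imagefolder/audiofolder loaders, loading such a layout should look roughly like the sketch below; the loader name and row contents are assumptions, so double-check the document dataset docs linked above.

```python
# Sketch: load a PDF dataset laid out as folder/train/doc1.pdf, folder/train/doc2.pdf, ...
# An optional metadata.csv in the same folder contributes labels, bounding boxes, etc.
from datasets import load_dataset

dataset = load_dataset("pdffolder", data_dir="folder")  # "pdffolder" loader name is an assumption
print(dataset["train"][0])  # one row per PDF, plus any metadata.csv columns
```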
merve posted an update about 1 month ago
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⬇️
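If you'd rather try it from Python than the demo Space, a minimal sketch looks roughly like this; the checkpoint name and the pair-input layout are assumptions, so check the model card for the exact usage.

```python
# Sketch: match keypoints between two images of the same place with LightGlue via transformers.
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "ETH-CVG/lightglue_superpoint"  # assumed LightGlue + SuperPoint checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

image1 = Image.open("place_view1.jpg")
image2 = Image.open("place_view2.jpg")

# The processor takes image pairs; the outputs hold matched keypoints and matching scores.
inputs = processor([[image1, image2]], return_tensors="pt")
outputs = model(**inputs)
print(outputs.keys())
```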
merve posted an update about 1 month ago
Release picks of the past week are here! Find more models, datasets, Spaces here merve/june-20-releases-68594824d1f4dfa61aee3433

🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with fewer tokens, supports long documents and videos 👏 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5-VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their previous model with better function calling & instruction following (OS)

🗣️ Audio
> Google released google/magenta-realtime, a real-time music generation & audio synthesis model (CC-BY-4.0)
> kyutai released new speech-to-text models that come in 1B & 2B sizes (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delays

3D
> Tencent released tencent/Hunyuan3D-2.1, an image-to-3D model (see below)