The Fellowship is a network of exceptional people from different backgrounds who contribute to open-source machine learning
Fine-tune Gemma3n on videos with audio inside, on a Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
Keep in mind, it's made for educational purposes 🫡 We use LoRA, audio resampling & video downsampling to be able to train in under 40 GB of VRAM. Stretch modalities and unfreeze layers as you wish! merve/smol-vision
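For the curious, here is a minimal sketch of the LoRA part of such a setup, assuming the model loads through transformers' AutoModelForImageTextToText and the standard peft APIs; the checkpoint name, model class, and target module names are assumptions, not the notebook's exact values.

```python
# Hedged sketch: LoRA setup for multimodal fine-tuning under a tight VRAM budget.
# Checkpoint name, model class, and target_modules are assumptions; check the notebook/model card.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3n-E2B-it"  # assumption: pick the Gemma3n checkpoint you actually use

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 weights to keep memory under control
    device_map="auto",
)

# LoRA: train small adapter matrices instead of the full model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed names)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of params should be trainable
```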
They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model). The model is actually a full LLM (Qwen2); the tokenizer converts images into tokens 🤯
YAML engineering is becoming more important than ever, from infra provisioning to model training (recipes).
Here, I built a simple editor, first for @dstackai, and I will share the live endpoint this week. Let me know what you think about this approach.
If people find this approach useful, I am going to do the same thing for LLM training recipes for popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me hear your thoughts.
‼️ Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, improvements to the encode methods, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:
1️⃣ Sparse Encoder Models
Brand-new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% of the values are non-zero:
- Full support for SPLADE, Inference-free SPLADE, and CSR architectures
- 4 new modules, 12 new losses, 9 new evaluators
- Integrations with @elastic-co, @opensearch-project, @NAVER LABS Europe, @qdrant, @IBM, etc.
- Decode interpretable embeddings to understand token importance (see the sketch below)
- Hybrid search integration to get the best of both worlds
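To make this concrete, here is a minimal usage sketch with the new SparseEncoder class; the checkpoint name and the exact decode signature are assumptions, so check the v5.0 release docs.

```python
# Hedged sketch: encoding with a sparse embedding model in Sentence Transformers v5.
# The checkpoint name and the decode() signature are assumptions; consult the v5.0 docs.
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

sentences = [
    "Sparse embeddings keep only a few non-zero dimensions.",
    "Hybrid search combines sparse and dense retrieval.",
]
embeddings = model.encode(sentences)  # vocabulary-sized vectors, <1% non-zero
print(embeddings.shape)               # e.g. (2, 30522) for a BERT-vocab SPLADE model

# Inspect which tokens carry the weight (interpretable sparse dimensions)
decoded = model.decode(embeddings, top_k=10)  # per-sentence [(token, weight), ...] lists
print(decoded[0])
```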
2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods that automatically use predefined prompts
- No more manual pool management: just pass a device list directly to encode() (sketch below)
- Much cleaner and easier to use than the old multi-process approach
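Roughly, the new methods look like this; the behaviour follows the description above, but treat the exact argument names as assumptions and verify against the docs.

```python
# Hedged sketch: query/document encoding and multi-device encoding in Sentence Transformers v5.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# encode_query / encode_document apply the model's predefined prompts automatically
query_emb = model.encode_query("What is hybrid search?")
doc_embs = model.encode_document([
    "Hybrid search combines sparse and dense retrieval.",
    "Dense embeddings capture semantic similarity.",
])

# Multi-process encoding without manual pool management:
# pass the target devices straight to encode() instead of start_multi_process_pool()
embeddings = model.encode(
    ["a large corpus of sentences", "another sentence"],
    device=["cuda:0", "cuda:1"],  # assumed argument name for the device list
    batch_size=64,
)
```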
3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs. documents (sketch below)
- Custom learning rates for different parameter groups
- Composite loss logging: see individual loss components
- Perfect for two-tower architectures
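A rough sketch of wiring up an asymmetric model with the Router module; the Router.for_query_document constructor and the module composition shown here are assumptions based on the description above, so verify against the v5.0 API reference.

```python
# Hedged sketch: an asymmetric model where queries and documents take different processing paths.
# Router.for_query_document and the exact composition below are assumptions; check the API reference.
from sentence_transformers import SentenceTransformer, models

transformer = models.Transformer("distilbert-base-uncased")
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")

router = models.Router.for_query_document(
    query_modules=[transformer, pooling],     # path used by encode_query()
    document_modules=[transformer, pooling],  # path used by encode_document()
)

model = SentenceTransformer(modules=[router])
q = model.encode_query("what does a router module do?")
d = model.encode_document(["It routes queries and documents through different modules."])
```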
4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, and API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- An extensive blog post on training sparse models
What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!
Dataset Viewer for PDFs just landed on Hugging Face 🤗 You can now preview all the PDFs more easily than before!
On top of this, there's a PdfFolder format to load PDF datasets quicker 💨
> to use it, your dataset should follow a directory structure like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels, etc., you can keep them in a metadata.csv file in the same folder 🤗
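A minimal loading sketch, assuming the folder-based builder is exposed as "pdffolder" in line with the existing imagefolder/audiofolder convention; the builder name is an assumption, so check the datasets docs.

```python
# Hedged sketch: loading a PDF dataset laid out as folder/train/doc1.pdf, folder/train/doc2.pdf.
# The "pdffolder" builder name follows the imagefolder/audiofolder convention; verify in the docs.
from datasets import load_dataset

dataset = load_dataset("pdffolder", data_dir="folder")

print(dataset)          # DatasetDict with a "train" split
example = dataset["train"][0]
print(example.keys())   # the PDF column plus any metadata.csv columns (labels, bounding boxes, ...)
```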
We've merged the LightGlue keypoint matcher into Hugging Face transformers! It allows commercial use when paired with an open-source keypoint detector
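A rough usage sketch, assuming LightGlue is exposed through the standard Auto classes with a checkpoint such as ETH-CVG/lightglue_superpoint and a post_process_keypoint_matching helper like the earlier SuperGlue integration; every name here is an assumption, so check the model card for the exact API.

```python
# Hedged sketch: matching keypoints between two images with LightGlue in transformers.
# Checkpoint name, model class, and post-processing call are assumptions; see the model card.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

checkpoint = "ETH-CVG/lightglue_superpoint"  # assumed checkpoint name
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

images = [Image.open("view_a.jpg"), Image.open("view_b.jpg")]  # one image pair
inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into matched keypoint coordinates for the pair
image_sizes = [[(img.height, img.width) for img in images]]
matches = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
print(matches[0]["keypoints0"].shape, matches[0]["keypoints1"].shape)
```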
🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with fewer tokens, supports long documents and videos (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)
🗣️ Audio
> Google released google/magenta-realtime, real-time music generation & audio synthesis (cc-by-4.0)
> Kyutai released new speech-to-text models that come in 1B & 2B sizes (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay
Y'all have been asking my opinion on how OCR models compare to each other, so I will leave three apps by @prithivMLmods to compare the newest models instead ⤵️
> compare Nanonets-OCR-s, Qwen2-VL-OCR-2B-Instruct, RolmOCR, Aya-Vision: prithivMLmods/Multimodal-OCR
> SmolDocling, Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B: prithivMLmods/Multimodal-OCR2
> docscopeOCR, MonkeyOCR, coreOCR: prithivMLmods/core-OCR