AI & ML interests

Reproducible Open-Source Machine Learning 🙌🏻

satpalsr
posted an update 2 days ago
We're open-sourcing our infra along with a dataset of 10M+ frames!

We're releasing Stera, an open-source infra that turns an off-the-shelf device in your pocket into a high-fidelity multimodal data pipeline. It's built around four layers: Capture → Process → Evaluate → Export.

Stera Capture removes the need for bespoke or gated hardware and runs on an off-the-shelf iPhone. Out of the box, it fuses synchronized RGB, IMU, LiDAR-guided depth, and 6-DoF pose from ARKit and exports them to a raw MCAP file.
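
The post doesn't say how Stera aligns streams that tick at different rates, such as camera frames and IMU samples. One common approach is nearest-timestamp matching; the sketch below is a minimal, hypothetical illustration of that idea, not Stera's code. The function names `nearest_index` and `align_to_frames`, the `max_gap` tolerance and the 30 Hz / 100 Hz rates are all assumptions.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_index(timestamps: List[float], t: float) -> int:
    """Index of the sample in sorted, non-empty `timestamps` closest to `t`.

    Ties go to the earlier sample.
    """
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # t lies between timestamps[i - 1] and timestamps[i]; pick the closer one.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_to_frames(
    frame_ts: List[float], sensor_ts: List[float], max_gap: float
) -> List[Optional[int]]:
    """Hypothetical helper: for every RGB frame timestamp, return the index of
    the nearest sensor sample (e.g. IMU), or None if that sample is more than
    `max_gap` seconds away."""
    matches: List[Optional[int]] = []
    for t in frame_ts:
        j = nearest_index(sensor_ts, t)
        matches.append(j if abs(sensor_ts[j] - t) <= max_gap else None)
    return matches


if __name__ == "__main__":
    frames = [k / 30 for k in range(3)]  # assumed 30 Hz camera
    imu = [k * 0.01 for k in range(8)]   # assumed 100 Hz IMU
    print(align_to_frames(frames, imu, max_gap=0.005))  # [0, 3, 7]
```

Returning `None` instead of a far-away sample makes dropped or delayed sensor packets visible downstream rather than silently pairing a frame with stale data.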

Dataset: fpvlabs/stera-10m
Launch Details: https://x.com/fpv_labs/status/2055262652033908832
satpalsr
posted an update about 1 month ago
OpenAI is hiring SLAM engineers!
And open source shouldn't lag behind.

It's a hard but necessary problem to solve if we want generalisable robots in the real world.

We're digging deep into it first and will be open-sourcing our work in the next releases. Hope everyone is ready! Cheers to HF & more hugs.

Find us at https://x.com/fpv_labs/status/2042585804162371713
nouamanetazi
posted an update 7 months ago
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️
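
The post doesn't define which "efficiency" it means. One common way to put a number on how well a training run uses its GPUs is model FLOPs utilization (MFU), usually estimated with the rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token (forward plus backward, ignoring attention). The sketch below assumes that approximation; the function name `mfu` and the example throughput are hypothetical, and the peak must be the per-GPU dense peak for your precision (NVIDIA quotes roughly 989 TFLOP/s dense BF16 for an H100 SXM).

```python
def mfu(n_params: float, tokens_per_second: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Approximate model FLOPs utilization for dense-transformer training.

    Assumes ~6 * n_params FLOPs per training token (forward + backward),
    which ignores attention FLOPs and activation recomputation.
    """
    achieved = 6.0 * n_params * tokens_per_second
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak


if __name__ == "__main__":
    # Hypothetical numbers: a 3B-parameter model on 384 GPUs.
    H100_BF16_DENSE = 989e12  # FLOP/s per GPU, approximate
    print(f"MFU: {mfu(3e9, 8e6, 384, H100_BF16_DENSE):.1%}")
```

Because the 6N estimate leaves out attention and recomputation, this figure tends to understate the work the hardware actually does, so treat it as a comparative metric between runs rather than an absolute one.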

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
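
For the checkpointing question, a classic back-of-the-envelope starting point (not necessarily what the playbook recommends) is Young's first-order approximation: with a checkpoint write time C and a mean time between failures M, the interval that minimizes expected lost work is about sqrt(2·C·M). The sketch assumes blocking checkpoints; the function names and the numbers in the usage example are made up for illustration.

```python
import math


def young_interval(checkpoint_seconds: float, mtbf_seconds: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_seconds * mtbf_seconds)


def checkpoint_overhead(checkpoint_seconds: float, interval_seconds: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints, assuming each
    save blocks training for `checkpoint_seconds` after every
    `interval_seconds` of compute."""
    return checkpoint_seconds / (interval_seconds + checkpoint_seconds)


if __name__ == "__main__":
    c = 50.0    # hypothetical: 50 s to write one checkpoint
    m = 3600.0  # hypothetical: one failure per hour across the cluster
    t = young_interval(c, m)
    print(f"checkpoint every ~{t:.0f} s, overhead ~{checkpoint_overhead(c, t):.1%}")
    # checkpoint every ~600 s, overhead ~7.7%
```

Asynchronous checkpointing shrinks the blocking cost C, which both shortens the optimal interval and lowers the overhead.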

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and, crucially, the infrastructure layer that most teams get wrong.

We validated real vs. theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 14.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8×H100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
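
All-reduce figures like these are commonly reported as bus bandwidth. NVIDIA's nccl-tests convention divides the message size by the measured time to get algorithm bandwidth, then multiplies by 2(n-1)/n for all-reduce, where n is the number of ranks, so the result can be compared against link speeds independently of rank count. The post doesn't say which tool produced its numbers; the sketch below only applies that formula, and the function name and example measurement are hypothetical.

```python
from typing import Tuple


def allreduce_bandwidth(size_bytes: float, seconds: float,
                        n_ranks: int) -> Tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one all-reduce, following the
    nccl-tests convention busbw = algbw * 2 * (n - 1) / n."""
    algbw = size_bytes / seconds / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB all-reduced across 8 GPUs in 4 ms.
    alg, bus = allreduce_bandwidth(1e9, 4e-3, 8)
    print(f"algbw {alg:.0f} GB/s, busbw {bus:.0f} GB/s")
    # Share of single-node bandwidth kept at 16 nodes, using the
    # post's upper figure (350 of 480 GB/s).
    print(f"retained: {350 / 480:.0%}")
```

The 2(n-1)/n factor reflects that a ring all-reduce sends each byte across the slowest link almost twice (once in reduce-scatter, once in all-gather), which is why bus bandwidth, not algorithm bandwidth, is the number to hold against NVLink or InfiniBand specs.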

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

๐“๐ก๐ž ๐’๐ฆ๐จ๐ฅ ๐“๐ซ๐š๐ข๐ง๐ข๐ง๐  ๐๐ฅ๐š๐ฒ๐›๐จ๐จ๐ค: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the Hugging Face team
merve
posted an update 7 months ago
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane: it can parse charts and re-render them in HTML
> it concatenates CLIP and SAM features, which gives better grounding
> very efficient in performance per vision token
> covers 100 languages
merve
posted an update 8 months ago
large AI labs open-sourced a ton of models last week 🔥
here are a few picks; find even more here merve/sep-16-releases-68d13ea4c547f02f95842f05
> IBM released a new Docling model with 258M params based on Granite (Apache 2.0) ibm-granite/granite-docling-258M
> Xiaomi released a 7B audio LM with base and instruct variants (MIT) XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0
> DecartAI released Lucy Edit, an open Nano Banana 🍌 (non-commercial) decart-ai/Lucy-Edit-Dev
> OpenGVLab released a family of agentic computer-use models (3B/7B/32B) along with the dataset 💻 OpenGVLab/scalecua-68c912cf56f7ff4c8e034003
> Meituan LongCat released a thinking version of LongCat-Flash 💭 meituan-longcat/LongCat-Flash-Thinking
merve
posted an update 8 months ago
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face 🔥

> not only a document converter: it can also do document question answering and understands multiple languages 🤯
> best part: released under the Apache 2.0 license, so you can use it in your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M

model: ibm-granite/granite-docling-258M
demo: ibm-granite/granite-docling-258m-demo 💗
merve
posted an update 8 months ago
a ton of image/video generation models and LLMs from big labs 🔥

> Meta released facebook/mobilellm-r1-68c4597b104fac45f28f448e, smol LLMs for on-device use 💬
> Tencent released tencent/SRPO, a high-res image generation model, and tencent/POINTS-Reader, a cutting-edge OCR model
> ByteDance released bytedance-research/HuMo, video generation from any input ⏯️

find more models, datasets, demos here merve/sep-11-releases-68c7dbfa26bea8cd921fa0ac
merve
posted an update 8 months ago
fan-favorite vision LM Florence-2 is now officially supported in transformers 🤗

find all the models in the florence-community org 🫡
ariG23498
posted an update 8 months ago
New post is live!

This time we cover some major updates to transformers.

🤗
merve
posted an update 8 months ago
merve
posted an update 8 months ago
merve
posted an update 9 months ago
large AI labs dropped so many open models last week 🔥 don't miss out on them

→ Apple released on-device vision LMs apple/fastvlm-68ac97b9cd5cacefdd04872e & apple/mobileclip2-68ac947dcb035c54bcd20c47
→ OpenGVLab released InternVL3.5, 32 new vision LMs, one of them based on gpt-oss! (OS) OpenGVLab/internvl35-68ac87bd52ebe953485927fb
→ MSFT released a killer small TTS model (OS) microsoft/VibeVoice-1.5B

find more here https://huggingface.co/collections/merve/august-29-releases-68b5a3754cfb8abf59e2b486
merve
posted an update 9 months ago
first vision language model built on openai/gpt-oss-20b just dropped! 🔥

InternVL3.5 comes with 32 models 🤯 (pre-trained, fine-tuned and aligned) in various sizes OpenGVLab/internvl35-68ac87bd52ebe953485927fb
it uses gpt-oss or Qwen3 for the LLM part ⤵️
merve
posted an update 9 months ago
GPT-4.1-mini level model right on your iPhone 🤯

openbmb/MiniCPM-V-4 has only 4B parameters yet surpasses GPT-4.1-mini on vision benchmarks 🔥

allows commercial use as well!