✨ 20.3B / 3B active - MoE ✨ SOTA video understanding via 3D MRoPE + curriculum learning ✨ Real-time speech synthesis + dialect support ✨ Enhanced multimodal generation with ID & scene consistency
✨ Highly Customizable: Supports custom terms, domain prompts, and translation memory for accurate, context-aware results. ✨ Fast and affordable: $0.5 per million tokens.
Introducing Voxtral WebGPU: State-of-the-art audio transcription directly in your browser! 🤯 🗣️ Transcribe videos, meeting notes, songs and more 🔐 Runs on-device, meaning no data is sent to a server 🌎 Multilingual (8 languages) 🤗 Completely free (forever) & open source
That's right, we're running Mistral's new Voxtral-Mini-3B model 100% locally in-browser on WebGPU, powered by Transformers.js and ONNX Runtime Web! 🔥
Fast LoRA inference for Flux with Diffusers and PEFT 🚨
There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an integral part of how these models are adopted.
In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:
1. torch.compile
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping to avoid recompilation when swapping in new LoRAs 🤯
We have tested our recipe with Flux.1-Dev on both H100 and RTX 4090. We achieve at least a *2x speedup* on both GPUs. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served, so we hope it will be beneficial to the community 🤗
Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.
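To give a concrete idea of how hotswapping and torch.compile fit together, here is a minimal sketch with Diffusers (assuming a recent diffusers release with LoRA hotswap support; the LoRA repo IDs are placeholders and exact argument names may differ between versions):

```python
# Minimal sketch: compile once, then hotswap LoRAs without recompiling.
# Assumes a recent diffusers + peft install; LoRA repo IDs are placeholders.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Allow LoRAs to be swapped later without triggering a torch.compile recompile.
pipe.enable_lora_hotswap(target_rank=128)

# Load the first LoRA, then compile the transformer once.
pipe.load_lora_weights("some-user/first-flux-lora")  # placeholder repo ID
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

image = pipe("a photo of a cat", num_inference_steps=28).images[0]

# Hotswap a second LoRA into the same adapter slot: the compiled graph is reused.
pipe.load_lora_weights(
    "some-user/second-flux-lora",  # placeholder repo ID
    hotswap=True,
    adapter_name="default_0",
)
image = pipe("a photo of a dog", num_inference_steps=28).images[0]
```

The key point is that compilation happens once, after the first LoRA is loaded; later LoRAs are swapped into the same slot, so the compiled graph does not have to be rebuilt.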
🧑🍳 In this @huggingface Cookbook notebook, we demonstrate how to align a multimodal model (VLM) using Mixed Preference Optimization (MPO) with trl.
💡 This recipe is powered by the new MPO support in trl, enabled through a recent upgrade to the DPO trainer!
We align the multimodal model using multiple optimization objectives (losses), guided by a preference dataset (chosen vs. rejected multimodal pairs).
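As a rough illustration of what this looks like in code, here is a minimal sketch (assuming a recent trl release where DPOConfig accepts a list of loss types with loss_weights; the model and dataset IDs below are illustrative placeholders, not necessarily the ones used in the notebook):

```python
# Minimal sketch of MPO via trl's DPOTrainer: a weighted mix of losses
# applied to a preference dataset of chosen vs. rejected multimodal pairs.
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # placeholder VLM
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder preference dataset, already formatted with images + prompt/chosen/rejected.
dataset = load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train[:1%]")

training_args = DPOConfig(
    output_dir="vlm-mpo",
    # MPO mixes several objectives: preference (sigmoid), quality (bco_pair), generation (sft).
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()
```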
Many VLMs claim to process hours of video. But can they follow the story?🤔 Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳
We test three skills that matter for real-world use: 🔎 Localized Retrieval: Find a specific action. 🧩 Information Synthesis: Piece together scattered clues. 🏃 Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).
The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos. Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking points—now the community can start fixing them.📈
Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.
✨ 480B total, 35B activated MoE ✨ Agentic Coding + Browser Use → Top code model performance ✨ 256K context (up to 1M via YaRN) for repo-scale understanding
⚡️ In this new @huggingface Cookbook recipe, I walk you through the process of fine-tuning a Visual Language Model (VLM) for Object Detection with Visual Grounding, using TRL.
🔍 Object detection typically involves detecting categories in images (e.g., vase).
By combining it with visual grounding, we add contextual understanding, so instead of detecting just "vase", we can detect the "middle vase" in an image.
VLMs are super powerful!
In this case, I use PaliGemma 2, which already supports object detection, and extend it to also handle visual grounding.
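For a rough idea of what grounded detection looks like at inference time, here is a minimal sketch (this is not the fine-tuning code from the recipe; the prompt follows PaliGemma's "detect ..." convention, and the image path is a placeholder):

```python
# Minimal sketch: grounded detection with PaliGemma 2 ("detect <phrase>" prompts
# return <loc....> tokens that encode a bounding box). Image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("vases.jpg")          # placeholder image
prompt = "detect middle vase"            # grounded phrase instead of a bare category

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.bfloat16)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

# The decoded text contains <loc....> tokens with the box for "middle vase".
print(processor.decode(output[0], skip_special_tokens=False))
```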
🤔 Why this matters: When we use "free" online AI services, we're often the product. Our conversations become training data, our personal stories get "cooked into" models, and our privacy becomes a commodity. But there's an alternative path forward.
💡 The power shift is real: Local LLMs aren't just about privacy; they're about redistributing AI power away from a handful of tech giants. When individuals, organizations, and even entire nations can run their own models, we're democratizing access to AI capabilities.
🤗 At Hugging Face, we're proud to be at the center of this transformation. Our platform hosts the world's largest library of freely downloadable models, making cutting-edge AI accessible to everyone -- from researchers and developers to curious individuals who want to experiment on their laptops or even smartphones.
The technical barriers that once required $$$ server racks are crumbling. Today, anyone with basic computer skills can download a model, run it locally, and maintain complete control over their AI interactions. No sudden algorithm changes, no data harvesting, no corporate gatekeeping.
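To make that concrete, here is a minimal sketch of running an open model entirely on your own machine (the model ID is just one example of a model small enough for a laptop):

```python
# Minimal sketch: a local chat with a small open model via the transformers pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

messages = [{"role": "user", "content": "Explain what a LoRA adapter is in one sentence."}]
result = generator(messages, max_new_tokens=64)

# The reply never leaves your machine: no server, no data collection.
print(result[0]["generated_text"][-1]["content"])
```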
This is not just about technical convenience, but about technological sovereignty. When AI power is concentrated in a few hands, we risk creating new forms of digital dependency. Local models offer a path toward genuine AI literacy and independence.
🚀 The future of AI should be open, accessible, and in the hands of the many, not the few. What are your thoughts on AI democratization? Have you experimented with local models yet?
✨ instruction/reinforcement learning/reward model ✨ Supports 28 languages, bidirectional translation ✨ Optimized for deployment & inference: 7B with Mistral architecture ✨ Excels across domains: science, law, finance, literature & more