Kimi K2 tech report is full of gems as always. Here are my notes on it:
> MuonClip: Pretty crazy how after 70k steps the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale but with an aggressive threshold). Appendix E also has a cool explanation of why Muon makes the logits explode (tl;dr: Muon pushes the singular values of the update matrix higher); see the sketch after these architecture notes.
> Sparsity scaling laws to justify their ratio. They have a very solid training infra that allows the model to be trained at this sparsity level; they could have increased it even more, but training becomes less efficient as sparsity increases.
> They reduce the number of attention heads to make the model more efficient for long context, since attention heads are a big bottleneck there. They also remove 2 of the 3 "first dense" layers of the DeepSeek-V3 arch.
With the sparsity and the attention heads (divided by 2), they achieve 83% increased FLOPs compared to the DeepSeek-V3 arch at 128k context.
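For intuition, here is a minimal sketch of the QK-clip idea as I read it from the report: after an optimizer step, if a head's maximum attention logit exceeds a threshold tau, that head's query and key projections are rescaled so the logits fall back under tau. The function below is my paraphrase (the names, the alpha split, and the in-place rescaling are assumptions), not the paper's code.

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float,
             tau: float = 100.0, alpha: float = 0.5) -> None:
    """Illustrative QK-clip step for a single attention head.

    w_q, w_k: the head's query/key projection weights.
    max_logit: largest pre-softmax attention logit observed for this head.
    If max_logit > tau, shrink Q and K so recomputed logits stay <= tau.
    """
    if max_logit <= tau:
        return  # clip is inactive; reportedly the common case after ~70k steps
    gamma = tau / max_logit            # shrink factor in (0, 1)
    w_q.mul_(gamma ** alpha)           # split the shrinkage between Q and K...
    w_k.mul_(gamma ** (1.0 - alpha))   # ...so the logits (q @ k^T) scale by gamma overall
```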
> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus in different styles; for longer documents they rephrase chunk by chunk (a minimal sketch right below). I'm (half) surprised that ONLY 1 epoch over data rephrased 10 times (assuming the same total number of training tokens, I think?) gives better accuracy than 10 epochs over the same data rephrased once.
> They do rewriting for Math and Knowledge; for Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal tooling/evals to test it; as always, it's still a bit unclear to me how to properly measure that.
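A minimal sketch of what chunk-by-chunk rephrasing of a long document could look like; the chunk size and the rephrase callable are hypothetical placeholders of mine, not the report's pipeline:

```python
from typing import Callable

def rephrase_long_document(text: str,
                           rephrase: Callable[[str, str], str],
                           chunk_chars: int = 8000) -> str:
    """Rephrase a long document chunk by chunk, then stitch the rewrites back together.

    `rephrase(chunk, previous)` stands in for whatever LLM call does the rewriting (hypothetical);
    passing the previous rewritten chunk helps keep style and continuity across boundaries.
    """
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    rewritten: list[str] = []
    for chunk in chunks:
        previous = rewritten[-1] if rewritten else ""
        rewritten.append(rephrase(chunk, previous))
    return "\n".join(rewritten)
```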
The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 computation, but FP8 storage for specific layers; selective recomputation for inexpensive blocks (small example below); activation offloading to CPU
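As a small illustration of "selective recomputation for inexpensive blocks", here is how that pattern looks with PyTorch activation checkpointing; this is my own example, not their infra code, and the `cheap` flag is a placeholder for however they decide which blocks to recompute:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual block; `cheap` marks blocks whose activations are recomputed instead of stored."""

    def __init__(self, dim: int, cheap: bool):
        super().__init__()
        self.cheap = cheap
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.cheap and self.training:
            # Cheap to recompute: drop this block's activations and redo the forward during backward.
            return x + checkpoint(self.ffn, x, use_reentrant=False)
        # Expensive blocks keep their activations as usual.
        return x + self.ffn(x)
```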
Reacted to danieldk's post (about 1 month ago):
We have been working on a project called kernels. kernels makes it possible to load compute kernels directly from the Hub!
We plan to give kernels a more proper introduction soon. But for those who have been following along, we are happy to announce a new release:
- New layer API with torch.compile support.
- Experimental support for loading Apple Silicon Metal kernels.
- Generate wheels from Hub kernels for legacy deployments.
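For context, here's a minimal sketch of what loading a kernel from the Hub looks like, based on the project's README; the kernels-community/activation repo and its gelu_fast entry point are assumptions on my part and may differ between releases:

```python
import torch
from kernels import get_kernel

# Download an optimized kernel from the Hugging Face Hub (assumed repo name).
activation = get_kernel("kernels-community/activation")

x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
activation.gelu_fast(out, x)  # assumed entry point of this particular kernel
print(out)
```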
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on the average of its benchmarks, based on Qwen/Qwen2.5-Omni-3B
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below), and it's super competitive; based on Qwen/Qwen2.5-Omni-7B
ZeroGPU medium size is now available as a power-user feature
Nothing too fancy for now (ZeroGPU Spaces still default to large, 70GB VRAM), but this paves the way for:
- size-based quotas / pricing (medium will offer significantly more usage than large)
- the upcoming xlarge size (141GB VRAM)
You can as of now control GPU size via a Space variable. Accepted values:
- auto (future default)
- medium
- large (current default)
The auto mode checks total CUDA tensor size during startup:
- More than 30GB → large
- Otherwise → medium
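As a rough illustration of that heuristic (not the actual ZeroGPU implementation; the function name and the way tensors are enumerated are mine, only the 30GB cutoff comes from the post):

```python
import torch

def pick_zerogpu_size(module: torch.nn.Module, cutoff_gb: float = 30.0) -> str:
    """Illustrative stand-in for the auto mode: sum CUDA tensor bytes, then pick a size."""
    total_bytes = sum(
        t.numel() * t.element_size()
        for t in list(module.parameters()) + list(module.buffers())
        if t.is_cuda
    )
    return "large" if total_bytes > cutoff_gb * 1024**3 else "medium"
```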
We are thrilled to announce Jamba, the world's first production-grade Mamba-based model.
Key Features:
- First production-grade Mamba-based model built on a novel SSM-Transformer hybrid architecture
- 3X throughput on long contexts compared to Mixtral 8x7B
- Democratizes access to a massive 256K context window
- The only model in its size class that fits up to 140K context on a single GPU
Jamba is based on a novel architecture that combines Mamba and Transformer. While our initial results show great efficiency gains, we expect this to be further explored and improved with the help of the community.
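A minimal sketch of trying the model with transformers, assuming the ai21labs/Jamba-v0.1 checkpoint on the Hub and a transformers release with Jamba support; the prompt and generation settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("A hybrid SSM-Transformer architecture means", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```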
Hey, I'll be presenting @retrain-pipelines and almighty function-calling at the Hugging Face Paris HQ, you guys. Monday evening. Lightning-talk style. With AI Tinkerers.
How would you like to build smart GenAI infrastructure? Give your edge agentic system an extensive tool memory, and optimize the resources it takes to run a high-performance set of agents?
We came up with a novel approach to function-calling at scale for smart companies and corporate-grade use-cases.
Introducing Unsloth Dynamic v2.0 GGUFs! Our v2.0 quants set new benchmarks on 5-shot MMLU and KL Divergence, meaning you can now run & fine-tune quantized LLMs while preserving as much accuracy as possible.
We made selective layer quantization much smarter. Instead of modifying only a subset of layers, we now dynamically quantize all layers, so every layer gets a different bit width. Our dynamic method can now be applied to all LLM architectures, not just MoEs.
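Purely as an illustration of the "different bit width per layer" idea, not Unsloth's actual recipe (the sensitivity scores, thresholds, and bit choices below are made up):

```python
def assign_bits(sensitivity_by_layer: dict[str, float]) -> dict[str, int]:
    """Toy per-layer bit assignment: more sensitive layers keep more bits (illustrative thresholds)."""
    bits = {}
    for layer, sensitivity in sensitivity_by_layer.items():
        if sensitivity > 0.8:
            bits[layer] = 8   # near-lossless for the most sensitive layers
        elif sensitivity > 0.4:
            bits[layer] = 6
        elif sensitivity > 0.2:
            bits[layer] = 4
        else:
            bits[layer] = 3   # aggressively quantize the least sensitive layers
    return bits

# Made-up sensitivities, just to show the shape of the output:
print(assign_bits({"embed_tokens": 0.9, "layers.0.mlp": 0.5, "layers.10.mlp": 0.1}))
```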
All our future GGUF uploads will leverage Dynamic 2.0 and our hand-curated 300K–1.5M token calibration dataset to improve conversational chat performance.
For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision, Dynamic v2.0, QAT, and standard iMatrix quants.
Dynamic v2.0 aims to minimize the performance gap between full-precision models and their quantized counterparts.
retrain-pipelines 0.1.2 finally dropped. It comes with a hot Hugging Face Hub integration. Go check it out. We have 2 articles about it coming up; one is already fully written, so be on the lookout! @retrain-pipelines
Also, I'll be volunteering at GOSIM AI Paris 2025. If you're interested in chatting, hmu.
What does it mean when models share the same bytes?
We've investigated some quants and seen that quantizations of the same model share a considerable portion of their bytes, which can be deduplicated to save quantizers a lot of upload time on the Hub.
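As a back-of-the-envelope way to see this kind of overlap between two files (the Xet storage layer uses content-defined chunking; the fixed-size chunks here are a simplification of mine), you can hash chunks and count how many they share:

```python
import hashlib

def chunk_hashes(path: str, chunk_size: int = 64 * 1024) -> set[bytes]:
    """Hash fixed-size chunks of a file (a simplification of content-defined chunking)."""
    hashes = set()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hashes.add(hashlib.sha256(chunk).digest())
    return hashes

def shared_fraction(path_a: str, path_b: str) -> float:
    """Fraction of file A's chunks that also appear somewhere in file B."""
    a, b = chunk_hashes(path_a), chunk_hashes(path_b)
    return len(a & b) / max(len(a), 1)
```

Fixed-size chunking misses overlaps that are shifted by a few bytes, which is exactly why content-defined chunking is used in practice.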
Since going into production, the xet-team has migrated hundreds of repositories on the Hub to our storage layer, including classic "pre-Hub" open-source models like FacebookAI/xlm-roberta-large (XLM-R) from FacebookAI.
XLM-R, introduced in 2019, set new benchmarks for multilingual NLP by learning shared representations across 100 languages. It was then fine-tuned on English, Spanish, Dutch, and German, generating language-specific derivations for each; check out the paper: Unsupervised Cross-lingual Representation Learning at Scale (arXiv:1911.02116).
These finetunes share much of the same architecture and layout as XLM-R with similar training methods and goals. It makes sense that they would share bytes, but it's still fascinating to see.
We put together a similar space to explore these models and see where they overlap; check it out for yourself: xet-team/finetune-dedupe
The darker a block in the heatmap, the more bytes are shared. Clicking on a repo's blocks shows all other repos that share those blocks.
If you've been following along with the Xet Team's (xet-team) work, you know we've been working to migrate the Hugging Face Hub from Git LFS to Xet.
Recently, we launched a waitlist to join the movement to Xet (join here: https://huggingface.co/join/xet), but getting to this point was a journey.
From the initial proof of concept in August, to launching on the Hub internally, to migrating a set of repositories and routing a small chunk of the Hub's download traffic through our infrastructure, every step of the way has been full of challenges, big and small, and well worth the effort.
Over the past few weeks, with real traffic flowing through our services, we've tackled some truly gnarly issues (unusual upload/download patterns, memory leaks, load imbalances, and more) and resolved each without major disruptions.
If you're curious about how this sliver of Hub infrastructure looked as we routed traffic through it for the first time (and want a deep dive full of Grafana and Kibana charts), I have a post for you.
Here's an inside look into the day of our first migrations and the weeks following, where we pieced together solutions in real time.