gg-hf-gu

community
Activity Feed

alvarobartt
posted an update 3 days ago
The latest hf-mem release adds a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: Reasoning about MoEs from active parameters alone can be misleading. Each token only activates a subset of experts, but the serving setup still has to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights on every forward pass, while MoEs load many experts but route each token to only a few of them
⚡ Active parameter count isn't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving

Check the repository at https://github.com/alvarobartt/hf-mem
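
As a rough illustration of the resident-versus-active distinction in this post, the following sketch computes both figures from a handful of architecture numbers. It is not hf-mem's implementation: `MoEConfig`, its field names, and the example values are all assumptions made for illustration, and the KV cache is left out.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical description of an MoE model (not hf-mem's schema)."""
    base_params: int        # attention, embeddings, routers, shared layers
    params_per_expert: int  # parameters of one routed expert in one layer
    num_experts: int        # routed experts per MoE layer
    experts_per_token: int  # experts each token is routed to (top-k)
    num_moe_layers: int
    bytes_per_param: int    # e.g. 2 for bf16/fp16


def moe_weight_memory(cfg: MoEConfig) -> dict:
    """Split weight memory into resident (loaded) and active (per-token) bytes."""
    per_expert_slot = cfg.params_per_expert * cfg.num_moe_layers
    all_experts = per_expert_slot * cfg.num_experts
    routed_experts = per_expert_slot * cfg.experts_per_token
    return {
        "base_bytes": cfg.base_params * cfg.bytes_per_param,
        "expert_bytes": all_experts * cfg.bytes_per_param,
        "resident_bytes": (cfg.base_params + all_experts) * cfg.bytes_per_param,
        "active_bytes": (cfg.base_params + routed_experts) * cfg.bytes_per_param,
    }


# Made-up numbers, chosen only to show the gap between the two totals.
example = MoEConfig(
    base_params=2_000_000_000,
    params_per_expert=100_000_000,
    num_experts=64,
    experts_per_token=4,
    num_moe_layers=24,
    bytes_per_param=2,
)
memory = moe_weight_memory(example)
print(f"resident: {memory['resident_bytes'] / 1e9:.1f} GB")
print(f"active:   {memory['active_bytes'] / 1e9:.1f} GB")
```

With these made-up numbers, the accelerators have to hold roughly 13 times more weight memory than a single token touches, which is why the active parameter count alone is a poor sizing guide.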

alvarobartt
posted an update 3 months ago
Learn how to deploy Microsoft Research VibeVoice ASR on Microsoft Azure Foundry with Hugging Face to generate rich audio transcriptions with Who, When, and What! 💥

> 🕒 60-minute single-pass processing, no chunking or stitching
> 👤 Customized hotwords to guide recognition on domain-specific content
> 📝 Rich transcription: joint ASR + diarization + timestamping in one pass
> 🌍 50+ languages with automatic detection and code-switching support
> 🤗 Deployed on Microsoft Foundry via an OpenAI-compatible Chat Completions API

https://huggingface.co/docs/microsoft-azure/foundry/examples/deploy-vibevoice-asr

alvarobartt
posted an update 4 months ago
💥 hf-mem v0.4.1 now also estimates KV cache memory requirements for any context length and batch size with the --experimental flag!

uvx hf-mem --model-id ... --experimental will automatically pull the required information from the Hugging Face Hub to include the KV cache estimation, when applicable.

💡 Alternatively, you can also set the --max-model-len, --batch-size and --kv-cache-dtype arguments (à la vLLM) manually if preferred.
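
For standard multi-head or grouped-query attention, a common back-of-the-envelope KV cache estimate multiplies two tensors (keys and values) by the number of layers, KV heads, head dimension, sequence length, batch size, and bytes per element. The sketch below assumes that formula. It is not taken from hf-mem's source, the function and parameter names are hypothetical, and architectures with sliding-window or compressed attention would need a different calculation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_model_len, batch_size, dtype_bytes=2):
    """Estimate KV cache size in bytes for plain MHA/GQA attention."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * max_model_len * batch_size * dtype_bytes)


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
for batch_size in (1, 4):
    size = kv_cache_bytes(32, 8, 128, max_model_len=8192,
                          batch_size=batch_size, dtype_bytes=2)
    print(f"batch_size={batch_size}: {size / 1024**3:.1f} GiB")
```

The estimate grows linearly with both context length and batch size, which is why those two arguments matter most when sizing the cache.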

pcuenq
posted an update 5 months ago
👉 What happened in AI in 2025? 👈

We prepared the 2025 version of the HF AI Timeline Grid, highlighting open vs API-based model releases, and allowing you to browse and filter by access, modality, and release type!

Play with it here:
2025-ai-timeline/2025-ai-timeline

Here's my personal quarterly TL;DR:

1️⃣ Q1 β€” Learning to Reason
Deepseek not only releases a top-notch reasoning model, but shows how to train them and compete with closed frontier models. OpenAI debuts Deep Research.

Significant milestones: DeepSeek R1 & R1-Zero, Qwen 2.5 VL, OpenAI Deep Research, Gemini 2.5 Pro (experimental)

2️⃣ Q2 β€” Multimodality and Coding
More LLMs embrace multimodality by default, and there's a surge in coding agents. Strong vision, audio, and generative models emerge.

Significant milestones: Llama 4, Qwen 3, Imagen 4, OpenAI Codex, Google Jules, Claude 4

3️⃣ Q3 β€” "Gold" rush, OpenAI opens up, the community goes bananas
Flagship models get gold in Math olympiads and hard benchmarks. OpenAI releases strong open source models and Google releases the much anticipated nano-banana for image generation and editing. Agentic workflows become commonplace.

Significant milestones: Gemini and OpenAI IMO Gold, gpt-oss, Gemini 2.5 Flash Image, Grok 4, Claude Sonnet 4.5

4️⃣ Q4 β€” Mistral returns, leaderboard hill-climbing
Mistral is back with updated model families. All labs release impressive models to wrap up the year!

Significant milestones: Claude Opus 4.5, DeepSeek Math V2, FLUX 2, GPT 5.1, Kimi K2 Thinking, Nano Banana Pro, GLM 4.7, Gemini 3, Mistral 3, MiniMax M2.1 🤯

Credits
🙏 NHLOCAL for the source data https://github.com/NHLOCAL/AiTimeline

🫡 @reach-vb for the original idea, design and recipe

🙌 @ariG23498 and yours truly for compiling and verifying the 2025 edition

🥳 Here's to 2026, wishing it becomes the best year ever for open releases and on-device-first use-cases! 🥂

ngxson
posted an update about 1 year ago
A comprehensive matrix for which format you should use.

Read more on my blog post: https://huggingface.co/blog/ngxson/common-ai-model-formats

| Hardware      | GGUF      | PyTorch             | Safetensors            | ONNX |
|---------------|-----------|---------------------|------------------------|------|
| CPU           | ✅ (best) | 🟡                  | 🟡                     | ✅   |
| GPU           | ✅        | ✅                  | ✅                     | ✅   |
| Mobile        | ✅        | 🟡 (via executorch) | ❌                     | ✅   |
| Apple silicon | ✅        | 🟡                  | ✅ (via MLX framework) | ✅   |

alvarobartt
posted an update about 1 year ago
🔥 Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents designed to handle complex interactions across virtual and real environments; and it's MIT licensed!

Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn spatial-temporal grounding and planning
- Strong generalization and the ability to be fine-tuned for other agentic tasks
- SOTA on different multi-modal benchmarks spanning UI navigation, robotics manipulation, image / video understanding, and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)

ngxson
posted an update over 1 year ago
Check out my collection of pre-made GGUF LoRA adapters!

This allows you to use both the normal and abliterated versions of popular models like Llama, Qwen, etc., without having to double the amount of VRAM used.

ngxson/gguf_lora_collection
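
The VRAM saving comes from keeping one copy of the base weights and swapping in a small adapter. Assuming the standard LoRA parameterization, where a weight matrix of shape d_out x d_in is adjusted by the product of two rank-r matrices, the arithmetic looks like the sketch below. The dimensions and rank are hypothetical and do not describe any specific adapter in the collection.

```python
def full_matrix_params(d_in, d_out):
    """Parameters in one dense weight matrix."""
    return d_in * d_out


def lora_adapter_params(d_in, d_out, rank):
    """Parameters in a LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical projection size and rank, for illustration only.
d_in = d_out = 4096
rank = 16

full = full_matrix_params(d_in, d_out)
adapter = lora_adapter_params(d_in, d_out, rank)
print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {adapter:,} params ({adapter / full:.2%} of the full matrix)")
```

Loading a second full model would add another copy of every matrix, whereas the adapter adds only a small fraction per adapted matrix.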

alvarobartt
posted an update almost 2 years ago
🤗 Serving Meta Llama 3.1 405B on Google Cloud is now possible via the Hugging Face Deep Learning Containers (DLCs) for Text Generation Inference (TGI)

In this post, we showcase how to deploy https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on an A3 instance with 8 x H100 GPUs on Vertex AI

Thanks to the Hugging Face DLCs for TGI and Google Cloud Vertex AI, deploying a high-performance text generation container for serving Large Language Models (LLMs) has never been easier. And we're not going to stop here – stay tuned as we enable more experiences to build AI with open models on Google Cloud!

Read the full post at https://huggingface.co/blog/llama31-on-vertex-ai

alvarobartt
posted an update about 2 years ago
🔥 Prometheus 2 was recently released by Kaist AI as an alternative evaluator that closely mirrors both human and GPT-4 evaluation and surpasses the former Prometheus!

prometheus-eval/prometheus-7b-v2.0
prometheus-eval/prometheus-8x7b-v2.0

🌬️Fine-tuned on top of mistralai/Mistral-7B-Instruct-v0.2 and mistralai/Mixtral-8x7B-Instruct-v0.1
πŸ—‚οΈThe datasets used for fine-tuning have been publicly released i.e. prometheus-eval/Feedback-Collection and prometheus-eval/Preference-Collection
🀝🏻Unified LM evaluator for absolute (a single prompt-completion pair) and relative (two completions for a given prompt) due to model merging
❌No longer needs a mandatory reference / golden answer, but can still be provided optionally
πŸ”Surpasses the former version of Prometheus, and has a high correlation with human, GPT-4, and Claude 3 Opus scores when evaluating LMs
πŸ“Apache 2.0 license

Long story short, an amazing job from Kaist AI, bridging the gap with an LLM evaluator that doesn't rely on proprietary or bigger models!

This week at Argilla, we decided to add a new task to use Prometheus 2 as an LLM evaluator using distilabel, so we implemented PrometheusEval.

😱 Using PrometheusEval running their 7B variant with vLLM on a single L40 on top of HuggingFaceH4/instruction-dataset, we got the 327 existing prompt-completion pairs evaluated and pushed to the Hub in less than 2 minutes!

Find the generated dataset and the code at distilabel-internal-testing/instruction-dataset-prometheus

pcuenq
posted an update about 2 years ago
OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes varying between 270M and 3B parameters. Their quality, according to benchmarks, is similar to OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.

I converted these models to Core ML, for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models were uploaded to this community in the Hub for anyone who wants to integrate them into their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M on my M1 Max, and 6.5 with the largest 3B model. These speeds could be improved by converting to float16. However, there's some precision loss somewhere and generation doesn't work in float16 mode yet. I'm looking into this and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95

I'm also looking at optimizing inference using an experimental KV cache in swift-transformers. It's a bit tricky because the layers have a varying number of attention heads, but I'm curious to see how much this feature can accelerate performance in this model family :)
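
One way to picture the difficulty: with a fixed head count, the cache for all layers can live in a single stacked tensor, but with layer-wise scaling each layer needs its own shape. The sketch below uses made-up head counts (not OpenELM's actual configuration) and plain Python to list the per-layer shapes. It is only an illustration, not how swift-transformers implements its cache.

```python
from math import prod


def kv_cache_shapes(kv_heads_per_layer, head_dim, seq_len):
    """Return one (keys+values, heads, seq_len, head_dim) shape per layer."""
    return [(2, heads, seq_len, head_dim) for heads in kv_heads_per_layer]


def total_elements(shapes):
    """Total number of cached elements across all layers."""
    return sum(prod(shape) for shape in shapes)


# Hypothetical head counts that grow with depth; seq_len matches the fixed
# sequence length of 128 mentioned for the conversion.
kv_heads_per_layer = [3, 3, 4, 4, 5]
shapes = kv_cache_shapes(kv_heads_per_layer, head_dim=64, seq_len=128)

for layer, shape in enumerate(shapes):
    print(f"layer {layer}: {shape}")
print("total elements:", total_elements(shapes))
```

Because the shapes differ, the cache has to be stored as a list of per-layer buffers (or padded to the largest head count) rather than as one uniform array.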

Regarding the instruct fine-tuned models, I don't know the chat template that was used. The models use the Llama 2 tokenizer, but neither the Llama 2 chat template nor the default Alignment Handbook one that was used to train is recognized. Any ideas on this are welcome!

alvarobartt
posted an update about 2 years ago
🦫 We have just released argilla/Capybara-Preferences in collaboration with Kaist AI (@JW17, @nlee-208) and Hugging Face (@lewtun)

A new synthetic preference dataset built using distilabel on top of the awesome LDJnr/Capybara from @LDJnr

The current dataset combines the already generated alternative completions from argilla/distilabel-capybara-dpo-7k-binarized, while also adding the remaining ones using the same approach!

Here are some key features on how we built it:

- 🧹 Duplicate removal, keeping the conversation except for the last assistant response, and some slight pre-processing

- 🤖 Generation of alternative completions for the existing conversations (last turn only) with: mlabonne/NeuralBeagle14-7B, argilla/notus-7b-v1, and teknium/OpenHermes-2.5-Mistral-7B

- 👨🏻‍🏫 Running UltraFeedback via GPT-4 to generate the critique, i.e. ratings and rationales, for the last assistant responses

- 🎉 Finally, we selected the chosen and rejected responses based on their UltraFeedback score, and applied some slight post-processing!
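
The final selection step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the input schema, the `to_preference_pair` helper, and the rule of pairing the highest-rated completion with the lowest-rated one are all hypothetical, and the real dataset may pick the rejected response differently.

```python
def to_preference_pair(prompt, completions):
    """Build a chosen/rejected pair from rated completions.

    `completions` is a list of {"text": str, "rating": float} dicts. This
    schema is an assumption for the example, not distilabel's output format.
    """
    if len(completions) < 2:
        return None
    ranked = sorted(completions, key=lambda c: c["rating"], reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best["rating"] == worst["rating"]:
        return None  # a tie carries no preference signal
    return {"prompt": prompt, "chosen": best["text"], "rejected": worst["text"]}


# Made-up completions and ratings.
rated = [
    {"text": "A large semi-aquatic rodent.", "rating": 4.5},
    {"text": "The largest living rodent, native to South America.", "rating": 5.0},
    {"text": "A kind of bird.", "rating": 1.0},
]
pair = to_preference_pair("What is a capybara?", rated)
print(pair)
```

Dropping ties and single-completion rows keeps only examples where the ratings actually express a preference.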

Sounds simple, right? Start building your own synthetic datasets with https://github.com/argilla-io/distilabel already!