Why Not α-Entmax? — A Learnable Sparse Alternative to Softmax in Attention
Hello Hugging Face Team,
First of all, thank you so much for your continued efforts in advancing open-source large language models and making cutting-edge tools accessible to everyone.
I'd like to humbly propose an idea that could potentially improve interpretability and efficiency in attention mechanisms: using α-Entmax as a learnable replacement for softmax.
🔍 Motivation
Softmax has long been the default for attention normalization, but it produces dense distributions in which every token receives non-zero weight. This can hurt interpretability and spends attention (and downstream computation) on tokens that contribute little.
On the other hand, sparse alternatives—such as sparsemax and entmax—allow for zeroing out irrelevant tokens while maintaining differentiability.
🌱 What is α-Entmax?
- It generalizes softmax via a parameter α:
  - α = 1 → softmax
  - α = 2 → sparsemax
  - α ∈ (1, 2) → differentiable sparse distributions in between
- A learnable α allows the model to control its sparsity dynamically—potentially per head, layer, or input.
This flexibility could enable the model to:
- Suppress noise from irrelevant tokens
- Improve long-range dependency modeling
- Introduce structured sparsity without losing differentiability
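To make the α regimes above concrete, here is a minimal toy comparison of the three on a single score vector. This is just a sketch, assuming the `entmax` PyPI package (`pip install entmax`) from the original entmax authors:

```python
# Toy comparison of softmax vs. entmax-1.5 vs. sparsemax on one score vector.
# Assumes the `entmax` package (pip install entmax).
import torch
from entmax import entmax15, sparsemax

scores = torch.tensor([2.0, 1.0, 0.5, -1.0])

print(torch.softmax(scores, dim=-1))  # dense: every entry strictly positive
print(entmax15(scores, dim=-1))       # α = 1.5: low-scoring entries become exact zeros
print(sparsemax(scores, dim=-1))      # α = 2: typically the sparsest of the three
```

All three outputs sum to 1 and are differentiable, but only the entmax family can assign exactly zero probability to low-scoring tokens.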
🧠 Inspiration: Interleaving Sparse and Dense (like iRoPE)
Inspired by ideas such as iRoPE in Llama 4 (interleaving RoPE layers with NoPE layers), a similar pattern might be applied to attention normalization:
- Some layers/heads use softmax (for stability)
- Others use entmax (for sparsity and interpretability)
- Or even better: a learnable α across heads or layers, chosen by the model (sketched below)
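For the learnable per-head α variant, here is a rough sketch of what such an attention block could look like. This is illustrative only: the module and parameter names (`AlphaEntmaxAttention`, `alpha_logit`) are mine, and it assumes the `entmax` package's `entmax_bisect`, which accepts a tensor-valued α and is differentiable with respect to it:

```python
import torch
import torch.nn as nn
from entmax import entmax_bisect

class AlphaEntmaxAttention(nn.Module):
    """Multi-head self-attention with a learnable α per head (masking omitted for brevity)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One unconstrained scalar per head, mapped into (1, 2) in forward();
        # sigmoid(0) = 0.5, so every head starts at α = 1.5.
        self.alpha_logit = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each of q, k, v to (B, n_heads, T, d_head).
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (B, H, T, T)
        # Per-head α in (1, 2), broadcast over batch and query positions.
        alpha = 1.0 + torch.sigmoid(self.alpha_logit).view(1, self.n_heads, 1, 1)
        attn = entmax_bisect(scores, alpha=alpha, dim=-1)  # rows sum to 1, may hold exact zeros
        return self.out((attn @ v).transpose(1, 2).reshape(B, T, D))
```

Constraining α to (1, 2) keeps each head between softmax-like (dense) and sparsemax-like (maximally sparse) behavior, and gradients flow into `alpha_logit` like any other parameter, so each head can settle on its own level of sparsity during training.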
🧪 What I’m Exploring
I’m currently experimenting with:
- Replacing softmax in attention layers with α-Entmax
- Making α learnable (initialized near 1.5) for dynamic sparsity
- Comparing output sparsity and evaluation metrics on public datasets
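As a simple first probe of output sparsity, one can count exact zeros in the attention weights: softmax produces none by construction, while entmax variants typically produce many. A minimal sketch (shapes are illustrative):

```python
# Fraction of exactly-zero attention weights: softmax vs. entmax-1.5.
# Assumes the `entmax` package (pip install entmax).
import torch
from entmax import entmax15

torch.manual_seed(0)
scores = torch.randn(8, 128, 128)  # (heads, queries, keys) — illustrative sizes

p_soft = torch.softmax(scores, dim=-1)
p_ent = entmax15(scores, dim=-1)

def zero_fraction(p: torch.Tensor) -> float:
    return (p == 0).float().mean().item()

print(f"softmax:    {zero_fraction(p_soft):.3f}")  # 0.000 by construction
print(f"entmax-1.5: {zero_fraction(p_ent):.3f}")   # typically well above zero
```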
🙏 Request for Feedback
I plan to keep exploring the feasibility of replacing softmax with α-Entmax in attention layers. However, as an individual without access to large-scale compute, I'm unsure of the best way to contribute this idea formally.
I'm currently considering fine-tuning existing models with softmax replaced by α-Entmax, then evaluating them on public benchmarks against their softmax baselines.
Any suggestions, concerns, or guidance from the Hugging Face community would be greatly appreciated!
Thank you very much for your time and consideration!