Apply for community grant: Academic project (gpu)
I would like to use Hugging Face compute to pre-train and publish a Qwen3-style 8B model built on my WERSA attention mechanism, bringing a fast and accurate linear attention mechanism to edge AI computing.
Here is my published paper: https://arxiv.org/abs/2507.08637
The Problem: The Quadratic Wall of Long-Context AI
The single greatest barrier to truly long-context language models is the quadratic (O(n²)) complexity of the standard attention mechanism. This "quadratic wall" makes training models on document-length or book-length sequences computationally prohibitive, requiring massive GPU clusters and excluding most of the community from cutting-edge research and development. While optimized methods like FlashAttention-2 mitigate this, they remain fundamentally quadratic and fail on truly long sequences, hitting an inevitable Out-Of-Memory (OOM) wall. This bottleneck stifles innovation and concentrates state-of-the-art long-context capabilities in the hands of a few large labs.
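To make the scale of the problem concrete, here is a back-of-the-envelope calculation (an illustration for this proposal, not a measurement from the paper) of the memory needed just to materialize one fp16 attention matrix at a 128k context:

```python
# Rough illustration of why quadratic attention hits OOM at long contexts.
# Back-of-the-envelope numbers only, not results from the WERSA paper.
seq_len = 131_072                      # ~128k tokens
bytes_per_entry = 2                    # fp16
matrix_bytes = seq_len ** 2 * bytes_per_entry
print(f"One n x n attention matrix: {matrix_bytes / 2**30:.0f} GiB per head")  # 32 GiB
# A linear-time mechanism never materializes this n x n matrix,
# so memory grows with seq_len instead of seq_len squared.
```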
My Solution: WERSA - Linear-Time Attention Without Compromise
This project introduces Wavelet-Enhanced Random Spectral Attention (WERSA), a novel attention mechanism with linear (O(n)) time complexity that definitively breaks the quadratic barrier. As detailed in our paper, WERSA is not just a theoretical curiosity; it is a practical, proven solution.
By merging multi-resolution Haar wavelets with content-adaptive random spectral features, WERSA achieves a unique combination of efficiency and performance (a simplified sketch of the idea follows the benchmark highlights below). Our extensive benchmarks show that WERSA consistently matches or exceeds the accuracy of standard attention while being dramatically more efficient.
81% Faster Training: On ArXiv classification, WERSA reduced training time from 1554s to just 296s.
73% Fewer FLOPs: It achieved this speedup with a 73.4% reduction in computational cost.
SOTA on Extreme Contexts: On the ArXiv-128k benchmark, where both standard attention and FlashAttention-2 fail with OOM errors, WERSA achieves the highest accuracy (79.1%) and is twice as fast as its closest linear-time competitor.
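For readers who want a feel for the approach, the sketch below shows the general shape of a wavelet-plus-random-spectral-feature linear attention pass. It is a simplified illustration written for this proposal, not the released wersa code: the function names, the positive feature map, and the way the wavelet scales are combined are all assumptions, and the package on the Hub should be treated as the reference.

```python
# Simplified sketch: linear attention over random spectral features, combined
# across Haar wavelet scales. Illustrative only; not the actual WERSA code.
import torch
import torch.nn.functional as F


def haar_decompose(x):
    """One level of a Haar wavelet transform along the sequence axis.

    x: (batch, seq, dim) with even seq length; returns approximation and
    detail coefficients, each of shape (batch, seq // 2, dim).
    """
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    approx = (even + odd) / 2 ** 0.5
    detail = (even - odd) / 2 ** 0.5
    return approx, detail


def spectral_features(x, proj):
    """Project onto random spectral directions, then map to positive features."""
    return F.elu(x @ proj) + 1.0


def linear_attention(q, k, v, proj):
    """O(n) attention: summarize keys/values once, then read out per query."""
    q_f = spectral_features(q, proj)              # (b, n, r)
    k_f = spectral_features(k, proj)              # (b, n, r)
    kv = torch.einsum("bnr,bnd->brd", k_f, v)     # (b, r, d) summary, O(n)
    norm = q_f @ k_f.sum(dim=1).unsqueeze(-1)     # (b, n, 1) normalizer
    return torch.einsum("bnr,brd->bnd", q_f, kv) / (norm + 1e-6)


# Toy usage: combine a full-resolution pass with a coarse wavelet-level pass.
b, n, d, r = 2, 1024, 64, 128
q, k, v = (torch.randn(b, n, d) for _ in range(3))
proj = torch.randn(d, r) / d ** 0.5               # fixed random projection

qa, _ = haar_decompose(q)
ka, _ = haar_decompose(k)
va, _ = haar_decompose(v)
coarse = linear_attention(qa, ka, va, proj)       # (b, n // 2, d)
fine = linear_attention(q, k, v, proj)            # (b, n, d)
out = 0.5 * fine + 0.5 * coarse.repeat_interleave(2, dim=1)
print(out.shape)                                  # torch.Size([2, 1024, 64])
```

The key point of the sketch is that no n x n matrix is ever formed: keys and values are compressed into a fixed-size summary, and the wavelet levels add a multi-resolution view of the sequence at the same linear cost.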
Project Goals & Community Benefit
This grant will fund the pre-training and release of a state-of-the-art 8-billion parameter, Qwen-style language model built with WERSA attention. This is not just another model; it is a demonstration of a new paradigm for accessible, efficient AI.
The primary benefits to the Hugging Face community will be:
A Powerful, Open-Source Long-Context Model: We will provide the community with a highly capable 8B parameter model, pre-trained on a diverse dataset, that can handle context lengths far beyond what is typical for its size.
A Production-Ready Linear Attention Implementation: We will release the wersa package as a robust, well-documented, and easy-to-use building block. This will empower any developer to build their own memory-efficient models for tasks like RAG, document summarization, and creative writing (an illustrative loading example follows this list).
Democratizing Research on a Shoestring Budget: Our paper proves that SOTA results can be achieved on a single GPU. By releasing this model and the underlying code, we will enable researchers and developers with limited hardware to experiment with, fine-tune, and innovate on long-context architectures, fostering a more inclusive and sustainable AI ecosystem.
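As a rough illustration of the intended developer experience, loading the checkpoint linked in this proposal would most likely follow the standard custom-architecture pattern on the Hub. The exact repo layout and API are still being finalized, so treat this as an assumed interface rather than the final one:

```python
# Assumed usage (standard Hub custom-architecture loading); the final wersa
# package API and repo contents may differ from this sketch.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "vincenzodentamaro/wersa"  # repo referenced in this proposal
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Summarize the following document:\n..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```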
Execution Plan
The grant will be used to secure the necessary compute resources for a comprehensive pre-training run of the 8B WERSA model. Our plan is as follows:
1 - Finalize the wersa Package: Polish the package with extensive documentation and tutorials. (Almost done; see https://huggingface.co/vincenzodentamaro/wersa)
2 - Curate a Pre-training Dataset: Prepare a high-quality, diverse dataset by combining open-source corpora.
3 - Execute the Pre-training Run: Train the 8B WERSA model to convergence.
4 - Release and Promote: Publish the final model, the wersa package, and a live demo on the Hugging Face Hub, accompanied by a blog post and community outreach to ensure maximum visibility and adoption.
I believe WERSA represents a significant step forward in making powerful AI more accessible, affordable, and scalable. We are excited about the opportunity to bring this technology to the Hugging Face community.