Apply for community grant: Academic project (gpu)

#1
by vincenzodentamaro - opened

Here I would like to use HF machines to pre-train and push a Qwen3-8B-style model built with my WERSA attention mechanism, and to help revolutionize edge AI computing with a blazingly fast and accurate attention mechanism.

Here is my published paper: https://arxiv.org/abs/2507.08637

The Problem: The Quadratic Wall of Long-Context AI

The single greatest barrier to truly long-context language models is the quadratic (O(n²)) complexity of the standard attention mechanism. This "quadratic wall" makes training models on document-length or book-length sequences computationally prohibitive, requiring massive GPU clusters and excluding most of the community from cutting-edge research and development. While optimized methods like FlashAttention-2 mitigate this, they remain fundamentally quadratic and fail on truly long sequences, hitting an inevitable Out-Of-Memory (OOM) wall. This bottleneck stifles innovation and concentrates state-of-the-art long-context capabilities in the hands of a few large labs.
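To make the wall concrete, here is a rough, illustrative calculation (my own numbers, not taken from the paper) of the memory required just to materialize the n x n attention score matrix for a single head in fp16:

```python
# Illustrative only: memory for one dense n x n attention score matrix,
# for a single head and a single layer, stored in fp16 (2 bytes per score).
for n in (4_096, 32_768, 131_072):                # short, long, and ArXiv-128k-scale contexts
    score_matrix_bytes = n * n * 2
    print(f"n = {n:>7,}: {score_matrix_bytes / 2**30:6.2f} GiB per head per layer")

# n =   4,096:   0.03 GiB
# n =  32,768:   2.00 GiB
# n = 131,072:  32.00 GiB  -> multiplied across heads and layers, an inevitable OOM
```

A linear-time mechanism replaces that n x n matrix with fixed-size summaries, which is what the rest of this proposal is about.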

My Solution: WERSA - Linear-Time Attention Without Compromise

This project introduces Wavelet-Enhanced Random Spectral Attention (WERSA), a novel attention mechanism with linear (O(n)) time complexity that definitively breaks the quadratic barrier. As detailed in our paper, WERSA is not just a theoretical curiosity; it is a practical and proven solution.

By merging multi-resolution Haar wavelets with content-adaptive random spectral features, WERSA achieves a unique combination of efficiency and performance (a simplified sketch of the idea follows the benchmark highlights below). Our extensive benchmarks show that WERSA consistently matches or exceeds the accuracy of standard attention while being dramatically more efficient:

81% Faster Training: On ArXiv classification, WERSA reduced training time from 1554s to just 296s.

73% Fewer FLOPs: It achieved this speedup with a 73.4% reduction in computational cost.

SOTA on Extreme Contexts: On the ArXiv-128k benchmark, where both standard attention and FlashAttention-2 fail with OOM errors, WERSA achieves the highest accuracy (79.1%) and is twice as fast as its closest linear-time competitor.
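To give a concrete feel for how wavelets plus random spectral features can yield linear-time attention, here is a heavily simplified, non-causal sketch. It is not the actual WERSA implementation (see the paper and the wersa package for that), and it omits WERSA's content-adaptive components; the Haar low-pass step, the Performer-style random feature map, and every shape and name below are my own illustrative choices.

```python
import torch

def haar_downsample(x):
    # One level of Haar low-pass filtering: average adjacent token pairs,
    # halving the sequence length (assumes an even number of tokens).
    b, n, d = x.shape
    return x.reshape(b, n // 2, 2, d).mean(dim=2)

def random_spectral_features(x, proj):
    # Positive random-feature map (in the spirit of Performer's FAVOR+),
    # so softmax-like attention factorizes and becomes linear in sequence length.
    m = proj.shape[1]
    z = x @ proj                                    # (b, n, m)
    sq = x.pow(2).sum(dim=-1, keepdim=True) / 2     # ||x||^2 / 2
    return torch.exp(z - sq) / m ** 0.5

def linear_attention(q, k, v, proj):
    # Attention via feature maps: cost is O(n * m * d) instead of O(n^2 * d).
    scale = q.shape[-1] ** -0.25                    # combined 1/sqrt(d) scaling, as in softmax attention
    qf = random_spectral_features(q * scale, proj)  # (b, n_q, m)
    kf = random_spectral_features(k * scale, proj)  # (b, n_k, m)
    kv = torch.einsum("bnm,bnd->bmd", kf, v)        # fixed-size key/value summary
    norm = qf @ kf.sum(dim=1).unsqueeze(-1)         # (b, n_q, 1)
    return (qf @ kv) / (norm + 1e-6)

def wavelet_spectral_attention(q, k, v, proj, alpha=0.5):
    # Toy multi-resolution mix: a fine branch plus a Haar-coarsened branch.
    # Both branches stay linear in the sequence length.
    fine = linear_attention(q, k, v, proj)
    coarse = linear_attention(q, haar_downsample(k), haar_downsample(v), proj)
    return alpha * fine + (1 - alpha) * coarse

# Quick shape check on a 1,024-token toy batch.
torch.manual_seed(0)
b, n, d, m = 2, 1024, 64, 128
q, k, v = (torch.randn(b, n, d) for _ in range(3))
proj = torch.randn(d, m)                            # shared random projection
print(wavelet_spectral_attention(q, k, v, proj).shape)   # torch.Size([2, 1024, 64])
```

In this sketch the coarse branch lets distant tokens interact through summarized representations while the fine branch preserves local detail, which is the intuition behind mixing resolutions.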

Project Goals & Community Benefit

This grant will fund the pre-training and release of a state-of-the-art 8-billion parameter, Qwen-style language model built with WERSA attention. This is not just another model; it is a demonstration of a new paradigm for accessible, efficient AI.

The primary benefits to the Hugging Face community will be:

A Powerful, Open-Source Long-Context Model: We will provide the community with a highly capable 8B parameter model, pre-trained on a diverse dataset, that can handle context lengths far beyond what is typical for its size.

A Production-Ready Linear Attention Implementation: We will release the wersa package as a robust, well-documented, and easy-to-use building block. This will empower any developer to build their own memory-efficient models for tasks like RAG, document summarization, and creative writing (a usage sketch follows this list).

Democratizing Research on a Shoestring Budget: Our paper proves that SOTA results can be achieved on a single GPU. By releasing this model and the underlying code, we will enable researchers and developers with limited hardware to experiment with, fine-tune, and innovate on long-context architectures, fostering a more inclusive and sustainable AI ecosystem.
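To illustrate what "easy to use" should look like in practice, here is a hedged sketch of loading the released checkpoint with the standard transformers custom-code pattern; the repository id is the one linked below in the execution plan, but whether the final release uses exactly this entry point is still an assumption at this stage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the released checkpoint ships its WERSA attention code on the Hub,
# so it loads through the usual custom-code path (trust_remote_code=True).
repo_id = "vincenzodentamaro/wersa"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Long-document prompt: with linear attention, memory grows with length, not length squared.
document = "..."                                    # placeholder for a long input document
inputs = tokenizer("Summarize the following document:\n" + document, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```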

Execution Plan

The grant will be used to secure the necessary compute resources for a comprehensive pre-training run of the 8B WERSA model. Our plan is as follows:

1 - Finalize the wersa Package: Polish the package with extensive documentation and tutorials (almost done; see https://huggingface.co/vincenzodentamaro/wersa).

2 - Curate a Pre-training Dataset: Prepare a high-quality, diverse dataset by combining open-source corpora (an illustrative mixing sketch follows this list).

3 - Execute the Pre-training Run: Train the 8B WERSA model to convergence.

4 - Release and Promote: Publish the final model, the wersa package, and a live demo on the Hugging Face Hub, accompanied by a blog post and community outreach to ensure maximum visibility and adoption.
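For step 2, the sketch below shows one way the mixture could be assembled with the datasets library; the corpora, configs, and mixing weights are placeholders for illustration, not the mixture we have committed to.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder corpora and weights only -- the actual pre-training mixture is TBD.
# Streaming keeps us from materializing terabytes of text locally.
web  = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                    split="train", streaming=True)
wiki = load_dataset("wikimedia/wikipedia", name="20231101.en",
                    split="train", streaming=True)

mixed = interleave_datasets([web, wiki], probabilities=[0.8, 0.2], seed=42)

for example in mixed.take(3):                       # quick sanity check of the blend
    print(example["text"][:80])
```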

I believe WERSA represents a significant step forward in making powerful AI more accessible, affordable, and scalable. We are excited about the opportunity to bring this technology to the Hugging Face community.
