arxiv:2507.03117

BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers

Published on Jul 3

Abstract

AI-generated summary: BLaST is a sparsification method for linear layers that achieves high sparsity with minimal accuracy loss, leading to significant speedups and memory reductions in large-scale ML models.

The energy consumption of large-scale ML models is dominated by data movement: shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to a 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to a 1.6x inference speedup, a 1.11x pretraining speedup, and up to a 3.12x reduction in inference memory usage. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
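
The core mechanism described in the abstract, iteratively pruning MLP weight matrices to a block-sparse pattern that a fused SpMM kernel can exploit, can be illustrated with a short sketch. The PyTorch snippet below is not the BLaST implementation: the 32x32 block size, the per-block mean-magnitude score, and the linear sparsity schedule are assumptions chosen only to make the idea concrete, and the paper's actual procedure, kernel, and hyperparameters differ.

```python
# Illustrative sketch only: block-wise magnitude pruning of a linear layer's
# weights to a block-sparse pattern. Block size, scoring rule, and schedule
# are assumptions for this example, not BLaST's actual algorithm.
import torch


def block_sparsity_mask(weight: torch.Tensor, block: int, sparsity: float) -> torch.Tensor:
    """Boolean mask that zeroes the lowest-magnitude (block x block) tiles."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "pad weights to a multiple of the block size"
    # Score each tile by its mean absolute value.
    tiles = weight.abs().reshape(rows // block, block, cols // block, block)
    scores = tiles.mean(dim=(1, 3))              # shape: (rows/block, cols/block)
    num_prune = int(sparsity * scores.numel())   # tiles to drop at this sparsity level
    if num_prune == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = scores.flatten().kthvalue(num_prune).values
    keep = scores > threshold                    # True = tile survives
    # Expand the tile-level mask back to element granularity.
    return keep.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)


@torch.no_grad()
def iterative_block_prune(linear: torch.nn.Linear, block: int = 32,
                          target_sparsity: float = 0.95, steps: int = 10) -> None:
    """Raise block sparsity toward the target in steps, re-masking each time."""
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps
        mask = block_sparsity_mask(linear.weight, block, sparsity)
        linear.weight.mul_(mask.to(linear.weight.dtype))
        # In a real pipeline, a few training iterations would run here so the
        # surviving blocks can recover accuracy before the next pruning step.


# Example: prune the up-projection of a toy MLP to ~95% block sparsity.
layer = torch.nn.Linear(1024, 4096, bias=False)
iterative_block_prune(layer, block=32, target_sparsity=0.95, steps=10)
print(f"achieved sparsity: {(layer.weight == 0).float().mean():.2%}")
```

A block-structured zero pattern like this is what allows the pruned layers to be executed with a block-sparse SpMM kernel instead of a dense GEMM, which is where the speedups and memory savings reported in the abstract come from.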
