|
--- |
|
extra_gated_heading: | |
|
Hi, your request will be fast-approved if you: |
|
(1) Complete all form fields in full detail. |
|
(2) Clearly demonstrate your project's significance, including the product(s) in which it is or will be used and the expected economic benefit. (Commercial use cases receive the highest priority.)
|
|
|
extra_gated_description: | |
|
Approval times are prioritized based on project impact. Submissions for high-value commercial applications are typically reviewed within 48 hours. Additionally, we will consider sharing the SageAttention3 code with significant projects at a later stage.
|
|
|
extra_gated_fields: |
|
"Full Name": |
|
type: text |
|
required: true |
|
"User Type (Corporate/Organization are welcome)": |
|
type: select |
|
required: true |
|
options: |
|
- "Corporate/Organization User" |
|
- "Individual User" |
|
"Email (please use Institutional Email)": |
|
type: text |
|
required: true |
|
"Country/Region": |
|
type: country |
|
required: true |
|
"Your Organization and Department": |
|
type: text |
|
required: true |
|
"Which Product will you use the Code for? Estimate the speedup and the economic USD benefit. (Commercial cases are very welcome. Please introduce in detail)": |
|
type: text |
|
required: true |
|
"Which of your products have you used SageAttention? Report the speedup and estimate the economic USD benefit. (Commercial cases are very welcome. Please introduce in detail)": |
|
type: text |
|
required: true |
|
license: apache-2.0  # commercial applications are also allowed
---
|
|
|
|
|
|
# SageAttention |
|
<!-- We are continuously updating more features. You could **Star** and **Watch** our repository to stay updated. |
|
|
|
--- --> |
|
This repository provides the official implementation of SageAttention, SageAttention2, and **SageAttention2++**, which deliver significant speedups on most GPUs without losing accuracy, across all models, in a plug-and-play way.
|
|
|
**SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration** |
|
Paper: https://arxiv.org/abs/2410.02367 |
|
Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen |
|
|
|
**SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization** |
|
Paper: https://arxiv.org/abs/2411.10958 |
|
Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen |
|
|
|
**SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training** |
|
Paper: https://arxiv.org/abs/2505.11594 |
|
Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen |
|
|
|
|
|
## Installation |
|
### Base environment |
|
+ `python>=3.9` , `torch>=2.3.0` , `triton>=3.0.0` |
|
+ `CUDA`:
|
+ **`>=12.8` for Blackwell and SageAttention2++** |
|
+ `>=12.4` for fp8 support on Ada |
|
+ `>=12.3` for fp8 support on Hopper |
|
+ `>=12.0` for Ampere |
|
+ `flash-attn` for benchmarking (a quick check of these requirements is sketched below)
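
If helpful, the short script below is an illustrative sanity check of these requirements (not part of the package); it only prints the installed versions and the GPU compute capability:

```python
# Illustrative environment check (not part of SageAttention).
import torch
import triton

print("torch:", torch.__version__)       # needs >= 2.3.0
print("triton:", triton.__version__)     # needs >= 3.0.0
print("CUDA (torch build):", torch.version.cuda)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # e.g. sm_80/sm_86 Ampere, sm_89 Ada, sm_90 Hopper
    print(f"GPU compute capability: sm_{major}{minor}")
else:
    print("No CUDA device detected.")
```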
|
|
|
### Install Package |
|
|
|
To use SageAttention 2.2.0 (which includes SageAttention2++), please **compile from source**:
|
```bash
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
python setup.py install  # or: pip install -e .
```
|
|
|
To benchmark the speed against FlashAttention3, please compile FlashAttention3 from source: |
|
```bash
git clone https://github.com/Dao-AILab/flash-attention.git --recursive
cd flash-attention
git checkout b7d29fb3b79f0b78b1c369a52aaa6628dabfb0d7  # 2.7.2 release
cd hopper
python setup.py install
```
|
|
|
## How to Use |
|
**Note that the default API already uses SageAttention2++, corresponding to the `_qattn_sm89.qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf` kernel.**
|
```python
from sageattention import sageattn

attn_output = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```
|
+ `q, k, v` are **FP16/BF16** tensors with shape `(batch_size, head_num, seq_len, head_dim)` under the default `tensor_layout="HND"`. For the shape `(batch_size, seq_len, head_num, head_dim)`, set `tensor_layout="NHD"`.

+ `is_causal` determines whether a causal mask is applied. (A runnable example follows below.)
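
For instance, with arbitrary shapes and assuming a CUDA GPU supported by one of the kernels above:

```python
import torch
from sageattention import sageattn

# Example shapes: batch 2, 24 heads, 4096 tokens, head_dim 128 (arbitrary).
q = torch.randn(2, 24, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(2, 24, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(2, 24, 4096, 128, dtype=torch.float16, device="cuda")

# "HND" layout = (batch_size, head_num, seq_len, head_dim), matching the tensors above.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)
print(out.shape)  # torch.Size([2, 24, 4096, 128])
```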
|
|
|
### Available APIs: |
|
+ `sageattn`: Automatically selects the optimal kernel based on the GPU to achieve a good performance-accuracy trade-off. |
|
+ `sageattn_qk_int8_pv_fp16_triton`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using Triton backend. |
|
+ `sageattn_qk_int8_pv_fp16_cuda`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using CUDA backend. |
|
+ `sageattn_qk_int8_pv_fp8_cuda`: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend. (the default API is already SageAttention2++) |
|
+ `sageattn_qk_int8_pv_fp8_cuda_sm90`: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend, specifically optimized for Hopper GPUs. |
|
+ `sageattn_varlen`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using the Triton backend. Supports varying sequence lengths within the same batch.
|
|
|
For optimal speed and accuracy on custom devices and models, we strongly recommend referring to [this file](./sageattention/core.py) for detailed guidance.
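
As a quick illustration of selecting a kernel explicitly, the sketch below calls one of the backend-specific APIs directly. It assumes the functions listed above are importable from `sageattention` and accept the same core arguments as `sageattn` (`q`, `k`, `v`, `tensor_layout`, `is_causal`); see `core.py` for the authoritative signatures and additional tuning options.

```python
import torch
from sageattention import sageattn, sageattn_qk_int8_pv_fp8_cuda

q = torch.randn(1, 16, 2048, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 16, 2048, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 16, 2048, 128, dtype=torch.float16, device="cuda")

# Portable default: let sageattn pick a suitable kernel for the current GPU.
out_auto = sageattn(q, k, v, tensor_layout="HND", is_causal=False)

# Explicit choice (INT8 QK^T + FP8 PV); assumed to share the same core
# arguments as sageattn -- consult core.py before relying on this call.
out_fp8 = sageattn_qk_int8_pv_fp8_cuda(q, k, v, tensor_layout="HND", is_causal=False)
```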
|
|
|
> **Note:** Different sequence lengths between `q` and `k, v`, as well as grouped-query attention, are supported.
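
For example, a grouped-query-attention-shaped call with fewer `k, v` heads and a shorter key/value sequence than the query might look like this (a sketch with arbitrary sizes):

```python
import torch
from sageattention import sageattn

# 32 query heads sharing 8 key/value heads (grouped-query attention),
# with a longer query sequence (4096) than the key/value sequence (1024).
q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 1024, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 1024, 128, dtype=torch.float16, device="cuda")

out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # torch.Size([1, 32, 4096, 128])
```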
|
|
|
|
|
### Plug-and-play Example |
|
|
|
> **Note:** Not all models work with `F.scaled_dot_product_attention = sageattn`. In general, you should replace the original attention by modifying the `Attention` class of the target model. For image and video models, we suggest only replacing the attention in the DiT (see `example/mochi.py` for details). A minimal sketch of this pattern follows.
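
The toy module below illustrates the pattern (it is hypothetical and not taken from any specific model): the only change is routing the attention call through `sageattn` instead of `F.scaled_dot_product_attention`. Real models differ in how they project and reshape heads, so adapt the layout handling accordingly.

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

class ToyAttention(torch.nn.Module):
    """Hypothetical attention block standing in for a model's Attention class."""

    def __init__(self, dim: int = 1024, heads: int = 16, use_sage: bool = True):
        super().__init__()
        self.heads = heads
        self.qkv = torch.nn.Linear(dim, dim * 3)
        self.proj = torch.nn.Linear(dim, dim)
        self.use_sage = use_sage

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq_len, head_dim), i.e. the "HND" layout.
        q, k, v = (t.view(b, n, self.heads, d // self.heads).transpose(1, 2) for t in (q, k, v))
        if self.use_sage:
            out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
        else:
            out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

attn = ToyAttention().half().cuda()
x = torch.randn(2, 4096, 1024, dtype=torch.float16, device="cuda")
print(attn(x).shape)  # torch.Size([2, 4096, 1024])
```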
|
|
|
### Kernel Benchmarking |
|
We provide a benchmarking script to compare the speed of different kernels, including SageAttention, FlashAttention2, and FlashAttention3. Please refer to the `benchmark/` directory for more details.
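
For a rough, self-contained timing comparison outside those scripts, a CUDA-event sketch like the one below can be used (it only compares `sageattn` against PyTorch's `scaled_dot_product_attention`; the scripts in `benchmark/` remain the authoritative reference):

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

def avg_ms(fn, iters: int = 50) -> float:
    """Average milliseconds per call, measured with CUDA events."""
    for _ in range(5):  # warm-up
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

q = torch.randn(2, 32, 8192, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

print("sdpa     (ms):", avg_ms(lambda: F.scaled_dot_product_attention(q, k, v)))
print("sageattn (ms):", avg_ms(lambda: sageattn(q, k, v, tensor_layout="HND", is_causal=False)))
```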
|
|
|
## Performance |
|
### Speed of Kernels |
|
|
|
`8+8` denotes a kernel with INT8 quantization for $QK^\top$ and FP8 quantization for $PV$; `8+16` uses INT8 for $QK^\top$ and FP16 (with an FP16 accumulator) for $PV$.
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
> **Note:** The TOPS results refer only to the attention kernel, excluding quantization and smoothing.
|
|
|
### End-to-end Performance |
|
#### **End-to-End Accuracy:** |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
|
|
#### **End-to-End Speedup:** |
|
|
|
 |
|
|
|
|
|
## Citation |
|
**If you use this code or find our work valuable, please cite:** |
|
```bibtex
@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

@inproceedings{zhang2024sageattention2,
  title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization},
  author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}

@article{zhang2025sageattention3,
  title={SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.11594},
  year={2025}
}

@article{zhang2025sageattention2++,
  title={SageAttention2++: A More Efficient Implementation of SageAttention2},
  author={Zhang, Jintao and Xu, Xiaoming and Wei, Jia and Huang, Haofeng and Zhang, Pengle and Xiang, Chendong and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.21136},
  year={2025}
}
```
|
|