---
extra_gated_heading: |
  Hi, your request will be fast-approved if you:
  (1) Complete all form fields in full detail.
  (2) Clearly demonstrate your project's significance, including the product it is used in or targeted at and the expected economic benefit. (Commercial use cases receive the highest priority.)
extra_gated_description: |
  Approval is prioritized by project impact. Submissions for high-value commercial applications are typically reviewed within 48 hours. We may also share the SageAttention3 code with significant projects later.
extra_gated_fields:
  "Full Name":
    type: text
    required: true
  "User Type (Corporate/Organization users are welcome)":
    type: select
    required: true
    options:
      - "Corporate/Organization User"
      - "Individual User"
  "Email (please use an institutional email)":
    type: text
    required: true
  "Country/Region":
    type: country
    required: true
  "Your Organization and Department":
    type: text
    required: true
  "Which product will you use the code for? Estimate the speedup and the economic benefit in USD. (Commercial cases are very welcome; please describe in detail.)":
    type: text
    required: true
  "In which of your products have you already used SageAttention? Report the speedup and estimate the economic benefit in USD. (Commercial cases are very welcome; please describe in detail.)":
    type: text
    required: true
license: apache-2.0
---
# SageAttention
<!-- We are continuously updating more features. You could **Star** and **Watch** our repository to stay updated.
--- -->
This repository provides the official implementation of SageAttention, SageAttention2, and **SageAttention2++**, which deliver substantial plug-and-play speedups on most GPUs without losing accuracy across all models. The code is released under the Apache-2.0 license, and commercial applications are also allowed.

**SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration**
Paper: https://arxiv.org/abs/2410.02367
Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen

**SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization**
Paper: https://arxiv.org/abs/2411.10958
Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen

**SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training**
Paper: https://arxiv.org/abs/2505.11594
Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
## Installation
### Base environment
+ `python>=3.9` , `torch>=2.3.0` , `triton>=3.0.0`
- `CUDA`:
+ **`>=12.8` for Blackwell and SageAttention2++**
+ `>=12.4` for fp8 support on Ada
+ `>=12.3` for fp8 support on Hopper
+ `>=12.0` for Ampere
+ `flash-attn` for benchmarking (a quick check of these versions is sketched below)
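As a quick sanity check of these requirements, the following sketch uses standard PyTorch/Triton introspection (illustrative only, not a script shipped with this repository):
```python
import torch
import triton

# Illustrative environment check against the requirements listed above.
print("torch  :", torch.__version__)            # needs >= 2.3.0
print("triton :", triton.__version__)           # needs >= 3.0.0
print("CUDA   :", torch.version.cuda)           # >= 12.8 for Blackwell and SageAttention2++
major, minor = torch.cuda.get_device_capability()
print(f"GPU    : sm_{major}{minor}")            # e.g. sm_80/86 Ampere, sm_89 Ada, sm_90 Hopper
```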
### Install Package
To use SageAttention 2.2.0 (which includes SageAttention2++), please **compile from source**:
```
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
python setup.py install # or pip install -e .
```
To benchmark the speed against FlashAttention3, please compile FlashAttention3 from source:
```
git clone https://github.com/Dao-AILab/flash-attention.git --recursive
cd flash-attention
git checkout b7d29fb3b79f0b78b1c369a52aaa6628dabfb0d7 # 2.7.2 release
cd hopper
python setup.py install
```
## How to Use
**Note that the default API is already SageAttention2++, corresponding to `_qattn_sm89.qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf`**
```python
from sageattention import sageattn
attn_output = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```
+ `q, k, v` are **FP16/BF16** tensors with shape `(batch_size, head_num, seq_len, head_dim)` under the default `tensor_layout="HND"`. For the shape `(batch_size, seq_len, head_num, head_dim)`, set `tensor_layout="NHD"`.
+ `is_causal` determines whether a causal mask is applied. A complete call is sketched below.
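A minimal end-to-end call, assuming a CUDA GPU and the default `HND` layout (the shapes are illustrative):
```python
import torch
from sageattention import sageattn

# (batch_size, head_num, seq_len, head_dim) in FP16, i.e. tensor_layout="HND".
q = torch.randn(2, 32, 1024, 128, dtype=torch.float16, device="cuda")
k = torch.randn(2, 32, 1024, 128, dtype=torch.float16, device="cuda")
v = torch.randn(2, 32, 1024, 128, dtype=torch.float16, device="cuda")

out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)
print(out.shape)  # torch.Size([2, 32, 1024, 128])
```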
### Available APIs:
+ `sageattn`: Automatically selects the optimal kernel based on the GPU to achieve a good performance-accuracy trade-off.
+ `sageattn_qk_int8_pv_fp16_triton`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using Triton backend.
+ `sageattn_qk_int8_pv_fp16_cuda`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using CUDA backend.
+ `sageattn_qk_int8_pv_fp8_cuda`: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend. (the default API is already SageAttention2++)
+ `sageattn_qk_int8_pv_fp8_cuda_sm90`: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend, specifically optimized for Hopper GPUs.
+ `sageattn_varlen`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using Triton backend. Support for varying sequence lengths within the same batch.
For optimal speed and accuracy on custom devices and models, we strongly recommend referring to [this file](./sageattention/core.py) for detailed guidance; an example of pinning a specific kernel is sketched below.
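For example, to pin a specific backend instead of relying on automatic selection (a sketch assuming these kernels accept the same `q, k, v`, `tensor_layout`, and `is_causal` arguments as `sageattn`; check `core.py` for the exact parameter lists):
```python
import torch
from sageattention import sageattn_qk_int8_pv_fp16_triton

q = torch.randn(2, 32, 2048, 128, dtype=torch.float16, device="cuda")
k = torch.randn(2, 32, 2048, 128, dtype=torch.float16, device="cuda")
v = torch.randn(2, 32, 2048, 128, dtype=torch.float16, device="cuda")

# Explicitly use the Triton INT8-QK^T / FP16-PV kernel rather than the
# backend chosen automatically by `sageattn`.
out = sageattn_qk_int8_pv_fp16_triton(q, k, v, tensor_layout="HND", is_causal=False)
```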
> **Note:**
Different sequence lengths between `q` and `k, v`, as well as grouped-query attention (GQA), are supported (see the sketch below).
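As an illustration of this note (assuming the kernel broadcasts the KV heads across query-head groups; verify the exact behavior in `core.py`), a grouped-query decode-style call might look like:
```python
import torch
from sageattention import sageattn

# 32 query heads attending over 8 KV heads, with a single new query token
# against a 4096-token KV cache (illustrative shapes).
q = torch.randn(1, 32, 1, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 4096, 128, dtype=torch.float16, device="cuda")

out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # torch.Size([1, 32, 1, 128])
```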
### Plug-and-play Example
> **Note:** Not all models work with `F.scaled_dot_product_attention = sageattn`. Technically, you should replace the original attention by modifying the `Attention` class of the target model. For image and video models, we suggest only replacing the attention in the DiT (see `example/mochi.py` for details). A minimal monkey-patching sketch is shown below.
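The following sketch monkey-patches models that route attention through `F.scaled_dot_product_attention`; the wrapper name and fallback logic are illustrative, not part of this package, and modifying the model's `Attention` class directly remains the recommended approach:
```python
import torch.nn.functional as F
from sageattention import sageattn

_original_sdpa = F.scaled_dot_product_attention

def sdpa_with_sageattn(query, key, value, attn_mask=None, dropout_p=0.0,
                       is_causal=False, scale=None, **kwargs):
    # SDPA's (batch, heads, seq, head_dim) layout matches tensor_layout="HND".
    # Fall back to the original kernel for arguments not handled here.
    if attn_mask is None and dropout_p == 0.0 and scale is None:
        return sageattn(query, key, value, tensor_layout="HND", is_causal=is_causal)
    return _original_sdpa(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p,
                          is_causal=is_causal, scale=scale, **kwargs)

F.scaled_dot_product_attention = sdpa_with_sageattn
```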
### Kernel Benchmarking
We provide a benchmarking script to compare the speed of different kernels, including SageAttention, FlashAttention2, and FlashAttention3; please refer to the `benchmark/` directory for details.
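For a rough in-place comparison without the provided scripts, a simple CUDA-event timing sketch is shown below (illustrative only; the numbers reported in this README come from the `benchmark/` scripts):
```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

def time_attn(fn, q, k, v, iters=50):
    for _ in range(5):          # warm-up
        fn(q, k, v)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # average ms per call

q = torch.randn(2, 32, 4096, 128, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
print("sageattn:", time_attn(lambda a, b, c: sageattn(a, b, c, is_causal=True), q, k, v), "ms")
print("sdpa    :", time_attn(lambda a, b, c: F.scaled_dot_product_attention(a, b, c, is_causal=True), q, k, v), "ms")
```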
## Performance
### Speed of Kernels
`8+8` denotes the kernel with INT8 quantization for $QK^\top$ and FP8 quantization for $PV$; `8+16` uses FP16 for $PV$ with an FP16 accumulator.
![Local Image](./assets/sage2++.png)
![Local Image](./assets/4090_hd128.png)
![Local Image](./assets/L20_hd128.png)
![Local Image](./assets/H100_hd128.png)
![Local Image](./assets/H20_hd128.png)
![Local Image](./assets/A100_hd128.png)
![Local Image](./assets/3090_hd128.png)
> **Note:** The TOPS results refer only to the attention kernel itself, excluding quantization and smoothing.
### End-to-end Performance
#### **End-to-End Accuracy:**
![Local Image](./assets/22.png)
![Local Image](./assets/23.png)
![Local Image](./assets/24.png)
![Local Image](./assets/25.png)
#### **End-to-End Speedup:**
![Local Image](./assets/26.png)
## Citation
**If you use this code or find our work valuable, please cite:**
```
@inproceedings{zhang2025sageattention,
title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
@inproceedings{zhang2024sageattention2,
title={Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}
@article{zhang2025sageattention3,
title={SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training},
author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
journal={arXiv preprint arXiv:2505.11594},
year={2025}
}
@article{zhang2025sageattention2++,
title={SageAttention2++: A More Efficient Implementation of SageAttention2},
author={Zhang, Jintao and Xu, Xiaoming and Wei, Jia and Huang, Haofeng and Zhang, Pengle and Xiang, Chendong and Zhu, Jun and Chen, Jianfei},
journal={arXiv preprint arXiv:2505.21136},
year={2025}
}
```