StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs
Abstract
StreamBP is a memory-efficient and exact backpropagation method that decomposes the chain rule along the sequence dimension to reduce memory cost, enabling longer sequence lengths and faster training than gradient checkpointing.
Training language models on long-sequence data is essential for enhancing their capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost of storing activation values during backpropagation (BP) becomes prohibitive, even with gradient checkpointing. To tackle this challenge, we propose StreamBP, a memory-efficient and exact BP method that performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP requires fewer computational FLOPs and achieves faster BP by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5x while using comparable or even less BP time. Notably, StreamBP's sequence-length scaling ability transfers directly to batch-size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer model and is available at https://github.com/Ledzy/StreamBP.
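To make the decomposition concrete, consider the single most memory-hungry step: projecting hidden states through the LM head and computing the loss. Because cross-entropy is a sum over sequence positions, the chain rule decomposes linearly over sequence chunks, so exact gradients can be accumulated chunk by chunk without ever materializing the full [seq_len, vocab] logits tensor. Below is a minimal, hypothetical PyTorch sketch of this idea (illustrative helper names and shapes, not the StreamBP API; the paper applies the same decomposition layer-wise through the transformer as well):

```python
import torch
import torch.nn.functional as F

def chunked_lm_head_backward(hidden, lm_head, labels, chunk_size=1024):
    # hidden: [seq_len, d] final hidden states; labels: [seq_len] token ids.
    # Detach into a fresh leaf so per-chunk backward() calls accumulate into .grad.
    hidden = hidden.detach().requires_grad_(True)
    n, total_loss = labels.numel(), 0.0
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]            # view of the leaf tensor
        logits = lm_head(h)                             # only [chunk, vocab] is alive
        loss = F.cross_entropy(logits, labels[start:start + chunk_size],
                               reduction="sum") / n     # this chunk's share of the mean loss
        loss.backward()                                 # accumulates hidden.grad and head grads
        total_loss += loss.item()
    # hidden.grad now holds the exact gradient of the mean loss w.r.t. the hidden
    # states; it can be fed upstream, e.g. upstream_hidden.backward(hidden.grad).
    return total_loss, hidden.grad
```

Because each chunk's graph is freed right after its backward pass, peak memory for the logits drops from O(seq_len x vocab) to O(chunk_size x vocab), while the accumulated gradient is exactly the full-sequence gradient.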
Community
Project Page: https://github.com/Ledzy/StreamBP
StreamBP substantially reduces the memory cost of activation values, scaling the maximum sequence length up to 2.8-5.5x beyond gradient checkpointing while using similar or even less BP time. On another front, since memory cost scales linearly with sequence length, this sequence-length headroom can be traded directly for batch size to accelerate training: for instance, a budget that fits a single 4x-longer sequence can instead fit roughly four sequences of the original length per step.
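As a hedged illustration of this batch-size transfer (a hypothetical helper using standard gradient accumulation, not the StreamBP code): the same decompose-and-accumulate pattern applies along the batch dimension, since a mean loss over a batch is a weighted sum of per-micro-batch means.

```python
def chunked_batch_backward(model, inputs, labels, loss_fn, micro_bs=1):
    # inputs, labels: [B, ...]; loss_fn is assumed to return a mean over its micro-batch.
    B, total_loss = inputs.size(0), 0.0
    for i in range(0, B, micro_bs):
        x, y = inputs[i:i + micro_bs], labels[i:i + micro_bs]
        # Re-weight each micro-batch mean so the accumulated gradient equals
        # the exact full-batch mean-loss gradient.
        loss = loss_fn(model(x), y) * (x.size(0) / B)
        loss.backward()
        total_loss += loss.item()
    return total_loss
```

In this way, the memory freed by streaming along the sequence dimension can be spent on more sequences per optimizer step.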
The figure below compares the BP memory consumption of StreamBP against the vanilla method and gradient checkpointing.

The following similar papers were recommended by the Semantic Scholar API:
- Training Long-Context LLMs Efficiently via Chunk-wise Optimization (2025)
- MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models (2025)
- FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer (2025)
- SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training (2025)
- Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity (2025)
- Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling (2025)
- SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices (2025)