arxiv:2506.03077

StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

Published on Jun 3
· Submitted by Kullpar on Jun 6

AI-generated summary

StreamBP, a memory-efficient and exact backpropagation method, decomposes the chain rule to reduce memory costs, enabling longer sequence lengths and faster training speeds for language models compared to gradient checkpointing.

Abstract

Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost of storing activation values becomes huge during the backpropagation (BP) process, even when gradient checkpointing is applied. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP requires fewer computational FLOPs and achieves faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5x, while using comparable or even less BP time. Note that StreamBP's sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer model and is available at https://github.com/Ledzy/StreamBP.
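
To make the core idea more concrete, below is a minimal PyTorch sketch of one ingredient of this kind of approach: backpropagating the language-modeling loss over sequence chunks so the full [seq_len, vocab_size] logits tensor is never materialized. This is an illustrative sketch under assumed shapes, not the authors' implementation; the function name `chunked_lm_loss_backward` and the chunking granularity are assumptions, and StreamBP additionally applies the sequence-wise decomposition layer by layer to the transformer activations.

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss_backward(hidden, lm_head, labels, chunk_size=1024):
    """Backpropagate the mean cross-entropy LM loss without materializing
    the full [seq_len, vocab_size] logits tensor (illustrative sketch).

    hidden:  [seq_len, d_model] final hidden states produced by the model
    lm_head: torch.nn.Linear(d_model, vocab_size)
    labels:  [seq_len] target token ids (already shifted)
    """
    seq_len = hidden.size(0)
    grad_hidden = torch.zeros_like(hidden)  # gradient w.r.t. hidden states
    total_loss = 0.0
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        # Detach the chunk so only its logits live in the autograd graph.
        h = hidden[start:end].detach().requires_grad_(True)
        logits = lm_head(h)                              # [chunk, vocab_size]
        loss = F.cross_entropy(logits, labels[start:end],
                               reduction="sum") / seq_len
        loss.backward()       # accumulates lm_head grads, frees chunk logits
        grad_hidden[start:end] = h.grad                  # stash chunk gradient
        total_loss += loss.item()
    # One backward pass through the rest of the model with the assembled grad.
    hidden.backward(grad_hidden)
    return total_loss
```

For a sense of scale: with a 32k-token sequence and a 128k-entry vocabulary in bf16, the full logits tensor alone is roughly 8 GiB (plus its gradient), so streaming it in chunks already saves a substantial amount of memory before the layer-wise activation decomposition is even considered.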

Community

Paper author · Paper submitter

Project Page: https://github.com/Ledzy/StreamBP

StreamBP substantially reduces the memory cost of activation values and scales up the maximum sequence length to 2.8-5.5 times that of gradient checkpointing, while using similar or even less BP time. On another front, because memory cost scales linearly with sequence length, this sequence length scaling ability can be directly transferred to batch size scaling for faster training.

The memory consumption of BP for StreamBP compared to the vanilla method and gradient checkpointing is shown in the figure below.

[Figure: BP memory consumption of StreamBP vs. vanilla BP and gradient checkpointing]
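
A comparison like the one in the figure can be reproduced for any of the three methods with PyTorch's built-in CUDA memory statistics. The sketch below is a generic measurement helper, not the paper's benchmarking script; it assumes a Hugging Face-style causal LM that returns a `.loss` when given `labels`.

```python
import torch

def peak_bp_memory_gib(model, input_ids, labels):
    """Peak GPU memory (GiB) of one forward + backward pass.

    Assumes a causal LM (e.g., a Hugging Face transformers model) that
    returns an output object with a .loss attribute when labels are given.
    """
    model.zero_grad(set_to_none=True)
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    out = model(input_ids=input_ids, labels=labels)  # forward pass
    out.loss.backward()                              # backpropagation
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 3
```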

