arxiv:2506.03077

StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

Published on Jun 3
· Submitted by Kullpar on Jun 6

AI-generated summary

StreamBP, a memory-efficient and exact backpropagation method, decomposes the chain rule to reduce memory costs, enabling longer sequence lengths and faster training speeds for language models compared to gradient checkpointing.

Abstract

Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost of storing activation values becomes huge during the backpropagation (BP) process, even when gradient checkpointing is applied. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP requires fewer computational FLOPs and achieves faster BP speed by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5x, while using comparable or even less BP time. Note that StreamBP's sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer model and is available at https://github.com/Ledzy/StreamBP.
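
To make the core idea more concrete, below is a minimal PyTorch sketch of one ingredient of this kind of approach: backpropagating the language-modeling loss over sequence chunks so the full [seq_len, vocab_size] logits tensor is never materialized. This is an illustrative sketch under assumed shapes, not the authors' implementation; the function name `chunked_lm_loss_backward` and the chunking granularity are assumptions, and StreamBP additionally applies the sequence-wise decomposition layer by layer to the transformer activations.

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss_backward(hidden, lm_head, labels, chunk_size=1024):
    """Backpropagate the mean cross-entropy LM loss without materializing
    the full [seq_len, vocab_size] logits tensor (illustrative sketch).

    hidden:  [seq_len, d_model] final hidden states produced by the model
    lm_head: torch.nn.Linear(d_model, vocab_size)
    labels:  [seq_len] target token ids (already shifted)
    """
    seq_len = hidden.size(0)
    grad_hidden = torch.zeros_like(hidden)  # gradient w.r.t. hidden states
    total_loss = 0.0
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        # Detach the chunk so only its logits live in the autograd graph.
        h = hidden[start:end].detach().requires_grad_(True)
        logits = lm_head(h)                              # [chunk, vocab_size]
        loss = F.cross_entropy(logits, labels[start:end],
                               reduction="sum") / seq_len
        loss.backward()       # accumulates lm_head grads, frees chunk logits
        grad_hidden[start:end] = h.grad                  # stash chunk gradient
        total_loss += loss.item()
    # One backward pass through the rest of the model with the assembled grad.
    hidden.backward(grad_hidden)
    return total_loss
```

For a sense of scale: with a 32k-token sequence and a 128k-entry vocabulary in bf16, the full logits tensor alone is roughly 8 GiB (plus its gradient), so streaming it in chunks already saves a substantial amount of memory before the layer-wise activation decomposition is even considered.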

Community

Paper author · Paper submitter

Project Page: https://github.com/Ledzy/StreamBP

StreamBP substantially reduces the memory cost of activation values and scales up the maximum sequence length to 2.8-5.5 times that of gradient checkpointing, while using similar or even less BP time. On another front, because memory cost scales linearly with sequence length, this sequence length scaling ability can be directly transferred to batch size scaling for faster training.

The memory consumption of BP for StreamBP compared to the vanilla method and gradient checkpointing is shown in the figure below.

[Figure: BP memory consumption of StreamBP vs. vanilla BP and gradient checkpointing]
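
A comparison like the one in the figure can be reproduced for any of the three methods with PyTorch's built-in CUDA memory statistics. The sketch below is a generic measurement helper, not the paper's benchmarking script; it assumes a Hugging Face-style causal LM that returns a `.loss` when given `labels`.

```python
import torch

def peak_bp_memory_gib(model, input_ids, labels):
    """Peak GPU memory (GiB) of one forward + backward pass.

    Assumes a causal LM (e.g., a Hugging Face transformers model) that
    returns an output object with a .loss attribute when labels are given.
    """
    model.zero_grad(set_to_none=True)
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    out = model(input_ids=input_ids, labels=labels)  # forward pass
    out.loss.backward()                              # backpropagation
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 3
```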

