arxiv:2506.05229

Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Published on Jun 5 · Submitted by yurakuratov on Jun 6
Abstract

Diagonal Batching enables parallel inference in Recurrent Memory Transformers, significantly improving speed and efficiency for long-context tasks.

AI-generated summary

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models can adopt it without retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
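
To make the reordering concrete, here is a minimal, hypothetical sketch (not the authors' code) of how the (layer, segment) grid can be regrouped into anti-diagonals: under the layer-level recurrence assumed below, every cell on diagonal d = layer + segment depends only on cells from diagonal d - 1, so each diagonal can be launched as one batched GPU call.

# Minimal sketch (not the authors' code) of scheduling the (layer, segment) grid.
# Assumes a layer-level recurrent model in which cell (layer, seg) needs the
# hidden states from (layer - 1, seg) and the memory from (layer, seg - 1).

def sequential_schedule(n_layers, n_segments):
    """Baseline RMT order: finish every layer of a segment before the next segment."""
    return [(layer, seg) for seg in range(n_segments) for layer in range(n_layers)]

def diagonal_schedule(n_layers, n_segments):
    """Diagonal Batching order: group cells by d = layer + seg.
    Both predecessors of a cell lie on diagonal d - 1, so the cells inside
    one group are mutually independent and can run as a single batched call."""
    return [
        [(layer, d - layer) for layer in range(n_layers) if 0 <= d - layer < n_segments]
        for d in range(n_layers + n_segments - 1)
    ]

for d, group in enumerate(diagonal_schedule(n_layers=4, n_segments=6)):
    print(f"diagonal {d}: {group}")  # at most n_layers cells per diagonal

With 4 layers and 6 segments, the diagonals grow from 1 cell up to 4 (the N_Layers ceiling) and then taper off, which is where the "up to N_Layers" concurrency in the figure description below comes from.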

Community

Left: Parallel RMT generalizes a family of models with layer-level memory. Each layer maintains its own memory state and passes it horizontally to the same layer in the next segment. This removes inter-layer memory flow, but each layer must still process its segments in order, which creates a layer-wise recurrence.

Right: Diagonal Batching rearranges the 2D grid of layers (rows) and segments (columns) into independent "diagonals" (same-colored blocks). All operations on one diagonal (up to N_Layers of them) can then execute concurrently on the GPU, eliminating the sequential bottleneck while preserving the exact layer-level recurrence.
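
As an illustration of how one such diagonal could be executed, the sketch below (with assumed per-layer module and state names, not the paper's implementation) launches every (layer, segment) cell of a diagonal on its own CUDA stream; since all inputs were produced on the previous diagonal, the launches do not depend on one another.

import torch

def run_diagonal(layers, hidden, memory, d, n_segments):
    """Execute all (layer, seg) cells on anti-diagonal d concurrently (sketch).

    layers  -- list of per-layer modules; layers[i](h, m) -> (new_h, new_m) is assumed
    hidden  -- hidden[seg]: activations of segment seg after its last completed layer
    memory  -- memory[i]: memory state carried by layer i across segments
    """
    streams = [torch.cuda.Stream() for _ in layers]
    for i, layer in enumerate(layers):
        seg = d - i
        if not 0 <= seg < n_segments:
            continue  # this layer has no cell on diagonal d
        with torch.cuda.stream(streams[i]):
            # Inputs come from diagonal d - 1, so cells on this diagonal
            # never read each other's outputs.
            hidden[seg], memory[i] = layer(hidden[seg], memory[i])
    torch.cuda.synchronize()  # wait for the whole diagonal before the next one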


