Learning to Skip the Middle Layers of Transformers
Abstract
A novel conditional-computation architecture for Transformers dynamically skips middle layers based on the input via a learned gating mechanism, but does not outperform dense baselines in the trade-off between computational cost and validation performance.
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme, and gate sparsity with an adaptive regularization loss. We aimed to reduce the compute requirements for 'simpler' tokens and potentially to foster an emergent multi-level representational hierarchy, but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
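For readers who want a concrete picture, the sketch below shows one way the symmetric, middle-outward skipping could be wired up in PyTorch. It is a simplified illustration under our own assumptions (hypothetical class and variable names, a soft interpolation instead of a hard skip, and no attention gating or norm scheme), not the authors' released implementation; see the linked repository for that.

```python
import torch
import torch.nn as nn


class SkipMiddleStack(nn.Module):
    """Toy stack of Transformer blocks gated over symmetric middle spans (illustrative only)."""

    def __init__(self, num_layers: int, d_model: int, block_factory):
        super().__init__()
        assert num_layers % 2 == 0, "assumes an even number of layers"
        self.blocks = nn.ModuleList([block_factory() for _ in range(num_layers)])
        # One raw gate per symmetric pair of layers, predicted per token from
        # the residual stream; index 0 corresponds to the innermost (middle) pair.
        self.gate_proj = nn.Linear(d_model, num_layers // 2)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # x: (batch, seq, d_model)
        half = len(self.blocks) // 2
        u = torch.sigmoid(self.gate_proj(x))  # (batch, seq, half)
        # Nested skip gates: an outer pair can only be skipped to the extent that
        # every pair inside it is skipped, so skipped spans grow from the middle
        # outward. skip[..., 0] is the innermost pair, skip[..., half - 1] the outermost.
        skip = torch.cumprod(u, dim=-1)
        keep = 1.0 - skip
        out = x
        for i, block in enumerate(self.blocks):
            # Distance of layer i from the middle of the stack (0 = innermost pair).
            d = (half - 1 - i) if i < half else (i - half)
            g = keep[..., d].unsqueeze(-1)  # (batch, seq, 1)
            # Soft skip: interpolate between the block output and the identity.
            out = g * block(out) + (1.0 - g) * out
        return out, skip
```

In this toy version, a sparsity regularizer could simply penalize the mean of `keep` to encourage skipping; the paper instead uses an adaptive regularization loss, and its gating, attention masking, and norm handling differ from this sketch.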
Community
We explore a novel gated Transformer architecture that dynamically skips layers from the middle outward, motivated by interpretability research showing that the middle layers are more often redundant, and by growing interest in hierarchical models (e.g., byte-level) and block-level sparsity (e.g., mixture-of-depths).
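To make the block-level sparsity a little more concrete, below is one hypothetical way to realize the gated attention mentioned in the abstract, where tokens cannot attend to key positions whose middle span was skipped. The function name, tensor shapes, and the log-gate formulation are assumptions made for illustration, not the exact mechanism from the paper or repository.

```python
import torch
import torch.nn.functional as F


def gated_causal_attention(q, k, v, keep):
    """Causal attention with keys at (softly) skipped positions down-weighted.

    q, k, v: (batch, heads, seq, head_dim); keep: (batch, seq) in (0, 1), the
    per-token probability that the middle span was executed at that position.
    """
    scale = q.size(-1) ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale  # (batch, heads, seq, seq)
    seq_len = q.size(-2)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))
    # Adding log(keep) per key position drives the score for a fully skipped
    # token to -inf (it cannot be attended to) and leaves a fully executed
    # token's score unchanged.
    scores = scores + torch.log(keep.clamp_min(1e-6))[:, None, None, :]
    return F.softmax(scores, dim=-1) @ v
```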
Thank you for a paper that reports a negative result - these are severely lacking in our field. Good to know which approaches don't work.
Attractor-Based Pruning Model (ABPM)
The Attractor-Based Pruning Model could view cognition as the traversal of a symbolic landscape shaped by attractors: stable points or fields representing conceptual convergence. As the computation unfolds, the system could theoretically identify regions of semantic redundancy or convergence, then collapse or prune symbolic structures that orbit redundant attractors. This wouldn't just be compression; it could be dynamic conceptual simplification. When two symbolic paths gravitate toward the same high-level meaning, the model could merge them into a unified attractor state, reducing computational depth and enhancing coherence. Pruning is not a loss but a folding of symbolic meaning into tighter, more efficient structures.
Where traditional models skip layers, ABPM could remove redundant semantic loops, preserving meaning while minimizing effort.
Fascinating approach - layer skipping guided by interpretability is a smart direction. Curious to see how this scales and whether symbolic gating or attractor-based pruning could take it even further.