Abstract
A new tokenizer, SuperBPE, improves language model performance and efficiency by learning superword tokens that span whitespace in addition to ordinary subwords.
The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
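To make the pretokenization curriculum concrete, below is a minimal, illustrative Python sketch; it is not the authors' implementation, and names such as `train_superbpe` and `transition_point` are invented for this example. Stage 1 counts merge candidates only within whitespace-delimited words, as in standard BPE; after a chosen number of merges, stage 2 counts pairs over the raw text, so later merges can bridge whitespace and form superword tokens. A real tokenizer would additionally operate at the byte level and handle pretokenization and vocabulary serialization more carefully.

```python
# Minimal, illustrative sketch of the pretokenization curriculum described in the
# abstract (not the authors' code): stage 1 learns ordinary BPE merges within
# whitespace-delimited words; after `transition_point` merges, stage 2 counts
# pairs over the raw text so merges may bridge whitespace and form superwords.
from collections import Counter

def get_pair_counts(sequences):
    """Count adjacent symbol pairs across all token sequences."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(sequences, pair):
    """Greedily replace every occurrence of `pair` with its concatenation."""
    merged = pair[0] + pair[1]
    out = []
    for seq in sequences:
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                new_seq.append(merged)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        out.append(new_seq)
    return out

def train_superbpe(text, num_merges, transition_point):
    # Stage-1 view: one character sequence per whitespace-delimited word,
    # so no merge can cross a word boundary.
    words = [list(w) for w in text.split()]
    # Stage-2 view: the whole text as one sequence (spaces kept as symbols),
    # so merges may span whitespace.
    full = [list(text)]

    merges = []
    for step in range(num_merges):
        view = words if step < transition_point else full
        counts = get_pair_counts(view)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        # Apply each merge to both views so stage 2 starts from stage 1's vocabulary.
        words = merge_pair(words, best)
        full = merge_pair(full, best)
    return merges

# Toy run: the first merges stay inside words; later ones bridge spaces.
print(train_superbpe("by the way " * 3, num_merges=12, transition_point=5))
```

On this toy input, the first five merges build the subwords "by", "the", and "way", and the later merges join them across spaces until "by the way" becomes a single token, mirroring the multi-word-expression behavior described in the abstract.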
Community
Thank you for your work! While reviewing your paper, I noticed that when comparing the SuperBPE and standard BPE models, the SuperBPE model has a shorter context length and requires more training steps (with FLOPs held constant). I was wondering if you could report:
- The results of the current superbpe model under the same number of training steps.
- The results of retraining with a context length of 4096 for the same number of training steps.
I think the second scenario might better reflect real training practices.
Thank you!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Pretraining Language Models to Ponder in Continuous Space (2025)
- BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization (2025)
- MorphTok: Morphologically Grounded Tokenization for Indian Languages (2025)
- The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models (2025)
- AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings (2025)
- Let's Predict Sentence by Sentence (2025)
- Lugha-Llama: Adapting Large Language Models for African Languages (2025)