Abstract
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
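For readers less familiar with the two objectives being compared, here is a rough sketch in standard notation (the notation below is ours, not the paper's): given a token sequence x = (x_1, ..., x_T), CLM predicts each token from its left context only, while MLM predicts a randomly masked subset of positions M from a corrupted copy of the sequence that the model reads bidirectionally.

```latex
% Sketch of the two pretraining objectives (notation ours, not the paper's).
% CLM: next-token prediction from the left context only.
\mathcal{L}_{\mathrm{CLM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
% MLM: predict the masked positions \mathcal{M} from the corrupted sequence \tilde{x},
% which the model reads with bidirectional attention.
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{t \in \mathcal{M}} \log p_\theta\left(x_t \mid \tilde{x}\right)
```

In this notation, the biphasic strategy described above first minimizes the CLM loss and then continues pretraining the same weights with the MLM loss.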
Community
Let's have a look at token classification:
There's a problem with the assumption that CLM is better than MLM:
All of the MLM models use EuroBERT, and EuroBERT performs quite poorly on token classification; see the EuroBERT paper (Table 1: https://arxiv.org/pdf/2503.05500).
So the assumption may hold when using EuroBERT, but definitely not for e.g. XLM-R.
I made comparisons with GPT-2 and BERT on CoNLL-2003 years ago. CLM is clearly behind, see here.
Are there text representation scenarios, such as embedding-based retrieval (vector recall), where encoder models, with their bidirectional attention, are likely to outperform decoders of the same scale?
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/should-we-still-pretrain-encoders-with-masked-language-modeling
Thanks for sharing 🤗! For those who want a quick read, we’ve also written a blog post: https://huggingface.co/blog/Nicolas-BZRD/encoders-should-not-be-only-pre-trained-with-mlm
I'm sorry for asking this with only (at best) half knowledge of the subject matter, but I felt encouraged to do so anyway since you explicitly mention ELECTRA in the related research section: do you think its training approach would make a difference if you could get your hands on a collection of ELECTRA models and subject them to the same series of tests? The point of ELECTRA was precisely improved data efficiency, if I understand correctly. I have been wondering why there is so little uptake of this training approach...
Hi, and thanks for your question!
You're absolutely right to mention ELECTRA, especially given its focus on sample efficiency.
Overall, we chose to focus on MLM because it remains the most widely adopted encoder pretraining objective in recent literature.
Another important reason is that comparing MLM with CLM is more straightforward, since both are generative objectives that train the model to predict tokens. In contrast, ELECTRA frames pretraining as a discrimination task (replaced token detection), which introduces a more fundamental shift in training dynamics.
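To make that contrast concrete, here is a rough sketch of the two loss families (the notation is ours; the discriminator form follows the original ELECTRA formulation): MLM and CLM both maximize token likelihoods, whereas ELECTRA trains a discriminator D_θ to decide, at every position, whether the token is original or was replaced by a small generator.

```latex
% Generative objectives (MLM and CLM): the model is trained to predict tokens.
\mathcal{L}_{\mathrm{gen}}(\theta) = -\sum_{t} \log p_\theta\left(x_t \mid \text{context}_t\right)
% ELECTRA-style replaced token detection: a binary decision at every position,
% where \tilde{x} is the input with masked positions filled in by a small generator.
\mathcal{L}_{\mathrm{RTD}}(\theta) = -\sum_{t=1}^{T} \Big[ \mathbf{1}\left[\tilde{x}_t = x_t\right] \log D_\theta(\tilde{x}, t)
  + \mathbf{1}\left[\tilde{x}_t \neq x_t\right] \log\left(1 - D_\theta(\tilde{x}, t)\right) \Big]
```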
We also noticed that some well-known ELECTRA-trained models, such as DeBERTaV3, tend to underperform on retrieval tasks, a scenario we aimed to avoid in our study.
That said, it's a very thoughtful point and definitely an interesting direction for future work.
Cheers!