Tags: Fill-Mask · Transformers · Safetensors · modernbert · masked-lm · long-context

Overview

This checkpoint continues the pre-training of answerdotai/ModernBERT-base on Scandinavian text, processing ~1.2 trillion additional masked-language-modelling (MLM) tokens drawn from The Nordic Pile and SWEb while preserving the original 8,192-token context window.

This is a research artefact and is only intended for research purposes.

The tokenizer is trained from scratch on an 11 985 103 472-token subset of the training data.

Training is carried out in a single stage with a sequence length of 8 192 tokens per sample for the entire run.
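
The exact tokenizer-training recipe is not included in this card. Purely as an illustration, a byte-level BPE tokenizer can be trained from scratch with the Hugging Face `tokenizers` library roughly as follows; the corpus file, vocabulary size and special tokens below are placeholders, not values taken from this run.

```python
# Hypothetical sketch: training a byte-level BPE tokenizer from scratch.
# The file path, vocab_size and special tokens are placeholders.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_368,  # placeholder; match the checkpoint's actual vocabulary size
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"],
)
tokenizer.train(files=["scandinavian_subset.txt"], trainer=trainer)  # placeholder corpus file
tokenizer.save("tokenizer.json")
```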

Data Sources

| Corpus | Size | Selected languages | Highlights |
|---|---|---|---|
| The Nordic Pile | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, Books, Code, etc.), filtered and deduplicated for high quality |
| SWEb | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common Crawl snapshots with model-based HTML extraction; 1.2 B documents |

Training Setup

| Setting | Value |
|---|---|
| Parameters | 150 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 1.20 × 10¹² |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP-bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC LUMI-G system |
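
As a quick sanity check on the table, the batch figures are internally consistent: 192 sequences of 8 192 tokens give the stated 1 572 864 tokens per batch, and spreading 192 sequences over 64 GPUs yields the micro-batch of 3.

```python
# Sanity-check the batch arithmetic from the table above.
seq_len = 8_192        # context length (tokens per sample)
global_batch = 192     # sequences per optimizer step
num_gpus = 64          # 8 nodes x 8 MI250X GPUs

print(global_batch * seq_len)    # 1572864 -> "Tokens per batch"
print(global_batch // num_gpus)  # 3       -> micro-batch size per device
```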

See training details here

Training Stats

[token=1198511677292/1198510347252]:
  Train time/batch: 873585
  Train time/sample: 167728320
  Train time/batch_in_epoch: 3558
  Train time/sample_in_epoch: 683136
  Train time/token: 1198510256276
  Train time/token_in_epoch: 4882888303
  Train trainer/device_train_microbatch_size: 3
  Train loss/train/total: 0.9966
  Train throughput/batches_per_sec: 1.3117
  Train throughput/samples_per_sec: 251.8442
  Train throughput/device/batches_per_sec: 0.0205
  Train throughput/device/samples_per_sec: 3.9351
  Train throughput/tokens_per_sec: 1804244.5198
  Train throughput/device/tokens_per_sec: 28191.3206
  Train time/train: 184.5555
  Train time/val: 0.0000
  Train time/total: 184.5555
  Train lr-StableAdamW/group0: 0.0000
  Train lr-StableAdamW/group1: 0.0000
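
As a consistency check (not part of the original log), multiplying the per-device throughput figures by the 64 devices recovers the global figures to within rounding.

```python
# Cross-check per-device vs. global throughput from the log above.
num_devices = 64

print(28_191.3206 * num_devices)  # ~1804244.5 tokens/sec, matches tokens_per_sec
print(3.9351 * num_devices)       # ~251.8 samples/sec, matches samples_per_sec
```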

Intended Use

This is a research artefact and is only intended for research purposes.

  • Fill-mask inference, embedding extraction and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.); a minimal embedding-extraction sketch follows after this list.
  • Drop-in replacement for BERT-style encoders (omit token_type_ids).
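
A minimal sketch of embedding extraction, assuming mean pooling over the last hidden state (the card does not prescribe a pooling strategy, so treat this as one reasonable default rather than the recommended recipe):

```python
# Sentence embeddings via mean pooling over the last hidden state (assumed strategy).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "AI-Sweden-Models/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Huvudstaden i Sverige är Stockholm.", "Oslo ligger i Norge."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # note: no token_type_ids are passed

# Average the token representations, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```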

Fill-mask

from transformers import pipeline
unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-base')
unmasker("Huvudstaden i Sverige är [MASK].")
[{'score': 0.0629318505525589,
  'token': 2961,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är  Stockholm.'},
 {'score': 0.03635135293006897,
  'token': 49763,
  'token_str': 'awesome',
  'sequence': 'Huvudstaden i Sverige är awesome.'},
 {'score': 0.03006783314049244,
  'token': 751,
  'token_str': ' stor',
  'sequence': 'Huvudstaden i Sverige är  stor.'},
 {'score': 0.029827557504177094,
  'token': 71,
  'token_str': 'a',
  'sequence': 'Huvudstaden i Sverige är a.'},
 {'score': 0.019739385694265366,
  'token': 79,
  'token_str': 'i',
  'sequence': 'Huvudstaden i Sverige är i.'}]
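
The leading space in token strings such as ' Stockholm' comes from the BPE tokenizer, which encodes preceding whitespace as part of the token. For more control over decoding (for example a different top-k), the same query can be run without the pipeline; the following is a rough sketch using the generic AutoModelForMaskedLM API rather than anything specific to this card.

```python
# Pipeline-free sketch of the same fill-mask query, using the masked-LM head directly.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "AI-Sweden-Models/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("Huvudstaden i Sverige är [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five most likely fillers.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
top5 = torch.topk(probs, k=5)
for score, token_id in zip(top5.values[0], top5.indices[0]):
    print(f"{score:.4f}  {tokenizer.decode(int(token_id))!r}")
```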

Limitations & Biases

  • Web corpora can contain noise, stereotypes and sensitive content despite filtering.
  • RoPE extrapolation beyond 8k tokens is untested and may degrade; see the truncation snippet after this list.
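
Given the second point, it can make sense to truncate inputs explicitly to the trained context length; a minimal sketch with a placeholder input text:

```python
# Keep inputs within the 8 192-token window used during training;
# behaviour beyond that length is untested, per the note above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-base")
long_document = "En mycket lång svensk text. " * 5_000  # placeholder long input
inputs = tokenizer(long_document, truncation=True, max_length=8192, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most (1, 8192)
```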

Code to reproduce
