Tags: Fill-Mask · Transformers · Safetensors · modernbert · masked-lm · long-context

Overview

This checkpoint continues the pre-training of answerdotai/ModernBERT-base on Scandinavian text, processing ~1.2 trillion additional masked-language-modelling (MLM) tokens drawn from The Nordic Pile and SWEb while preserving the original 8,192-token context window.

This is a research artefact and is only intended for research purposes.

The tokenizer is trained from scratch on an 11 985 103 472-token subset of the training data.

Training is carried out in a single stage with a sequence length of 8 192 tokens per sample for the entire run.
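
The exact tokenizer-training recipe is not included in this card. Purely as an illustration, a byte-level BPE tokenizer can be trained from scratch with the Hugging Face `tokenizers` library roughly as follows; the corpus file, vocabulary size and special tokens below are placeholders, not values taken from this run.

```python
# Hypothetical sketch: training a byte-level BPE tokenizer from scratch.
# The file path, vocab_size and special tokens are placeholders.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_368,  # placeholder; match the checkpoint's actual vocabulary size
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"],
)
tokenizer.train(files=["scandinavian_subset.txt"], trainer=trainer)  # placeholder corpus file
tokenizer.save("tokenizer.json")
```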

Data Sources

| Corpus | Size | Selected languages | Highlights |
|---|---|---|---|
| The Nordic Pile | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, Books, Code, etc.), filtered and deduplicated for high quality |
| SWEb | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common Crawl snapshots with model-based HTML extraction; 1.2 B documents |

Training Setup

| Setting | Value |
|---|---|
| Parameters | 150 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 1.20 × 10¹² |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP-bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC LUMI-G system |
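
As a quick sanity check on the table, the batch figures are internally consistent: 192 sequences of 8 192 tokens give the stated 1 572 864 tokens per batch, and spreading 192 sequences over 64 GPUs yields the micro-batch of 3.

```python
# Sanity-check the batch arithmetic from the table above.
seq_len = 8_192        # context length (tokens per sample)
global_batch = 192     # sequences per optimizer step
num_gpus = 64          # 8 nodes x 8 MI250X GPUs

print(global_batch * seq_len)    # 1572864 -> "Tokens per batch"
print(global_batch // num_gpus)  # 3       -> micro-batch size per device
```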

See training details here

Training Stats

[token=1198511677292/1198510347252]:
  Train time/batch: 873585
  Train time/sample: 167728320
  Train time/batch_in_epoch: 3558
  Train time/sample_in_epoch: 683136
  Train time/token: 1198510256276
  Train time/token_in_epoch: 4882888303
  Train trainer/device_train_microbatch_size: 3
  Train loss/train/total: 0.9966
  Train throughput/batches_per_sec: 1.3117
  Train throughput/samples_per_sec: 251.8442
  Train throughput/device/batches_per_sec: 0.0205
  Train throughput/device/samples_per_sec: 3.9351
  Train throughput/tokens_per_sec: 1804244.5198
  Train throughput/device/tokens_per_sec: 28191.3206
  Train time/train: 184.5555
  Train time/val: 0.0000
  Train time/total: 184.5555
  Train lr-StableAdamW/group0: 0.0000
  Train lr-StableAdamW/group1: 0.0000
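
As a consistency check (not part of the original log), multiplying the per-device throughput figures by the 64 devices recovers the global figures to within rounding.

```python
# Cross-check per-device vs. global throughput from the log above.
num_devices = 64

print(28_191.3206 * num_devices)  # ~1804244.5 tokens/sec, matches tokens_per_sec
print(3.9351 * num_devices)       # ~251.8 samples/sec, matches samples_per_sec
```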

Intended Use

This is a research artefact and is only intended for research purposes.

  • Fill-mask inference, embedding extraction and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.); a minimal embedding-extraction sketch follows after this list.
  • Drop-in replacement for BERT-style encoders (omit token_type_ids).
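
A minimal sketch of embedding extraction, assuming mean pooling over the last hidden state (the card does not prescribe a pooling strategy, so treat this as one reasonable default rather than the recommended recipe):

```python
# Sentence embeddings via mean pooling over the last hidden state (assumed strategy).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "AI-Sweden-Models/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Huvudstaden i Sverige är Stockholm.", "Oslo ligger i Norge."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # note: no token_type_ids are passed

# Average the token representations, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```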

Fill-mask

from transformers import pipeline
unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-base')
unmasker("Huvudstaden i Sverige är [MASK].")
[{'score': 0.0629318505525589,
  'token': 2961,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är  Stockholm.'},
 {'score': 0.03635135293006897,
  'token': 49763,
  'token_str': 'awesome',
  'sequence': 'Huvudstaden i Sverige är awesome.'},
 {'score': 0.03006783314049244,
  'token': 751,
  'token_str': ' stor',
  'sequence': 'Huvudstaden i Sverige är  stor.'},
 {'score': 0.029827557504177094,
  'token': 71,
  'token_str': 'a',
  'sequence': 'Huvudstaden i Sverige är a.'},
 {'score': 0.019739385694265366,
  'token': 79,
  'token_str': 'i',
  'sequence': 'Huvudstaden i Sverige är i.'}]
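
The leading space in token strings such as ' Stockholm' comes from the BPE tokenizer, which encodes preceding whitespace as part of the token. For more control over decoding (for example a different top-k), the same query can be run without the pipeline; the following is a rough sketch using the generic AutoModelForMaskedLM API rather than anything specific to this card.

```python
# Pipeline-free sketch of the same fill-mask query, using the masked-LM head directly.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "AI-Sweden-Models/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("Huvudstaden i Sverige är [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five most likely fillers.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
top5 = torch.topk(probs, k=5)
for score, token_id in zip(top5.values[0], top5.indices[0]):
    print(f"{score:.4f}  {tokenizer.decode(int(token_id))!r}")
```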

Limitations & Biases

  • Web corpora can contain noise, stereotypes and sensitive content despite filtering.
  • RoPE extrapolation beyond 8k tokens is untested and may degrade; see the truncation snippet after this list.
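
Given the second point, it can make sense to truncate inputs explicitly to the trained context length; a minimal sketch with a placeholder input text:

```python
# Keep inputs within the 8 192-token window used during training;
# behaviour beyond that length is untested, per the note above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-base")
long_document = "En mycket lång svensk text. " * 5_000  # placeholder long input
inputs = tokenizer(long_document, truncation=True, max_length=8192, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most (1, 8192)
```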

Code to reproduce
