slm-125m

A 125M-parameter decoder-only language model (base, pretrained only). Part of the SLM model family, which is built entirely from scratch, from raw web data through to a production-ready aligned model.

This is the base variant, pretrained on a corpus curated to a 10B-token target, with no fine-tuning. It is suitable for research and as a starting point for further fine-tuning. For instruction following use tohio/slm-125m-instruct; for aligned conversation use tohio/slm-125m-chat.

Model Family

Variant	Hub	Description
Base	tohio/slm-125m	Pretrained only
Instruct	tohio/slm-125m-instruct	Chat + response-control + code SFT
Chat	tohio/slm-125m-chat	SFT + DPO aligned

Architecture

Component	Choice	Rationale
Positional encoding	RoPE	Better length generalisation, relative position awareness
Normalization	RMSNorm	Faster than LayerNorm, modern standard
Activation	SwiGLU	Better gradient flow, used by LLaMA and Mistral
Attention	GQA	Reduces KV cache memory at inference
Bias	None	Simpler, modern standard
Embeddings	Tied	Reduces parameters, effective at small scale
Vocab size	32,000	Custom BPE tokenizer trained on the pretraining corpus
Parameters	125.3M	125,264,640 parameters exactly
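
To make the GQA rationale concrete, the sketch below estimates the KV cache size for one sequence with and without grouped KV heads. The layer count, head counts, head dimension, and sequence length are hypothetical values chosen for illustration, not the published slm-125m configuration, and `kv_cache_bytes` is a helper invented for this example. It assumes 4-byte float32 values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=4):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2: every layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical shape for illustration only (NOT the real slm-125m config).
N_LAYERS = 12
N_QUERY_HEADS = 12   # full multi-head attention caches one KV head per query head
N_KV_HEADS = 4       # GQA shares each KV head across a group of query heads
HEAD_DIM = 64
SEQ_LEN = 2048

mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB")
print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"Reduction: {mha_bytes / gqa_bytes:.1f}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is the saving the table refers to.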

Training

Pretraining corpus: a 10B-token curation target, blended across the following sources:

Source	Target Share	Realized Share	Link
`common_crawl`	5.0%	5.00%	Common Crawl
`fineweb`	10.0%	10.00%	FineWeb
`fineweb_edu`	31.5%	31.50%	FineWeb-Edu
`wikipedia`	10.0%	10.00%	Wikipedia (EN)
`pg19`	2.5%	2.50%	PG-19 (Project Gutenberg)
`pes2o`	5.0%	5.00%	peS2o (academic papers)
`nemotron_cc_math`	7.0%	7.00%	Nemotron CC Math
`stackexchange`	1.0%	1.00%	StackExchange
`synthetic_arithmetic`	0.1%	0.15%	Synthetic arithmetic
`synthetic_task_code`	0.4%	0.39%	Synthetic task code
`educational_qa_mcq_math`	0.1%	0.15%	Educational QA/MCQ (math)
`educational_qa_mcq_general`	0.2%	0.25%	Educational QA/MCQ (general)
`factual_restraint`	0.1%	0.07%	Factual restraint
`nemotron_specialized`	12.0%	12.00%	Nemotron Specialized
`stack_v1`	12.45%	12.53%	The Stack v1 dedup
`codesearchnet`	2.25%	2.25%	CodeSearchNet
`stack_smol`	0.15%	0.09%	The Stack (smol)
`jupyter`	0.07%	0.06%	Jupyter notebooks
`conala`	0.07%	0.08%	CoNaLa

Realized: ~10.00B tokens (15,202,249 train + 76,393 val docs). Sources that cannot supply their full target share route the deficit to FineWeb.
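
As a rough illustration of how the realized shares translate into token counts, the sketch below multiplies each share by the corpus total. It assumes the shares are measured in tokens and uses 10,000,000,000 as a rounded stand-in for "~10.00B tokens", so the results are estimates rather than exact counts. `approx_tokens` is a hypothetical helper, not part of the slm codebase.

```python
# Realized shares (percent), copied from the table above.
REALIZED_SHARE_PCT = {
    "common_crawl": 5.00,
    "fineweb": 10.00,
    "fineweb_edu": 31.50,
    "wikipedia": 10.00,
    "pg19": 2.50,
    "pes2o": 5.00,
    "nemotron_cc_math": 7.00,
    "stackexchange": 1.00,
    "synthetic_arithmetic": 0.15,
    "synthetic_task_code": 0.39,
    "educational_qa_mcq_math": 0.15,
    "educational_qa_mcq_general": 0.25,
    "factual_restraint": 0.07,
    "nemotron_specialized": 12.00,
    "stack_v1": 12.53,
    "codesearchnet": 2.25,
    "stack_smol": 0.09,
    "jupyter": 0.06,
    "conala": 0.08,
}

# Assumption: the "~10.00B" total rounded to exactly 10 billion tokens.
TOTAL_TOKENS = 10_000_000_000


def approx_tokens(share_pct, total_tokens=TOTAL_TOKENS):
    """Estimate the token count for a source from its percentage share."""
    return round(total_tokens * share_pct / 100)


token_budget = {name: approx_tokens(pct) for name, pct in REALIZED_SHARE_PCT.items()}

# Print sources from largest to smallest estimated budget.
for name, tokens in sorted(token_budget.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {tokens / 1e9:6.3f}B")
```

Because the published shares are rounded to two decimals, they need not sum to exactly 100%.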

Hardware: NVIDIA H200 (pretraining on 1× H200, fine-tuning on 1× H200)

Evaluation

Evaluated using lm-evaluation-harness.

Benchmark	Few-shot	Metric	Score
HellaSwag	10-shot	acc_norm	0.3115
ARC-Easy	25-shot	acc_norm	0.4903
ARC-Challenge	25-shot	acc_norm	0.2491
MMLU	5-shot	acc	0.2526
TruthfulQA	0-shot	acc	0.4211
HumanEval	0-shot	pass@1	0.0000
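
For context when reading the multiple-choice rows, the sketch below compares the reported scores with the accuracy of uniform random guessing. It assumes four answer options per question, giving a chance level of 0.25; this holds for HellaSwag and MMLU and for most, but not all, ARC questions. TruthfulQA and HumanEval are left out because the table does not describe them as four-option tasks. `margin_over_chance` is a hypothetical helper.

```python
# Scores copied from the evaluation table above.
SCORES = {
    "HellaSwag": 0.3115,
    "ARC-Easy": 0.4903,
    "ARC-Challenge": 0.2491,
    "MMLU": 0.2526,
}

# Assumption: four options per question, so uniform guessing scores 0.25.
CHANCE = 0.25


def margin_over_chance(score, chance=CHANCE):
    """Difference between a reported score and the random-guess baseline."""
    return round(score - chance, 4)


margins = {name: margin_over_chance(score) for name, score in SCORES.items()}

for name, margin in margins.items():
    print(f"{name:14s} {margin:+.4f}")
```

A margin near zero means the score is hard to distinguish from guessing under this assumption.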

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tohio/slm-125m",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "tohio/slm-125m",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "Answer clearly and concisely."},
    {"role": "user", "content": "Explain what a transformer is."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
)

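# Stop on either the tokenizer's EOS token or the chat end-of-turn marker.
endofturn_id = tokenizer.convert_tokens_to_ids("<|endofturn|>")

# Fall back to EOS for padding only when no pad token is defined.
# An explicit None check is needed because a pad token ID of 0 is falsy.
pad_token_id = (
    tokenizer.pad_token_id
    if tokenizer.pad_token_id is not None
    else tokenizer.eos_token_id
)

output = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=False,  # greedy decoding
    repetition_penalty=1.1,
    pad_token_id=pad_token_id,
    eos_token_id=[tokenizer.eos_token_id, endofturn_id],
)

# Decode only the newly generated tokens, skipping the prompt.
input_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][input_len:], skip_special_tokens=True))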

trust_remote_code=True loads the custom SLM architecture bundled alongside the model weights — no local install of the tohio/slm codebase required.
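
The last two lines of the usage snippet slice off the prompt before decoding, since `generate` returns the prompt tokens followed by the new ones. The same post-processing can be sketched on plain lists of token IDs. `trim_generation` is a hypothetical helper and every ID below is made up for illustration; in the snippet itself, special tokens are dropped at decode time by `skip_special_tokens=True`.

```python
def trim_generation(output_ids, input_len, stop_ids):
    """Return only the newly generated token IDs, cut at the first stop token."""
    new_ids = list(output_ids[input_len:])  # drop the echoed prompt tokens
    for i, token_id in enumerate(new_ids):
        if token_id in stop_ids:
            return new_ids[:i]  # exclude the stop token and anything after it
    return new_ids


# Hypothetical IDs: 7 stands in for <|endofturn|>, 2 for EOS, 0 for padding.
prompt_ids = [1, 15, 27]
output_ids = prompt_ids + [42, 43, 44, 7, 0, 0]

print(trim_generation(output_ids, len(prompt_ids), stop_ids={2, 7}))
```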

Limitations

Scale: At 125M parameters this model is significantly smaller than frontier models. It will underperform on complex reasoning, long-context tasks, and domains not well-represented in the pretraining data.
Hallucination: Like all language models, this model can generate plausible-sounding but factually incorrect content. Outputs should not be used as a source of truth without independent verification.
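Safety: This base variant has received no alignment training. DPO alignment is applied only in tohio/slm-125m-chat, where it provides basic harmlessness training but does not guarantee safe outputs in all contexts. This model has not undergone red-teaming or adversarial safety evaluation.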
Languages: Training data is predominantly English. Performance on other languages will be significantly degraded.
Code: Code generation is primarily Python-oriented, reflecting the code sub-mix distribution used in pretraining and, for the fine-tuned variants, in SFT.

slm — full training pipeline (data curation through serving)
ai-infra — production Kubernetes serving via vLLM

Safetensors: model size 0.1B params, tensor type F32
