slm-125m

A 125M decoder-only language model (base pretrained model). Part of the SLM model family, built entirely from scratch, from raw web data through to a production-ready aligned model.

This is the base variant: pretrained on a corpus with a 10B-token curation target, with no fine-tuning. It is suitable for research and as a starting point for further fine-tuning. Use tohio/slm-125m-instruct for instruction following or tohio/slm-125m-chat for aligned conversation.

Model Family

Variant   Hub                      Description
Base      tohio/slm-125m           Pretrained only
Instruct  tohio/slm-125m-instruct  Chat + response-control + code SFT
Chat      tohio/slm-125m-chat      SFT + DPO aligned

Architecture

Component            Choice   Rationale
Positional encoding  RoPE     Better length generalisation, relative position awareness
Normalization        RMSNorm  Faster than LayerNorm, modern standard
Activation           SwiGLU   Better gradient flow, used by LLaMA and Mistral
Attention            GQA      Reduces KV cache memory at inference
Bias                 None     Simpler, modern standard
Embeddings           Tied     Reduces parameters, effective at small scale
Vocab size           32,000   Custom BPE tokenizer trained on the pretraining corpus
Parameters           125.3M   125,264,640 parameters
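
To make the GQA and tied-embedding rows concrete, here is a minimal sketch of the arithmetic behind them. The layer count, head counts, head size, and sequence length are hypothetical values chosen for illustration; the card does not state the model's actual dimensions, so the printed figures are not slm-125m's real footprint. Only the 32,000 vocabulary size comes from the table, and 4 bytes per value assumes float32 storage.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=4):
    """Bytes needed to cache keys and values for one sequence.

    Each layer stores one key and one value vector of size head_dim
    per KV head per token, hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def tied_embedding_savings(vocab_size, d_model):
    """Parameters saved by sharing the input embedding with the output head."""
    return vocab_size * d_model


# Hypothetical dimensions, NOT taken from the model card.
N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM = 12, 12, 4, 64
SEQ_LEN = 2048
VOCAB_SIZE = 32_000  # from the architecture table

# Standard multi-head attention: every query head has its own K/V head.
mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention: several query heads share one K/V head.
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha / 2**20:.0f} MiB, GQA cache: {gqa / 2**20:.0f} MiB")
saved = tied_embedding_savings(VOCAB_SIZE, N_HEADS * HEAD_DIM)
print(f"Tied embeddings save {saved:,} parameters")
```

The cache shrinks by the ratio of query heads to KV heads, which is why GQA matters most at long sequence lengths and large batch sizes.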

Training

Pretraining corpus: a 10B-token curation target blended across the following sources:

Source                      Target Share  Realized Share  Link
common_crawl                5.0%          5.00%           Common Crawl
fineweb                     10.0%         10.00%          FineWeb
fineweb_edu                 31.5%         31.50%          FineWeb-Edu
wikipedia                   10.0%         10.00%          Wikipedia (EN)
pg19                        2.5%          2.50%           PG-19 (Project Gutenberg)
pes2o                       5.0%          5.00%           peS2o (academic papers)
nemotron_cc_math            7.0%          7.00%           Nemotron CC Math
stackexchange               1.0%          1.00%           StackExchange
synthetic_arithmetic        0.1%          0.15%           Synthetic arithmetic
synthetic_task_code         0.4%          0.39%           Synthetic task code
educational_qa_mcq_math     0.1%          0.15%           Educational QA/MCQ (math)
educational_qa_mcq_general  0.2%          0.25%           Educational QA/MCQ (general)
factual_restraint           0.1%          0.07%           Factual restraint
nemotron_specialized        12.0%         12.00%          Nemotron Specialized
stack_v1                    12.45%        12.53%          The Stack v1 dedup
codesearchnet               2.25%         2.25%           CodeSearchNet
stack_smol                  0.15%         0.09%           The Stack (smol)
jupyter                     0.07%         0.06%           Jupyter notebooks
conala                      0.07%         0.08%           CoNaLa

Realized: ~10.00B tokens (15,202,249 train + 76,393 val docs). Supply-bound sources route their deficit to FineWeb.
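
As a rough illustration of how the shares translate into token budgets, the sketch below converts each target share into a token count against the 10B target and reports how far each realized share drifted from its target. The share values are copied from the table; the helper names are hypothetical and this is not the pipeline's actual blending code. Because the published shares are rounded, they do not sum to exactly 100%.

```python
TARGET_TOKENS = 10_000_000_000  # 10B-token curation target stated in the card

# source: (target share %, realized share %), copied from the table above
MIX = {
    "common_crawl": (5.0, 5.00),
    "fineweb": (10.0, 10.00),
    "fineweb_edu": (31.5, 31.50),
    "wikipedia": (10.0, 10.00),
    "pg19": (2.5, 2.50),
    "pes2o": (5.0, 5.00),
    "nemotron_cc_math": (7.0, 7.00),
    "stackexchange": (1.0, 1.00),
    "synthetic_arithmetic": (0.1, 0.15),
    "synthetic_task_code": (0.4, 0.39),
    "educational_qa_mcq_math": (0.1, 0.15),
    "educational_qa_mcq_general": (0.2, 0.25),
    "factual_restraint": (0.1, 0.07),
    "nemotron_specialized": (12.0, 12.00),
    "stack_v1": (12.45, 12.53),
    "codesearchnet": (2.25, 2.25),
    "stack_smol": (0.15, 0.09),
    "jupyter": (0.07, 0.06),
    "conala": (0.07, 0.08),
}


def token_budget(share_pct, total=TARGET_TOKENS):
    """Tokens allotted to a source that holds share_pct percent of the mix."""
    return round(total * share_pct / 100)


def share_drift(mix):
    """Realized minus target share per source, in percentage points."""
    return {name: round(real - tgt, 2) for name, (tgt, real) in mix.items()}


drift = share_drift(MIX)
for name, (target, _) in MIX.items():
    print(f"{name:28s} {token_budget(target):>14,d} tokens  drift {drift[name]:+.2f} pp")
```

A negative drift marks a source that fell short of its target share, and a positive drift marks one that ended up over it.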

Hardware: NVIDIA H200 (pretraining on 1× H200, fine-tuning on 1× H200)

Evaluation

Evaluated using lm-evaluation-harness.

Benchmark      Few-shot  Metric    Score
HellaSwag      10-shot   acc_norm  0.3115
ARC-Easy       25-shot   acc_norm  0.4903
ARC-Challenge  25-shot   acc_norm  0.2491
MMLU           5-shot    acc       0.2526
TruthfulQA     0-shot    acc       0.4211
HumanEval      0-shot    pass@1    0.0000
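
The card does not give the exact evaluation command, so the following is only a sketch of how a comparable run might be set up through the harness's Python entry point, `lm_eval.simple_evaluate`. The few-shot counts come from the table; the task identifiers, the `hf` backend arguments, and both helper names are assumptions. TruthfulQA and HumanEval are left out because the card does not say which task variants were used.

```python
# Few-shot settings copied from the table above. The task names are assumed
# lm-evaluation-harness identifiers, not confirmed by the card.
FEWSHOT_BY_TASK = {
    "hellaswag": 10,
    "arc_easy": 25,
    "arc_challenge": 25,
    "mmlu": 5,
}


def build_eval_kwargs(task, model_id="tohio/slm-125m"):
    """Assemble keyword arguments for lm_eval.simple_evaluate (hypothetical helper)."""
    return {
        "model": "hf",
        "model_args": f"pretrained={model_id},trust_remote_code=True",
        "tasks": [task],
        "num_fewshot": FEWSHOT_BY_TASK[task],
    }


def run_benchmark(task):
    """Run one benchmark; requires `pip install lm-eval` and downloads the model."""
    import lm_eval  # imported lazily so the helpers above work without it

    results = lm_eval.simple_evaluate(**build_eval_kwargs(task))
    return results["results"][task]
```

Calling `run_benchmark("hellaswag")` would return that task's metric dictionary. Scores can differ from the table depending on harness version, batch size, and dtype.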

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights and the custom architecture code shipped with the repo.
model = AutoModelForCausalLM.from_pretrained(
    "tohio/slm-125m",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "tohio/slm-125m",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "Answer clearly and concisely."},
    {"role": "user", "content": "Explain what a transformer is."},
]

# Render the conversation with the tokenizer's chat template and tokenize it.
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
)

# Stop on the end-of-turn marker as well as the regular EOS token.
endofturn_id = tokenizer.convert_tokens_to_ids("<|endofturn|>")

# Fall back to EOS only when no pad token is defined (a pad id of 0 is valid).
pad_id = tokenizer.pad_token_id
if pad_id is None:
    pad_id = tokenizer.eos_token_id

output = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=False,  # greedy decoding
    repetition_penalty=1.1,
    pad_token_id=pad_id,
    eos_token_id=[tokenizer.eos_token_id, endofturn_id],
)

# Decode only the newly generated tokens, not the echoed prompt.
input_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][input_len:], skip_special_tokens=True))

trust_remote_code=True loads the custom SLM architecture bundled alongside the model weights, so no local install of the tohio/slm codebase is required.
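
The last steps of the usage snippet, stopping on either of two end tokens and dropping the prompt before decoding, can be illustrated on plain lists of token ids. This is a simplified sketch of that post-processing, not how `generate` is implemented: in the real call the stop token stays in the output and is removed by `skip_special_tokens=True`. The helper name and all token ids below are made up for illustration.

```python
def extract_completion(output_ids, input_len, stop_ids):
    """Return only the newly generated ids, cut at the first stop token."""
    new_ids = output_ids[input_len:]  # drop the echoed prompt
    for i, tok in enumerate(new_ids):
        if tok in stop_ids:  # EOS or end-of-turn reached
            return new_ids[:i]
    return new_ids  # ran to max_new_tokens without hitting a stop token


# Made-up ids: a 3-token prompt, 4 generated tokens, then a stop token (id 2).
output = [11, 12, 13, 40, 41, 42, 43, 2]
print(extract_completion(output, input_len=3, stop_ids={1, 2}))
# prints [40, 41, 42, 43]
```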

Limitations

  • Scale: At 125M parameters this model is significantly smaller than frontier models. It will underperform on complex reasoning, long-context tasks, and domains not well-represented in the pretraining data.
  • Hallucination: Like all language models, this model can generate plausible-sounding but factually incorrect content. Outputs should not be used as a source of truth without independent verification.
  • Safety: As a pretrained-only base model, this variant has had no SFT or DPO alignment (DPO is applied only to tohio/slm-125m-chat) and offers no guarantee of safe outputs. This model has not undergone red-teaming or adversarial safety evaluation.
  • Languages: Training data is predominantly English. Performance on other languages will be significantly degraded.
  • Code: Code generation is primarily Python-oriented, reflecting the code sub-mix distribution used in pretraining (and in SFT for the instruct and chat variants).

Related

  • slm: full training pipeline (data curation through serving)
  • ai-infra: production Kubernetes serving via vLLM