Toddler-LLM (fully pre-trained on CHILDES)

Overview

  • Model name: Toddler-LLM
  • Type: Decoder-only small LM for toddler-like dialogue
  • Status: Pre-trained entirely from scratch on child-directed speech, then post-trained with SFT and GRPO
  • Primary language: English
  • Target behavior: Coherent, short, child-like responses (approx. 2–3 years old)
  • Parameter count: ~155M (see config below)
  • Intended domain: Parent–child conversational exchanges

Model architecture

  • hidden_size: 672
  • intermediate_size: 1809
  • num_hidden_layers: 31
  • num_attention_heads: 12
  • num_key_value_heads: 4
  • max_position_embeddings: 256
  • vocab_size: 8192
  • tokenizer: lower-case only
  • tie_word_embeddings: true
  • rope_theta: 10000.0
  • max input length: 256 tokens
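
For reference, a minimal sketch of instantiating this configuration with Hugging Face transformers. The card does not name the architecture class, so LlamaConfig is an assumption based on the Llama/SmolLM2-style hyperparameter names above; the values are the card's own.

```python
# LlamaConfig is an assumption (the card does not state the architecture
# class); the hyperparameters below are taken verbatim from the card.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=672,
    intermediate_size=1809,
    num_hidden_layers=31,
    num_attention_heads=12,
    num_key_value_heads=4,        # grouped-query attention
    max_position_embeddings=256,
    vocab_size=8192,              # lower-case-only tokenizer
    tie_word_embeddings=True,
    rope_theta=10000.0,
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~156M, consistent with the ~155M above
```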

Training data

  • Source: CHILDES (filtered, English-only)
  • Approx. 14M tokens after filtering
  • Pretraining exclusively on child-directed speech (no large-scale adult corpora)
  • Data filtering (for downstream SFT/GRPO): RM-4 scores caregiver-utterance clarity to select the top 10% of “helpful” caregiver prompts; RM-2 scores the coherence of child utterances (see the filter sketch below)
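
A minimal sketch of the top-10% clarity filter, assuming each caregiver–child exchange already carries an RM-4 score; the rm4_clarity field name is illustrative, not the actual pipeline's schema.

```python
# Top-10% caregiver-prompt filter, assuming exchanges are pre-scored by RM-4.
# The "rm4_clarity" key is illustrative.
import numpy as np

def top_fraction(exchanges: list[dict], frac: float = 0.10) -> list[dict]:
    """Keep the top `frac` of exchanges by RM-4 caregiver-clarity score."""
    scores = np.array([ex["rm4_clarity"] for ex in exchanges])
    cutoff = np.quantile(scores, 1.0 - frac)
    return [ex for ex in exchanges if ex["rm4_clarity"] >= cutoff]
```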

Training procedure

  • Stage 1: Pre-training
    • Library: Nanotron
    • Objective: next-token prediction
    • Steps: 25,000 (~64 epochs)
    • Peak learning rate: 0.0025
    • Loss: converged to just above 1.0
  • Stage 2: Chat SFT
    • Adapted to the SmolLM2 Instruct chat template and special tokens
    • Library: unsloth (response-only SFT)
    • Curriculum over progressively higher-quality subsets (ranked by RM-2 and RM-4; see the curriculum sketch after this list):
      • Top 10%: LR 9e-4, 2 epochs
      • Top 5%: LR 8e-4, 14 epochs
      • Top 2.5%: LR 7e-4, 7 epochs
      • Top 1.25%: LR 6e-4, 3 epochs
    • Responses reached stable coherence as training loss settled around ≈ 0.45
  • Stage 3: GRPO optimization
    • GRPO learning rate: 1e-5
    • LoRA rank doubled relative to the SFT stage (rank = 128); target modules: q, k, v, o, gate, up, down projections
    • Steps: 1000
    • Reward weights: RM-1 (1.0), RM-2 (0.2), RM-3 (0.5); see the combined-reward sketch after this list
    • Selected best checkpoint by manual inspection for coherence + child-likeness
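
A minimal sketch of the Stage-2 curriculum loop with the schedule above. The actual pipeline used unsloth's response-only SFT; TRL's SFTTrainer is substituted here, and the model path, dataset name, and split labels are placeholders.

```python
# Curriculum loop over pre-scored SFT subsets (schedule from above).
# NOTE: the card's pipeline used unsloth response-only SFT; TRL's SFTTrainer
# is a stand-in, and all paths/names below are hypothetical.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

CURRICULUM = [  # (split, peak learning rate, epochs)
    ("top_10pct", 9e-4, 2),
    ("top_5pct", 8e-4, 14),
    ("top_2_5pct", 7e-4, 7),
    ("top_1_25pct", 6e-4, 3),
]

model = AutoModelForCausalLM.from_pretrained("toddler-llm-pretrained")  # placeholder path

for split, lr, epochs in CURRICULUM:
    dataset = load_dataset("childes-sft-scored", split=split)  # placeholder dataset
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=SFTConfig(
            learning_rate=lr,
            num_train_epochs=epochs,
            output_dir=f"sft-{split}",
        ),
    )
    trainer.train()
    model = trainer.model  # carry the fine-tuned weights into the next stage
```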
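
A minimal sketch of the Stage-3 reward combination, assuming each reward model returns a score in [0, 1] (the reward models themselves are described in the next section); argument names are illustrative.

```python
# Stage-3 scalar reward: weighted sum of the three reward-model scores,
# each assumed to lie in [0, 1]. Argument names are illustrative.
RM_WEIGHTS = {"rm1": 1.0, "rm2": 0.2, "rm3": 0.5}

def combined_reward(rm1_childlike: float, rm2_coherence: float,
                    rm3_length: float) -> float:
    """Weighted sum used as the GRPO reward for one completion."""
    return (RM_WEIGHTS["rm1"] * rm1_childlike
            + RM_WEIGHTS["rm2"] * rm2_coherence
            + RM_WEIGHTS["rm3"] * rm3_length)

# A perfectly child-like, coherent, single short sentence scores
# 1.0 + 0.2 + 0.5 = 1.7 (the maximum).
```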

Reward models and data filters used

  • RM-1 (Toddler-BERT): BERT classifier for “child-like” style, available at enochlev/childish_behavior_model
  • RM-2 (Coherence-BERT): BERT classifier trained with soft coherence labels, available at enochlev/child_coherence_model
    • Labels: 0.0–1.0 coherence scores assigned by Llama‑3.3‑70B (batched for consistency)
    • Training: 5 epochs, BCEWithLogitsLoss, LR 2e-5, weight decay 0.01, batch size 150, max length 96
  • RM-3 (Length PMF): Bayesian-estimated PMF over child-utterance lengths from CHILDES, temperature-smoothed and min–max normalized to [0, 1]; each sentence’s score is scaled by 1/max(1, number_of_punctuation_marks) to favor a single short sentence (see the sketch after this list)
  • RM-4 (Caregiver clarity): LLM-scored question clarity; used as a filter only (not a reward) to select top 10% caregiver prompts
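
A minimal sketch of RM-3, assuming length_counts is a histogram of child-utterance lengths (in tokens) tallied from CHILDES; the temperature value and the set of punctuation marks counted are assumptions, as the card does not state them.

```python
# RM-3 sketch: temperature-smoothed, min–max-normalized PMF over utterance
# lengths, with a penalty for multi-sentence outputs. `length_counts[i]` is
# assumed to hold the CHILDES frequency of child utterances of i tokens;
# the temperature default and punctuation set ".!?" are illustrative.
import numpy as np

def length_reward(sentence: str, length_counts: np.ndarray,
                  temperature: float = 2.0) -> float:
    # Smooth the raw histogram with a temperature, then renormalize.
    pmf = length_counts.astype(float) ** (1.0 / temperature)
    pmf /= pmf.sum()
    # Min–max normalize to [0, 1] so the most typical length scores 1.0.
    norm = (pmf - pmf.min()) / (pmf.max() - pmf.min())
    n_tokens = len(sentence.split())
    base = norm[min(n_tokens, len(norm) - 1)]
    # Scale by 1 / max(1, number of punctuation marks) to reward one
    # short sentence over several strung together.
    n_punct = sum(sentence.count(p) for p in ".!?")
    return float(base / max(1, n_punct))
```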

Inference and prompt format

  • Chat template: compatible with SmolLM2 Instruct-style templates
  • Guidance:
    • Input: single caregiver question or brief prompt
    • Output: one short, coherent sentence with age-appropriate vocabulary
  • Example prompt format (generic):
    • System: “You are a 2–3-year-old child speaking in short, simple sentences.”
    • User: “Caregiver: What did you have for dessert for lunch?”
    • Assistant (model): “i had some spaghettis.” (Note: spelling/grammar may be age-typical)
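
A minimal inference sketch with transformers, following the prompt format above; the repository id and sampling settings are placeholders, not the published values.

```python
# Minimal inference sketch; the repo id is a placeholder, not the published path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "enochlev/toddler-llm"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system",
     "content": "You are a 2-3-year-old child speaking in short, simple sentences."},
    {"role": "user", "content": "Caregiver: What did you have for dessert for lunch?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=32, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```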

Example generations (from evaluation set)

  • Caregiver Utterance: What did you have for dessert for lunch?
    • Gold Child Response: a ice cream sandwich
    • llm-toddler-30: i had some spaghettis.
  • Caregiver Utterance: What did we do with the toy?
    • Gold Child Response: we put it in his cage
    • llm-toddler-30: it broke.
  • Caregiver Utterance: Where is your bag?
    • Gold Child Response: I didn't bring it. It's in the car
    • llm-toddler-30: i got it
  • Caregiver Utterance: What did you find?
    • Gold Child Response: look. Elmo
    • llm-toddler-30: i found luke's lightsaber.
  • Caregiver Utterance: What did you put on the pizza?
    • Gold Child Response: cheese and cheese
    • llm-toddler-30: i put sause on it

Evaluation

  • Human evaluation (19 participants; two robot platforms: Cozmo, Misty II):
    • Perceived age: ~3 years (our model came closest to the 2–3-year target)
    • Coherence: comparable across models; higher with Cozmo than Misty on average
    • AoA (age of acquisition) and vocabulary: human speakers used broader vocabulary with higher AoA; the models stayed at lower AoA as intended; some SmolLM variants occasionally produced adult-level content
    • Notable: Participant expectations matched Cozmo’s child-like morphology/voice better than Misty’s

Intended use

  • Research on child-like conversational agents and human-robot interaction
  • Simulated child responses to caregiver prompts

Out-of-scope and limitations

  • Not for clinical, diagnostic, educational placement, or childcare decision-making
  • English-only; small corpus (≈14M tokens); limited world knowledge
  • Can produce off-context, random child-like words; may fixate on certain “baby words”
  • May generate age-inappropriate content in rare cases; monitor outputs
  • Sensitive to prompt phrasing; best with concise caregiver questions

Safety and ethical considerations

  • Use responsibly around minors; ensure adult supervision in interactive settings
  • Avoid anthropomorphizing beyond research context
  • Respect CHILDES data licenses and privacy norms
  • Models may reflect biases or artifacts from child-directed corpora