Toddler-LLM (fully pre-trained on CHILDES)
Overview
- Model name: Toddler-LLM
- Type: Decoder-only small LM for toddler-like dialogue
- Status: Fully pre-trained from scratch on child-directed speech, then SFT + GRPO
- Primary language: English
- Target behavior: Coherent, short, child-like responses (approx. 2–3 years old)
- Parameter count: ~155M (see config below)
- Intended domain: Parent–child conversational exchanges
Model architecture
- hidden_size: 672
- intermediate_size: 1809
- num_hidden_layers: 31
- num_attention_heads: 12
- num_key_value_heads: 4
- max_position_embeddings: 256
- vocab_size: 8192
- tokenizer: lower-case-only
- tie_word_embeddings: true
- rope_theta: 10000.0
- max input length: 256 tokens
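For reference, the architecture above maps onto a Hugging Face transformers config roughly as follows. This is a minimal sketch: LlamaConfig is an assumed stand-in for the actual decoder-only config class, and any field not listed in the card keeps its library default.

```python
from transformers import LlamaConfig

# Sketch of the architecture above as a transformers config.
# LlamaConfig is an assumption; unlisted fields keep their defaults.
config = LlamaConfig(
    hidden_size=672,
    intermediate_size=1809,
    num_hidden_layers=31,
    num_attention_heads=12,
    num_key_value_heads=4,      # grouped-query attention (12 Q heads, 4 KV heads)
    max_position_embeddings=256,
    vocab_size=8192,
    tie_word_embeddings=True,
    rope_theta=10000.0,
)
```

With these dimensions and tied embeddings, the parameter count works out to roughly 155M, consistent with the overview.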
Training data
- Source: CHILDES (filtered, English-only)
- Approx. 14M tokens after filtering
- Pretraining exclusively on child-directed speech (no large-scale adult corpora)
- Data filtering (for downstream SFT/GRPO; see the sketch below):
  - Caregiver prompts: clarity scored with RM-4; only the top 10% "helpful" caregiver prompts are kept
  - Child utterances: coherence scored with RM-2
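A minimal sketch of this filtering step, assuming `rm4_score` and `rm2_score` are callables wrapping the reward models described later (both names are hypothetical):

```python
import numpy as np

def filter_sft_pairs(pairs, rm4_score, rm2_score, top_frac=0.10):
    """Keep caregiver/child pairs whose caregiver-prompt clarity (RM-4)
    falls in the top `top_frac`, and attach RM-2 coherence scores used by
    the SFT curriculum below. `rm4_score` / `rm2_score` are hypothetical
    callables returning floats in [0, 1]."""
    clarity = np.array([rm4_score(p["caregiver"]) for p in pairs])
    cutoff = np.quantile(clarity, 1.0 - top_frac)
    kept = [p for p, c in zip(pairs, clarity) if c >= cutoff]
    for p in kept:
        p["coherence"] = rm2_score(p["caregiver"], p["child"])
    return kept
```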
Training procedure
- Stage 1: Pre-training
- Library: Nanotron
- Objective: next-token prediction
- Steps: 25,000 (~64 epochs)
- Peak learning rate: 0.0025
- Loss: converged just above 1.0
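The pre-training objective is standard next-token prediction; the PyTorch snippet below shows the loss schematically (Nanotron's actual implementation differs in details such as parallelism and fused kernels):

```python
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    """Causal LM objective: predict token t from tokens < t.
    logits: (batch, seq_len, vocab); input_ids: (batch, seq_len)."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets shifted by one position
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```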
- Stage 2: Chat SFT
- Adapted to SmolLM2 Instruct chat template and special tokens
- Library: unsloth (response-only SFT)
- Curriculum over progressively higher-quality subsets (ranked by RM-2 and RM-4; see the schematic loop below):
- Top 10%: LR 9e-4, 2 epochs
- Top 5%: LR 8e-4, 14 epochs
- Top 2.5%: LR 7e-4, 7 epochs
- Top 1.25%: LR 6e-4, 3 epochs
- Reached stable response coherence around loss ≈ 0.45
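The curriculum stages above can be summarized as a loop. In this sketch, `sft_pairs` carries the RM-2 coherence scores attached during filtering, and `train_sft` is a hypothetical wrapper around the unsloth response-only SFT trainer:

```python
# Schematic curriculum over the RM-scored subsets listed above.
# `train_sft` is a hypothetical wrapper around the unsloth trainer
# with response-only loss masking; `sft_pairs` comes from filtering.
def select_top_frac(pairs, frac):
    """Keep the top `frac` of pairs ranked by RM-2 coherence."""
    ranked = sorted(pairs, key=lambda p: p["coherence"], reverse=True)
    return ranked[: max(1, int(len(ranked) * frac))]

CURRICULUM = [  # (fraction kept, learning rate, epochs), from the table above
    (0.10, 9e-4, 2),
    (0.05, 8e-4, 14),
    (0.025, 7e-4, 7),
    (0.0125, 6e-4, 3),
]

for top_frac, lr, epochs in CURRICULUM:
    subset = select_top_frac(sft_pairs, frac=top_frac)
    model = train_sft(model, subset, learning_rate=lr, num_train_epochs=epochs)
```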
- Stage 3: GRPO optimization
- GRPO learning rate: 1e-5
- LoRA rank doubled vs. the SFT stage (rank = 128); target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Steps: 1000
- Reward weights: RM-1 (1.0), RM-2 (0.2), RM-3 (0.5), combined linearly (see the sketch below)
- Selected best checkpoint by manual inspection for coherence + child-likeness
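A sketch of how the weights combine into the scalar GRPO reward, where `rm1`, `rm2`, and `rm3` are hypothetical callables wrapping the reward models described in the next section:

```python
def total_reward(caregiver, response, rm1, rm2, rm3):
    """Weighted sum of reward-model scores used for GRPO (weights from the
    card); each callable is assumed to return a float in [0, 1]."""
    return (
        1.0 * rm1(response)               # RM-1: child-like style
        + 0.2 * rm2(caregiver, response)  # RM-2: coherence with the prompt
        + 0.5 * rm3(response)             # RM-3: length prior
    )
```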
Reward models and data filters used
- RM-1 (Toddler-BERT): BERT classifier for "child-like" style; available at enochlev/childish_behavior_model
- RM-2 (Coherence-BERT): BERT classifier trained with soft labels for coherence; available at enochlev/child_coherence_model
  - Soft labels (0.0–1.0) produced by Llama-3.3-70B, batched for consistency
  - Training: 5 epochs, BCEWithLogitsLoss, LR 2e-5, weight decay 0.01, batch size 150, max length 96 (see the sketch after this list)
- RM-3 (Length PMF): Bayesian-based PMF over child sentence lengths from CHILDES, min–max normalized to [0, 1] with a temperature for smoothness; each response's score is scaled by 1 / max(1, number_of_punctuation_marks) to encourage a single short sentence (see the sketch after this list)
- RM-4 (Caregiver clarity): LLM-scored question clarity; used as a filter only (not a reward) to select top 10% caregiver prompts
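A minimal sketch of the RM-2 training setup implied by the hyperparameters above: a single-logit BERT head regressed onto soft labels with BCEWithLogitsLoss. The base checkpoint (bert-base-uncased) and the data handling are assumptions.

```python
import torch
from torch import nn
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Single-logit BERT regressor trained on soft labels in [0, 1].
# `bert-base-uncased` as the base checkpoint is an assumption; the
# hyperparameters (LR 2e-5, weight decay 0.01, max length 96) are from the card.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1
)
opt = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(texts, soft_labels):
    """One optimizer step over a batch (the card uses batch size 150, 5 epochs)."""
    batch = tok(texts, truncation=True, max_length=96,
                padding=True, return_tensors="pt")
    logits = model(**batch).logits.squeeze(-1)         # shape: (batch_size,)
    loss = loss_fn(logits, torch.tensor(soft_labels))  # soft targets in [0, 1]
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```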
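And a sketch of one plausible reading of the RM-3 length score: the card's "Bayesian-based PMF" and temperature smoothing are not fully specified, so the construction below (empirical PMF with temperature flattening) is an assumption.

```python
import numpy as np
from collections import Counter

def build_length_pmf(child_sentence_lengths, temperature=2.0):
    """Empirical PMF over child sentence lengths (in tokens) from CHILDES,
    flattened with a temperature and min-max normalized to [0, 1]."""
    counts = Counter(child_sentence_lengths)
    lengths = range(1, max(counts) + 1)
    pmf = np.array([counts.get(l, 0) for l in lengths], dtype=float)
    pmf = (pmf / pmf.sum()) ** (1.0 / temperature)     # temperature smoothing
    pmf = (pmf - pmf.min()) / (pmf.max() - pmf.min())  # min-max normalize
    return {l: s for l, s in zip(lengths, pmf)}

def rm3_length_score(text, pmf, punctuation=".!?"):
    """Length prior for one response, divided by the number of sentence-final
    punctuation marks to encourage a single short sentence."""
    n_tokens = len(text.split())
    n_punct = sum(text.count(ch) for ch in punctuation)
    return pmf.get(n_tokens, 0.0) / max(1, n_punct)
```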
Inference and prompt format
- Chat template: compatible with SmolLM2 Instruct-style templates
- Guidance:
- Input: single caregiver question or brief prompt
- Output: one short, coherent sentence with age-appropriate vocabulary
- Example prompt format (generic):
- System: “You are a 2–3-year-old child speaking in short, simple sentences.”
- User: “Caregiver: What did you have for dessert for lunch?”
- Assistant (model): “i had some spaghettis.” (Note: spelling/grammar may be age-typical)
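A generation sketch using the transformers chat-template API. The model id "Toddler-LLM" is a placeholder for the released checkpoint, and the sampling settings are illustrative, not tuned values from the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "Toddler-LLM" is a placeholder model id; substitute the released checkpoint.
tok = AutoTokenizer.from_pretrained("Toddler-LLM")
model = AutoModelForCausalLM.from_pretrained("Toddler-LLM")

messages = [
    {"role": "system",
     "content": "You are a 2-3-year-old child speaking in short, simple sentences."},
    {"role": "user",
     "content": "Caregiver: What did you have for dessert for lunch?"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt")
out = model.generate(inputs, max_new_tokens=32, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```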
Example generations (from evaluation set)
| Caregiver Utterance | Gold Child Response | llm-toddler-30 |
| --- | --- | --- |
| What did you have for dessert for lunch? | a ice cream sandwich | i had some spaghettis. |
| What did we do with the toy? | we put it in his cage | it broke. |
| Where is your bag? | I didn't bring it. It's in the car | i got it |
| What did you find? | look. Elmo | i found luke's lightsaber. |
| What did you put on the pizza? | cheese and cheese | i put sause on it |
Evaluation
- Human evaluation (19 participants; two robot platforms: Cozmo, Misty II):
- Perceived age: ~3 years, closest to the 2–3-year target for our model
- Coherence: comparable across models; higher with Cozmo than Misty on average
- AoA (age of acquisition) and vocabulary: human responders used broader vocabulary and higher-AoA words; the models stayed at a lower AoA as intended; some SmolLM variants occasionally produced adult-level content
- Notable: Participant expectations matched Cozmo’s child-like morphology/voice better than Misty’s
Intended use
- Research on child-like conversational agents and human-robot interaction
- Simulated child responses to caregiver prompts
Out-of-scope and limitations
- Not for clinical, diagnostic, educational placement, or childcare decision-making
- English-only; small corpus (≈14M tokens); limited world knowledge
- Can produce off-context, random child-like words; may fixate on certain “baby words”
- May generate age-inappropriate content in rare cases; monitor outputs
- Sensitive to prompt phrasing; best with concise caregiver questions
Safety and ethical considerations
- Use responsibly around minors; ensure adult supervision in interactive settings
- Avoid anthropomorphizing beyond research context
- Respect CHILDES data licenses and privacy norms
- Models may reflect biases or artifacts from child-directed corpora