tinyGemma Urdu

A 0.96 million parameter Gemma-style language model trained on an Urdu corpus.

Architecture

A version of Google's Gemma architecture, as defined in GemmaConfig, with the following components:

  • GemmaAttention: Multi-head attention with grouped query attention (num_queries_per_kv), RoPE positional embeddings via apply_rotary_emb(), and causal masking using a pre-computed triangular mask (see the first sketch after this list)
  • GemmaMLP: Feed-forward network with a GELU-gated projection, gelu(gate_proj(x)) * up_proj(x), projected back through down_proj (see the second sketch after this list)
  • GemmaDecoderLayer: Transformer block combining self_attn and mlp with pre-normalization using RMSNorm
  • RMSNorm: Root Mean Square Layer Normalization with optional unit offset (add_unit_offset=True) and learnable weight parameter
  • tinyGemma: Complete model with embedder scaled by sqrt(hidden_size) and tied weights for language modeling head
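
The attention path can be illustrated with a short PyTorch sketch. This is a minimal, self-contained example of grouped-query attention with rotary embeddings and a pre-computed causal mask, assuming illustrative tensor shapes and a standard 10000-base rotary frequency schedule; it is not the exact GemmaAttention code in this repository.

```python
import math
import torch
import torch.nn.functional as F


def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq, heads, head_dim). Pairs of channels are rotated by
    # position-dependent complex angles (freqs_cis, broadcastable to x).
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_complex * freqs_cis).flatten(-2).type_as(x)


def grouped_query_attention(q, k, v, num_queries_per_kv, mask):
    # q: (batch, seq, num_heads, head_dim); k, v: (batch, seq, num_kv_heads, head_dim).
    # Each key/value head is shared by num_queries_per_kv query heads.
    k = torch.repeat_interleave(k, num_queries_per_kv, dim=2)
    v = torch.repeat_interleave(v, num_queries_per_kv, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))           # (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + mask                                      # causal mask: -inf above the diagonal
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2)      # (batch, seq, heads, head_dim)


batch, seq, num_heads, num_kv_heads, head_dim = 1, 8, 4, 1, 16  # illustrative sizes only
q = torch.randn(batch, seq, num_heads, head_dim)
k = torch.randn(batch, seq, num_kv_heads, head_dim)
v = torch.randn(batch, seq, num_kv_heads, head_dim)

# Standard rotary frequencies (10000 base) and the pre-computed triangular mask.
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.outer(torch.arange(seq).float(), inv_freq)       # (seq, head_dim // 2)
freqs_cis = torch.polar(torch.ones_like(angles), angles)[None, :, None, :]
mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)

q, k = apply_rotary_emb(q, freqs_cis), apply_rotary_emb(k, freqs_cis)
out = grouped_query_attention(q, k, v, num_heads // num_kv_heads, mask)
print(out.shape)  # torch.Size([1, 8, 4, 16])
```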

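And a matching sketch of the normalization, gated feed-forward, and embedding-scaling pieces. Hyperparameter values (hidden_size, intermediate_size, eps) and the zero-initialized RMSNorm weight are assumptions for illustration, not values read from this repository's GemmaConfig.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6, add_unit_offset: bool = True):
        super().__init__()
        self.eps = eps
        self.add_unit_offset = add_unit_offset
        self.weight = nn.Parameter(torch.zeros(dim))    # learnable scale

    def forward(self, x):
        # Normalize by the root mean square of the features.
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        # With the unit offset, the effective scale is (1 + weight), so a
        # zero-initialized weight starts out as the identity scale.
        return norm * (1 + self.weight) if self.add_unit_offset else norm * self.weight


class GemmaMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # GELU-gated feed-forward: gelu(gate_proj(x)) * up_proj(x), then down_proj.
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))


# Input embeddings are scaled by sqrt(hidden_size), and the same embedding matrix
# is reused (tied) as the language-modeling head.
hidden_size, intermediate_size, vocab_size = 64, 256, 1000      # illustrative sizes only
embedder = nn.Embedding(vocab_size, hidden_size)
ids = torch.randint(0, vocab_size, (1, 8))

h = embedder(ids) * hidden_size ** 0.5                          # scaled input embeddings
h = h + GemmaMLP(hidden_size, intermediate_size)(RMSNorm(hidden_size)(h))  # pre-norm block
logits = h @ embedder.weight.T                                  # tied LM head
print(logits.shape)  # torch.Size([1, 8, 1000])
```
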
Training Results

The model converged on the Urdu corpus with the following performance metrics:

Final Training Metrics (5000 iterations):
- Training Loss: 2.7668
- Validation Loss: 2.9250  
- Validation Perplexity: 18.6348
- Learning Rate: 3e-4 with the AdamW optimizer
- Batch Size: 16 with 2 gradient accumulation steps (effective batch size 32; see the sketch below)
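
As a rough illustration, here is a self-contained sketch of the optimizer and gradient-accumulation setup implied by these numbers. The toy model and random batches are placeholders, not the actual training script; only the learning rate (3e-4, AdamW), batch size (16), and accumulation steps (2) come from the metrics above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size, seq_len = 256, 32, 64      # illustrative sizes only
batch_size, accum_steps = 16, 2                     # effective batch = 16 * 2 = 32

model = nn.Sequential(nn.Embedding(vocab_size, hidden_size),
                      nn.Linear(hidden_size, vocab_size))   # stand-in for tinyGemma
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(10):                              # the real run used 5000 iterations
    x = torch.randint(0, vocab_size, (batch_size, seq_len))
    y = torch.randint(0, vocab_size, (batch_size, seq_len))
    loss = F.cross_entropy(model(x).view(-1, vocab_size), y.view(-1))
    (loss / accum_steps).backward()                 # average gradients over the accumulation window
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Perplexity is exp(cross-entropy loss), so the reported numbers are consistent:
print(math.exp(2.9250))                             # ≈ 18.63, the validation perplexity
```

Gradient accumulation lets two batches of 16 contribute to a single optimizer step, giving the effective batch size of 32 without the extra memory of a larger batch.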

Loss Curves

Training and validation loss curves.

License

MIT License
