tinyGemma Urdu

A 0.96 million parameter Gemma-style language model trained on an Urdu corpus.

Architecture

A version of Google's Gemma architecture, as defined in GemmaConfig, with the following components:

  • GemmaAttention: Multi-head attention with grouped query attention (num_queries_per_kv), RoPE positional embeddings via apply_rotary_emb(), and causal masking using a pre-computed triangular mask (see the first sketch after this list)
  • GemmaMLP: Feed-forward network with a GELU-gated projection, gelu(gate_proj(x)) * up_proj(x), projected back through down_proj (see the second sketch after this list)
  • GemmaDecoderLayer: Transformer block combining self_attn and mlp with pre-normalization using RMSNorm
  • RMSNorm: Root Mean Square Layer Normalization with optional unit offset (add_unit_offset=True) and learnable weight parameter
  • tinyGemma: Complete model with embedder scaled by sqrt(hidden_size) and tied weights for language modeling head
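
The attention path can be illustrated with a short PyTorch sketch. This is a minimal, self-contained example of grouped-query attention with rotary embeddings and a pre-computed causal mask, assuming illustrative tensor shapes and a standard 10000-base rotary frequency schedule; it is not the exact GemmaAttention code in this repository.

```python
import math
import torch
import torch.nn.functional as F


def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq, heads, head_dim). Pairs of channels are rotated by
    # position-dependent complex angles (freqs_cis, broadcastable to x).
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_complex * freqs_cis).flatten(-2).type_as(x)


def grouped_query_attention(q, k, v, num_queries_per_kv, mask):
    # q: (batch, seq, num_heads, head_dim); k, v: (batch, seq, num_kv_heads, head_dim).
    # Each key/value head is shared by num_queries_per_kv query heads.
    k = torch.repeat_interleave(k, num_queries_per_kv, dim=2)
    v = torch.repeat_interleave(v, num_queries_per_kv, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))           # (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + mask                                      # causal mask: -inf above the diagonal
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2)      # (batch, seq, heads, head_dim)


batch, seq, num_heads, num_kv_heads, head_dim = 1, 8, 4, 1, 16  # illustrative sizes only
q = torch.randn(batch, seq, num_heads, head_dim)
k = torch.randn(batch, seq, num_kv_heads, head_dim)
v = torch.randn(batch, seq, num_kv_heads, head_dim)

# Standard rotary frequencies (10000 base) and the pre-computed triangular mask.
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.outer(torch.arange(seq).float(), inv_freq)       # (seq, head_dim // 2)
freqs_cis = torch.polar(torch.ones_like(angles), angles)[None, :, None, :]
mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)

q, k = apply_rotary_emb(q, freqs_cis), apply_rotary_emb(k, freqs_cis)
out = grouped_query_attention(q, k, v, num_heads // num_kv_heads, mask)
print(out.shape)  # torch.Size([1, 8, 4, 16])
```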

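And a matching sketch of the normalization, gated feed-forward, and embedding-scaling pieces. Hyperparameter values (hidden_size, intermediate_size, eps) and the zero-initialized RMSNorm weight are assumptions for illustration, not values read from this repository's GemmaConfig.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6, add_unit_offset: bool = True):
        super().__init__()
        self.eps = eps
        self.add_unit_offset = add_unit_offset
        self.weight = nn.Parameter(torch.zeros(dim))    # learnable scale

    def forward(self, x):
        # Normalize by the root mean square of the features.
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        # With the unit offset, the effective scale is (1 + weight), so a
        # zero-initialized weight starts out as the identity scale.
        return norm * (1 + self.weight) if self.add_unit_offset else norm * self.weight


class GemmaMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # GELU-gated feed-forward: gelu(gate_proj(x)) * up_proj(x), then down_proj.
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))


# Input embeddings are scaled by sqrt(hidden_size), and the same embedding matrix
# is reused (tied) as the language-modeling head.
hidden_size, intermediate_size, vocab_size = 64, 256, 1000      # illustrative sizes only
embedder = nn.Embedding(vocab_size, hidden_size)
ids = torch.randint(0, vocab_size, (1, 8))

h = embedder(ids) * hidden_size ** 0.5                          # scaled input embeddings
h = h + GemmaMLP(hidden_size, intermediate_size)(RMSNorm(hidden_size)(h))  # pre-norm block
logits = h @ embedder.weight.T                                  # tied LM head
print(logits.shape)  # torch.Size([1, 8, 1000])
```
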
Training Results

The model converged on the Urdu corpus with the following performance metrics:

Final Training Metrics (5000 iterations):
- Training Loss: 2.7668
- Validation Loss: 2.9250  
- Validation Perplexity: 18.6348
- Learning Rate: 3e-4 with the AdamW optimizer
- Batch Size: 16 with 2 gradient accumulation steps (effective batch size 32; see the sketch below)
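
As a rough illustration, here is a self-contained sketch of the optimizer and gradient-accumulation setup implied by these numbers. The toy model and random batches are placeholders, not the actual training script; only the learning rate (3e-4, AdamW), batch size (16), and accumulation steps (2) come from the metrics above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size, seq_len = 256, 32, 64      # illustrative sizes only
batch_size, accum_steps = 16, 2                     # effective batch = 16 * 2 = 32

model = nn.Sequential(nn.Embedding(vocab_size, hidden_size),
                      nn.Linear(hidden_size, vocab_size))   # stand-in for tinyGemma
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(10):                              # the real run used 5000 iterations
    x = torch.randint(0, vocab_size, (batch_size, seq_len))
    y = torch.randint(0, vocab_size, (batch_size, seq_len))
    loss = F.cross_entropy(model(x).view(-1, vocab_size), y.view(-1))
    (loss / accum_steps).backward()                 # average gradients over the accumulation window
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Perplexity is exp(cross-entropy loss), so the reported numbers are consistent:
print(math.exp(2.9250))                             # ≈ 18.63, the validation perplexity
```

Gradient accumulation lets two batches of 16 contribute to a single optimizer step, giving the effective batch size of 32 without the extra memory of a larger batch.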

Loss Curves

Training and validation loss curves.

License

MIT License
