|
|
--- |
|
|
language: en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- i3-architecture |
|
|
- hybrid-model |
|
|
- rwkv-mamba |
|
|
- custom_code |
|
|
datasets: |
|
|
- agentlans/high-quality-english-sentences |
|
|
- roneneldan/TinyStories |
|
|
- starhopp3r/TinyChat |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# i3-80M - Hybrid Architecture Language Model |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**i3-80M** is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
|
|
|
|
|
This is the second model in the i3 series, scaling up from the original [i3-22M](https://huggingface.co/FlameF0X/i3-22m) with improved architecture and multi-dataset training. |
|
|
|
|
|
> [!NOTE] |
|
|
> You can try the model in the hosted demo [here](https://huggingface.co/spaces/FlameF0X/i3-80m).
|
|
> |
|
|
> [Read this in Romanian :)](https://huggingface.co/FlameF0X/i3-80m/blob/main/CITE%C8%98TEM%C4%82.md)
|
|
|
|
|
## Model Statistics |
|
|
|
|
|
- **Total Parameters**: ~82.77M (82,765,160) |
|
|
- **Architecture**: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers |
|
|
- **Vocabulary Size**: 35,560 tokens (variable-length chunks with an `<UNK>` token)
|
|
- **Hidden Dimension (d_model)**: 512 |
|
|
- **Attention Heads**: 16 |
|
|
- **State Dimension (d_state)**: 32 |
|
|
- **Max Sequence Length**: 256 |
|
|
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters) |
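
For quick reference, the hyperparameters above can be gathered into a single configuration object. This is only an illustrative sketch; the field names are assumptions and do not necessarily match the repository's `config.json`.

```python
from dataclasses import dataclass

@dataclass
class I3Config:
    """i3-80M hyperparameters as listed above (field names are illustrative)."""
    vocab_size: int = 35_560        # variable-length chunks + <UNK>
    d_model: int = 512              # hidden dimension
    n_hybrid_layers: int = 10       # RWKV-Mamba hybrid blocks (layers 1-10)
    n_attention_layers: int = 6     # full multi-head attention blocks (layers 11-16)
    n_heads: int = 16               # attention heads
    d_state: int = 32               # state-space dimension
    max_seq_len: int = 256          # maximum sequence length

config = I3Config()
print(config)
```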
|
|
|
|
|
### Architecture Breakdown |
|
|
``` |
|
|
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv) |
|
|
├─ RWKVMambaHybrid (Time-mixing + State-space) |
|
|
└─ Feed-Forward Network (4x expansion) |
|
|
|
|
|
Layers 11-16: Full Attention Blocks |
|
|
├─ Multi-Head Attention (16 heads) |
|
|
└─ Feed-Forward Network (4x expansion) |
|
|
``` |
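
As a structural illustration only, the 10 + 6 layout can be sketched in PyTorch as below. This is not the repository's implementation: a causal depthwise convolution stands in for the `RWKVMambaHybrid` block (a closer sketch of that block appears under Technical Innovations), and `nn.TransformerEncoderLayer` stands in for the full-attention block.

```python
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    """Stand-in for the RWKV-Mamba hybrid block: causal depthwise conv + 4x FFN."""
    def __init__(self, d_model, kernel_size=4, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_mult * d_model), nn.GELU(),
                                 nn.Linear(ffn_mult * d_model, d_model))

    def forward(self, x):                               # x: (batch, seq, d_model)
        h = self.conv(self.norm1(x).transpose(1, 2))[..., : x.size(1)]
        x = x + h.transpose(1, 2)                       # causal mixing over time
        return x + self.ffn(self.norm2(x))              # 4x-expanded feed-forward

d_model, n_heads, seq_len = 512, 16, 256
hybrid = nn.ModuleList(CausalConvBlock(d_model) for _ in range(10))        # layers 1-10
attention = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                               batch_first=True, norm_first=True)
    for _ in range(6))                                                     # layers 11-16
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

x = torch.randn(2, seq_len, d_model)
for block in hybrid:
    x = block(x)
for block in attention:
    x = block(x, src_mask=causal_mask)
print(x.shape)   # torch.Size([2, 256, 512])
```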
|
|
|
|
|
## Comparison with i3-22M |
|
|
|
|
|
| Feature | i3-22M | i3-80M (This Model) |
|---------|--------|---------------------|
| **Parameters** | 22.6M | 82.77M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| **Total Tokens** | ~1M conversations | ~3M+ tokens |
| **Final Loss** | ~2.0 | ~2.0 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 |
| **Training Time** | ~17 hours | ~2-4 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers |
|
|
|
|
|
### Key Improvements Over i3-22M |
|
|
|
|
|
1. **Hybrid Architecture**: Introduces full multi-head attention in upper layers for better long-range dependencies |
|
|
2. **Larger Vocabulary**: 8x larger vocabulary (35,560 vs 4,466) for better token coverage |
|
|
3. **Multi-Dataset Training**: Trained on 3 diverse datasets vs single dataset |
|
|
4. **Better Generalization**: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences) |
|
|
5. **Enhanced Unknown Token Handling**: Robust `<UNK>` token system for out-of-vocabulary words
|
|
|
|
|
### When to Use Each Model |
|
|
|
|
|
**Use i3-22M if you need:** |
|
|
- Smaller model size (~22M params) |
|
|
- Pure conversational focus (TinyChat specialized) |
|
|
- Lower memory footprint |
|
|
- Faster inference |
|
|
|
|
|
**Use i3-80M if you need:** |
|
|
- Better general-purpose text generation |
|
|
- Stronger attention-based reasoning (6 attention layers) |
|
|
- Larger vocabulary coverage |
|
|
- Multi-domain text understanding (stories, chat, formal text) |
|
|
|
|
|
### Key Features |
|
|
|
|
|
1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention |
|
|
- Early layers use RWKV-Mamba hybrid for efficient sequence processing |
|
|
- Later layers use full multi-head attention for complex pattern recognition |
|
|
|
|
|
2. **Memory-Optimized Training**: |
|
|
- Streaming vocabulary building (no full text storage) |
|
|
- Vocabulary caching (build once, reuse) |
|
|
- Efficient chunk frequency counting |
|
|
- Automatic memory cleanup |
|
|
|
|
|
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding |
|
|
- TinyStories: Narrative and storytelling |
|
|
- TinyChat: Conversational dynamics |
|
|
- High-Quality English Sentences: Linguistic diversity |
|
|
|
|
|
4. **Smart Tokenization**: Variable-length chunking (2-3 chars) with common trigram optimization |
|
|
- Total tokens processed: **3,000,000+** |
|
|
   - Handles unknown tokens gracefully with an `<UNK>` token
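
As a rough illustration of the chunking scheme described above (streaming frequency counting, 2-3 character chunks, trigram preference, `<UNK>` fallback), here is a minimal sketch. It is a reconstruction for explanation only, not the tokenizer shipped with this model; the exact chunking rules and vocabulary format may differ.

```python
from collections import Counter
from typing import Iterable, Iterator

def iter_chunks(text: str, common_trigrams: set) -> Iterator[str]:
    """Split text into variable-length chunks, preferring known 3-char trigrams."""
    i = 0
    while i < len(text):
        tri = text[i:i + 3]
        if len(tri) == 3 and tri in common_trigrams:
            yield tri             # 3-char chunk for common trigrams
            i += 3
        else:
            yield text[i:i + 2]   # otherwise fall back to a 2-char chunk
            i += 2

def build_vocab(texts: Iterable[str], common_trigrams: set, max_size: int = 35_560) -> dict:
    """Stream over the corpus and count chunk frequencies without storing the full text."""
    counts = Counter()
    for text in texts:
        counts.update(iter_chunks(text, common_trigrams))
    vocab = {"<UNK>": 0}
    for chunk, _ in counts.most_common(max_size - 1):
        vocab[chunk] = len(vocab)
    return vocab

def encode(text: str, vocab: dict, common_trigrams: set) -> list:
    """Map chunks to ids, falling back to <UNK> for out-of-vocabulary chunks."""
    return [vocab.get(c, vocab["<UNK>"]) for c in iter_chunks(text, common_trigrams)]

# toy usage
trigrams = {"the", "ing", "and"}
vocab = build_vocab(["the cat and the dog sing"], trigrams)
print(encode("the dog was singing", vocab, trigrams))
```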
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Datasets**: |
|
|
- `agentlans/high-quality-english-sentences` |
|
|
- `roneneldan/TinyStories` |
|
|
- `starhopp3r/TinyChat` |
|
|
- **Training Steps**: 5,000 iterations |
|
|
- **Batch Size**: 4 (with gradient accumulation support) |
|
|
- **Learning Rate**: 3e-4 (with warmup and cosine decay) |
|
|
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0) |
|
|
- **Hardware**: NVIDIA P100 (16GB VRAM) |
|
|
- **Training Time**: ~2-4 hours |
|
|
- **Framework**: PyTorch |
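
A generic sketch of this optimizer setup (AdamW, linear warmup into cosine decay, gradient clipping at max norm 1.0) is shown below. It is an outline only, not the actual training script; `model`, `get_batch`, and the warmup length are placeholders.

```python
import math
import torch
import torch.nn.functional as F

lr, max_steps, warmup_steps, batch_size = 3e-4, 5_000, 200, 4   # warmup length is assumed
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)        # `model` is a placeholder

def lr_lambda(step):
    # linear warmup followed by cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(max_steps):
    input_ids, targets = get_batch(batch_size)                  # placeholder data loader
    logits = model(input_ids)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```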
|
|
|
|
|
### Training Dynamics |
|
|
|
|
|
- **GPU Utilization**: Stable at ~15-20% during training |
|
|
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
|
|
- **Power Usage**: ~40W average |
|
|
- **Throughput**: ~100-550 tokens/sec |
|
|
|
|
|
### Performance Metrics |
|
|
|
|
|
| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | ~1.7 |
| Perplexity | ~4000+ | ~6 |
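
For reference, the perplexity column appears to be the usual exp of the mean cross-entropy loss; as a quick check, the final values are consistent with that relationship:

```python
import math

final_loss = 1.7
print(math.exp(final_loss))   # ~5.5, in line with the reported final perplexity of ~6
```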
|
|
|
|
|
 |
|
|
> [!NOTE] |
|
|
> I don't know why the logging starts at step 4.6k.
|
|
|
|
|
How do **i3-22m** and **i3-80m** compare?
|
|
|
|
|
 |
|
|
|
|
|
The model shows strong convergence with stable training dynamics and efficient GPU utilization. |
|
|
|
|
|
## Usage |
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
# Load model and tokenizer |
|
|
# the i3 architecture ships custom modeling code, so trust_remote_code is required
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
|
|
|
|
|
# Generate text |
|
|
prompt = "hello" |
|
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
|
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # temperature/top_k only take effect when sampling is enabled
    temperature=0.8,
    top_k=40
)
|
|
generated_text = tokenizer.decode(outputs[0]) |
|
|
print(generated_text) |
|
|
``` |
|
|
|
|
|
|
|
|
## Technical Innovations |
|
|
|
|
|
1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics |
|
|
- Linear complexity for long sequences |
|
|
- Efficient recurrent processing |
|
|
   - State-space modeling for temporal dependencies (a simplified sketch follows this list)
|
|
|
|
|
2. **Hierarchical Processing**: |
|
|
- Lower layers focus on local patterns (conv/recurrent) |
|
|
- Upper layers capture global dependencies (attention) |
|
|
|
|
|
3. **Memory Efficiency**: |
|
|
- Streaming tokenization during vocab building |
|
|
- No full dataset storage in RAM |
|
|
- Automatic cleanup of intermediate data |
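
To make item 1 more concrete, below is a heavily simplified, self-contained sketch of a recurrence that blends RWKV-style time-mixing (interpolating each token with the previous one) with a Mamba-style diagonal state-space update. It is illustrative only and is not the `RWKVMambaHybrid` module used in this repository; in practice the Python loop would be replaced by a parallel scan or chunked recurrence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRWKVMambaHybrid(nn.Module):
    """Simplified illustration: RWKV-style token shift feeding a diagonal
    state-space recurrence with an input-dependent (selective) decay gate."""
    def __init__(self, d_model=512, d_state=32):
        super().__init__()
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))   # time-mixing weights
        self.in_proj = nn.Linear(d_model, d_state)
        self.gate = nn.Linear(d_model, d_state)                # input-dependent decay
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        # RWKV-style time-mixing: blend each token with the previous token
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]
        mixed = self.mix * x + (1 - self.mix) * prev

        # Mamba-style recurrence: h_t = a_t * h_{t-1} + u_t (linear in seq length)
        u = self.in_proj(mixed)
        a = torch.sigmoid(self.gate(mixed))      # per-step decay in (0, 1)
        h = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outputs = []
        for t in range(x.size(1)):
            h = a[:, t] * h + u[:, t]
            outputs.append(h)
        return x + self.out_proj(torch.stack(outputs, dim=1))  # residual connection

layer = TinyRWKVMambaHybrid()
print(layer(torch.randn(2, 256, 512)).shape)     # torch.Size([2, 256, 512])
```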
|
|
|
|
|
## Model Files |
|
|
|
|
|
- `pytorch_model.bin`: Model weights |
|
|
- `config.json`: Model configuration |
|
|
- `chunk_vocab_combined.json`: Tokenizer vocabulary |
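
If you want to inspect these files directly (for example the chunk vocabulary), they can be downloaded with `huggingface_hub`. This is a generic sketch that only assumes the filenames listed above; the structure of the vocabulary JSON is not documented here.

```python
import json
from huggingface_hub import hf_hub_download

# Fetch the tokenizer vocabulary and model configuration from the Hub
vocab_path = hf_hub_download("FlameF0X/i3-80m", "chunk_vocab_combined.json")
config_path = hf_hub_download("FlameF0X/i3-80m", "config.json")

with open(vocab_path) as f:
    vocab = json.load(f)
print(f"{len(vocab)} entries in the chunk vocabulary")

with open(config_path) as f:
    print(json.load(f))
```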
|
|
|
|
|
## Training Tracking |
|
|
|
|
|
This model was tracked using Weights & Biases (WandB) with comprehensive metrics: |
|
|
- Real-time loss and perplexity tracking |
|
|
- Gradient norm monitoring |
|
|
- Learning rate scheduling visualization |
|
|
- Generation samples logged to tables |
|
|
- Model checkpoints as artifacts |
|
|
- System resource monitoring |
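
The tracking above maps onto standard Weights & Biases calls. The sketch below is generic and not the exact training script; `loss`, `ppl`, `grad_norm`, `current_lr`, `step`, and `generated_text` are placeholders.

```python
import wandb

run = wandb.init(project="i3-80m", config={"lr": 3e-4, "batch_size": 4, "steps": 5000})

# inside the training loop (all logged values are placeholders here)
wandb.log({"train/loss": loss, "train/perplexity": ppl,
           "train/grad_norm": grad_norm, "train/lr": current_lr}, step=step)

# periodically log generation samples to a table
samples = wandb.Table(columns=["step", "prompt", "generation"])
samples.add_data(step, "hello", generated_text)
wandb.log({"samples": samples})

# save a checkpoint as a versioned artifact
artifact = wandb.Artifact("i3-80m-checkpoint", type="model")
artifact.add_file("pytorch_model.bin")
run.log_artifact(artifact)
run.finish()
```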
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained on English text only |
|
|
- Limited to a 256-token context window
|
|
- May require fine-tuning for specific downstream tasks |
|
|
- Conversational style influenced by TinyChat dataset |
|
|
|
|
|
## Model Series |
|
|
|
|
|
- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with pure hybrid architecture |
|
|
- **i3-80M** (This model) - Scaled version with attention layers and multi-dataset training |
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{i3-80m, |
|
|
author = {FlameF0X}, |
|
|
title = {i3-80M: Hybrid Architecture Language Model}, |
|
|
year = {2025}, |
|
|
publisher = {HuggingFace}, |
|
|
howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}} |
|
|
} |
|
|
``` |
|
|
```bibtex |
|
|
@article{mamba, |
|
|
title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces}, |
|
|
author={Gu, Albert and Dao, Tri}, |
|
|
journal={arXiv preprint arXiv:2312.00752}, |
|
|
year={2023} |
|
|
} |
|
|
@article{RWKV, |
|
|
title={RWKV: Reinventing RNNs for the Transformer Era}, |
|
|
author={Peng, Bo and others}, |
|
|
journal={arXiv preprint arXiv:2305.13048}, |
|
|
year={2023} |
|
|
}
```