---
language: en
license: apache-2.0
tags:
- i3-architecture
- hybrid-model
- rwkv-mamba
- custom_code
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---
# i3-80M - Hybrid Architecture Language Model
## Model Description
The **i3-80M Model** is a novel hybrid architecture combining convolutional/recurrent layers with full attention layers for efficient language modeling. This architecture uniquely blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in deeper layers.
This is the second model in the i3 series, scaling up from the original [i3-22M](https://huggingface.co/FlameF0X/i3-22m) with improved architecture and multi-dataset training.
> [!NOTE]
> To try the model, use the hosted demo [here](https://huggingface.co/spaces/FlameF0X/i3-80m).
>
> [Read it here in Romanian :)](https://huggingface.co/FlameF0X/i3-80m/blob/main/CITE%C8%98TEM%C4%82.md)
## Model Statistics
- **Total Parameters**: ~82.77M (82,765,160)
- **Architecture**: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- **Vocabulary Size**: 35,560 tokens (variable-length chunks with `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)
### Architecture Breakdown
```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)
Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```
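For orientation, the 10 + 6 stacking above can be wired up as in the following PyTorch sketch. This is not the repository's implementation: `HybridBlock` here is a stand-in (the real hybrid block is the custom `RWKVMambaHybrid` module), and only the hyperparameters listed above are taken from this card.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, FFN_MULT = 512, 16, 4     # hyperparameters from the card
N_HYBRID, N_ATTN = 10, 6                    # 10 hybrid + 6 attention layers

def ffn(d_model: int, mult: int) -> nn.Sequential:
    # 4x-expansion feed-forward network used in every block
    return nn.Sequential(nn.Linear(d_model, mult * d_model), nn.GELU(),
                         nn.Linear(mult * d_model, d_model))

class AttentionBlock(nn.Module):
    """Stand-in for layers 11-16: multi-head attention + FFN (causal mask omitted for brevity)."""
    def __init__(self):
        super().__init__()
        self.norm1 = nn.LayerNorm(D_MODEL)
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.norm2 = nn.LayerNorm(D_MODEL)
        self.ffn = ffn(D_MODEL, FFN_MULT)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

class HybridBlock(AttentionBlock):
    """Stand-in for layers 1-10; the actual model uses its RWKVMambaHybrid module
    (time-mixing + state-space recurrence), sketched under Technical Innovations below."""
    pass

stack = nn.Sequential(*[HybridBlock() for _ in range(N_HYBRID)],
                      *[AttentionBlock() for _ in range(N_ATTN)])
print(stack(torch.randn(1, 8, D_MODEL)).shape)      # torch.Size([1, 8, 512])
```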
## Comparison with i3-22M
| Feature | i3-22M | i3-80M (This Model) |
|---------|--------|---------------------|
| **Parameters** | 22.6M | 82.77M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| **Training Data** | ~1M conversations | ~3M+ tokens |
| **Final Loss** | ~2.0 | ~2.0 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 |
| **Training Time** | ~17 hours | ~2-4 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers |
### Key Improvements Over i3-22M
1. **Hybrid Architecture**: Introduces full multi-head attention in upper layers for better long-range dependencies
2. **Larger Vocabulary**: 8x larger vocabulary (35,560 vs 4,466) for better token coverage
3. **Multi-Dataset Training**: Trained on 3 diverse datasets vs single dataset
4. **Better Generalization**: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
5. **Enhanced Unknown Token Handling**: Robust `<UNK>` token system for out-of-vocabulary words
### When to Use Each Model
**Use i3-22M if you need:**
- Smaller model size (~22M params)
- Pure conversational focus (TinyChat specialized)
- Lower memory footprint
- Faster inference
**Use i3-80M if you need:**
- Better general-purpose text generation
- Stronger attention-based reasoning (6 attention layers)
- Larger vocabulary coverage
- Multi-domain text understanding (stories, chat, formal text)
### Key Features
1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
- Early layers use RWKV-Mamba hybrid for efficient sequence processing
- Later layers use full multi-head attention for complex pattern recognition
2. **Memory-Optimized Training**:
- Streaming vocabulary building (no full text storage)
- Vocabulary caching (build once, reuse)
- Efficient chunk frequency counting
- Automatic memory cleanup
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
- TinyStories: Narrative and storytelling
- TinyChat: Conversational dynamics
- High-Quality English Sentences: Linguistic diversity
4. **Smart Tokenization**: Variable-length chunking (2-3 chars) with common trigram optimization
- Total tokens processed: **3,000,000+**
- Handles unknown tokens gracefully with the `<UNK>` token
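As a rough illustration of the scheme above (streaming vocabulary building, 2-3 character chunks with a preference for common trigrams, and an `<UNK>` fallback), here is a minimal sketch. The chunking rule, frequency threshold, and function names are assumptions for illustration only; the actual vocabulary ships as `chunk_vocab_combined.json`.

```python
from collections import Counter
from typing import Dict, Iterable, List

UNK = "<UNK>"

def build_vocab(stream: Iterable[str], min_freq: int = 2) -> Dict[str, int]:
    # Streaming vocabulary build: texts are consumed one at a time and only the
    # chunk-frequency Counter is kept in memory, never the full corpus.
    counts: Counter = Counter()
    for text in stream:
        counts.update(text[i:i + 2] for i in range(len(text) - 1))   # 2-char chunks
        counts.update(text[i:i + 3] for i in range(len(text) - 2))   # 3-char chunks (trigrams)
    vocab = {UNK: 0}
    for chunk, freq in counts.most_common():
        if freq >= min_freq:
            vocab[chunk] = len(vocab)
    return vocab

def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    # Greedy variable-length chunking: take a 3-char chunk when it is a known
    # (common) trigram, otherwise fall back to a 2-char chunk; anything still
    # unknown maps to the <UNK> id instead of failing.
    ids, i = [], 0
    while i < len(text):
        tri = text[i:i + 3]
        if len(tri) == 3 and tri in vocab:
            ids.append(vocab[tri]); i += 3
        else:
            ids.append(vocab.get(text[i:i + 2], vocab[UNK])); i += 2
    return ids

vocab = build_vocab(["once upon a time", "hello there, how are you?"], min_freq=1)
print(encode("hello world", vocab))   # unseen chunks like "wor" fall back to 2-char or <UNK>
```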
## Training Details
### Training Configuration
- **Datasets**:
- `agentlans/high-quality-english-sentences`
- `roneneldan/TinyStories`
- `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA P100 (16GB VRAM)
- **Training Time**: ~2-4 hours
- **Framework**: PyTorch
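The optimizer and schedule above translate into roughly the following PyTorch setup. The step count, peak learning rate, batch size, and clipping norm come from the list above; the warmup length, the stand-in model, and the dummy loss are placeholders for illustration.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

MAX_STEPS, WARMUP_STEPS, PEAK_LR = 5_000, 250, 3e-4   # warmup length is an assumption

model = torch.nn.Linear(512, 512)          # stand-in for the i3-80M model
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(MAX_STEPS):
    x = torch.randn(4, 512)                # batch size 4, as in the card
    loss = model(x).pow(2).mean()          # dummy loss in place of the LM objective
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```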
### Training Dynamics
- **GPU Utilization**: Stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
- **Power Usage**: ~40W average
- **Throughput**: ~100-550 tokens/sec
### Performance Metrics
| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | ~1.7 |
| Perplexity | ~4000+ | ~6 |
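Assuming the reported loss is the per-token cross-entropy, perplexity is simply its exponential, which is consistent with the table:

```python
import math
print(math.exp(1.7))   # ≈ 5.47, in line with the final perplexity of ~6 reported above
```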
![image](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/ugtJGyEkQfbGieURP2W78.png)
> [!NOTE]
> I don't know why the logging starts at step 4.6k.
How do **i3-22M** and **i3-80M** compare?
![image](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/utj6B7AE_gMMI9jnHc37Z.png)
The model shows strong convergence with stable training dynamics and efficient GPU utilization.
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (custom architecture, so trust_remote_code is required)
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # sampling must be enabled for temperature/top_k to take effect
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
## Technical Innovations
1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (see the sketch after this list)
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
2. **Hierarchical Processing**:
- Lower layers focus on local patterns (conv/recurrent)
- Upper layers capture global dependencies (attention)
3. **Memory Efficiency**:
- Streaming tokenization during vocab building
- No full dataset storage in RAM
- Automatic cleanup of intermediate data
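To make the recurrence in point 1 concrete, here is a heavily simplified single-block sketch of a token-shift time-mix (RWKV-style) feeding a gated diagonal state-space scan (Mamba-style). It illustrates the general idea only; the module name, projections, and update rule are assumptions, not the model's actual `RWKVMambaHybrid` implementation.

```python
import torch
import torch.nn as nn

class ToySSMTimeMix(nn.Module):
    """Illustrative only: token-shift time-mixing + a gated diagonal state-space scan."""
    def __init__(self, d_model: int = 512, d_state: int = 32):
        super().__init__()
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))   # blend of x_t with x_{t-1}
        self.in_proj = nn.Linear(d_model, d_state)
        self.gate = nn.Linear(d_model, d_state)                # input-dependent decay (Mamba-flavoured)
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x):                                      # x: (batch, time, d_model)
        shifted = torch.roll(x, shifts=1, dims=1)
        shifted[:, 0] = 0.0                                    # RWKV-style token shift
        mixed = self.mix * x + (1 - self.mix) * shifted

        u = self.in_proj(mixed)                                # (batch, time, d_state)
        a = torch.sigmoid(self.gate(mixed))                    # per-step decay in (0, 1)
        state = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):                             # linear-time recurrent scan
            state = a[:, t] * state + (1 - a[:, t]) * u[:, t]  # s_t = a_t * s_{t-1} + (1 - a_t) * u_t
            outs.append(state)
        return x + self.out_proj(torch.stack(outs, dim=1))     # residual back to d_model

block = ToySSMTimeMix()
print(block(torch.randn(2, 16, 512)).shape)   # torch.Size([2, 16, 512])
```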
## Model Files
- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `chunk_vocab_combined.json`: Tokenizer vocabulary
## Training Tracking
This model was tracked using Weights & Biases (WandB) with comprehensive metrics:
- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring
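The logging listed above maps onto standard Weights & Biases calls roughly as follows; the project name, metric keys, file path, and values are placeholders for illustration.

```python
import wandb

run = wandb.init(project="i3-80m", config={"lr": 3e-4, "steps": 5_000})  # placeholder project/config

# Per-step metrics: loss, perplexity, gradient norm, learning rate (dummy values here).
wandb.log({"train/loss": 1.7, "train/perplexity": 5.5,
           "train/grad_norm": 0.8, "train/lr": 3e-4}, step=5_000)

# Generation samples logged to a table.
table = wandb.Table(columns=["prompt", "generation"])
table.add_data("hello", "hello there, how are you?")
wandb.log({"samples": table})

# Checkpoints stored as artifacts (path must exist locally).
artifact = wandb.Artifact("i3-80m-checkpoint", type="model")
artifact.add_file("pytorch_model.bin")
run.log_artifact(artifact)

run.finish()
```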
## Limitations
- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset
## Model Series
- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with pure hybrid architecture
- **i3-80M** (This model) - Scaled version with attention layers and multi-dataset training
## Citation
```bibtex
@misc{i3-80m,
  author       = {FlameF0X},
  title        = {i3-80M: Hybrid Architecture Language Model},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```
```bibtex
@article{mamba,
  title   = {Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author  = {Gu, Albert and Dao, Tri},
  journal = {arXiv preprint arXiv:2312.00752},
  year    = {2023}
}

@article{RWKV,
  title   = {RWKV: Reinventing RNNs for the Transformer Era},
  author  = {Peng, Bo and others},
  journal = {arXiv preprint arXiv:2305.13048},
  year    = {2023}
}
```