|
|
--- |
|
|
language: en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- i3-architecture |
|
|
- hybrid-model |
|
|
- rwkv-mamba |
|
|
- custom_code |
|
|
datasets: |
|
|
- agentlans/high-quality-english-sentences |
|
|
- roneneldan/TinyStories |
|
|
- starhopp3r/TinyChat |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# i3-80M - Hybrid Architecture Language Model |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**i3-80M** is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
|
|
|
|
|
This is the second model in the i3 series, scaling up from the original [i3-22M](https://huggingface.co/FlameF0X/i3-22m) with improved architecture and multi-dataset training. |
|
|
|
|
|
> [!NOTE] |
|
|
> You can try the model in the hosted demo [here](https://huggingface.co/spaces/FlameF0X/i3-80m).
|
|
> |
|
|
> [Read this in Romanian :)](https://huggingface.co/FlameF0X/i3-80m/blob/main/CITE%C8%98TEM%C4%82.md)
|
|
|
|
|
## Model Statistics |
|
|
|
|
|
- **Total Parameters**: ~82.77M (82,765,160) |
|
|
- **Architecture**: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers |
|
|
- **Vocabulary Size**: 35,560 tokens (variable-length chunks with an `<UNK>` token)
|
|
- **Hidden Dimension (d_model)**: 512 |
|
|
- **Attention Heads**: 16 |
|
|
- **State Dimension (d_state)**: 32 |
|
|
- **Max Sequence Length**: 256 |
|
|
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters) |
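
For quick reference, the hyperparameters above can be gathered into a single configuration object. This is only an illustrative sketch; the field names are assumptions and do not necessarily match the repository's `config.json`.

```python
from dataclasses import dataclass

@dataclass
class I3Config:
    """i3-80M hyperparameters as listed above (field names are illustrative)."""
    vocab_size: int = 35_560        # variable-length chunks + <UNK>
    d_model: int = 512              # hidden dimension
    n_hybrid_layers: int = 10       # RWKV-Mamba hybrid blocks (layers 1-10)
    n_attention_layers: int = 6     # full multi-head attention blocks (layers 11-16)
    n_heads: int = 16               # attention heads
    d_state: int = 32               # state-space dimension
    max_seq_len: int = 256          # maximum sequence length

config = I3Config()
print(config)
```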
|
|
|
|
|
### Architecture Breakdown |
|
|
``` |
|
|
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv) |
|
|
├─ RWKVMambaHybrid (Time-mixing + State-space) |
|
|
└─ Feed-Forward Network (4x expansion) |
|
|
|
|
|
Layers 11-16: Full Attention Blocks |
|
|
├─ Multi-Head Attention (16 heads) |
|
|
└─ Feed-Forward Network (4x expansion) |
|
|
``` |
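
As a structural illustration only, the 10 + 6 layout can be sketched in PyTorch as below. This is not the repository's implementation: a causal depthwise convolution stands in for the `RWKVMambaHybrid` block (a closer sketch of that block appears under Technical Innovations), and `nn.TransformerEncoderLayer` stands in for the full-attention block.

```python
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    """Stand-in for the RWKV-Mamba hybrid block: causal depthwise conv + 4x FFN."""
    def __init__(self, d_model, kernel_size=4, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_mult * d_model), nn.GELU(),
                                 nn.Linear(ffn_mult * d_model, d_model))

    def forward(self, x):                               # x: (batch, seq, d_model)
        h = self.conv(self.norm1(x).transpose(1, 2))[..., : x.size(1)]
        x = x + h.transpose(1, 2)                       # causal mixing over time
        return x + self.ffn(self.norm2(x))              # 4x-expanded feed-forward

d_model, n_heads, seq_len = 512, 16, 256
hybrid = nn.ModuleList(CausalConvBlock(d_model) for _ in range(10))        # layers 1-10
attention = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                               batch_first=True, norm_first=True)
    for _ in range(6))                                                     # layers 11-16
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

x = torch.randn(2, seq_len, d_model)
for block in hybrid:
    x = block(x)
for block in attention:
    x = block(x, src_mask=causal_mask)
print(x.shape)   # torch.Size([2, 256, 512])
```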
|
|
|
|
|
## Comparison with i3-22M |
|
|
|
|
|
| Feature | i3-22M | i3-80M (This Model) |
|---------|--------|---------------------|
| **Parameters** | 22.6M | 82.77M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| **Total Tokens** | ~1M conversations | ~3M+ tokens |
| **Final Loss** | ~2.0 | ~2.0 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 |
| **Training Time** | ~17 hours | ~2-4 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers |
|
|
|
|
|
### Key Improvements Over i3-22M |
|
|
|
|
|
1. **Hybrid Architecture**: Introduces full multi-head attention in upper layers for better long-range dependencies |
|
|
2. **Larger Vocabulary**: 8x larger vocabulary (35,560 vs 4,466) for better token coverage |
|
|
3. **Multi-Dataset Training**: Trained on 3 diverse datasets vs single dataset |
|
|
4. **Better Generalization**: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences) |
|
|
5. **Enhanced Unknown Token Handling**: Robust `<UNK>` token system for out-of-vocabulary words
|
|
|
|
|
### When to Use Each Model |
|
|
|
|
|
**Use i3-22M if you need:** |
|
|
- Smaller model size (~22M params) |
|
|
- Pure conversational focus (TinyChat specialized) |
|
|
- Lower memory footprint |
|
|
- Faster inference |
|
|
|
|
|
**Use i3-80M if you need:** |
|
|
- Better general-purpose text generation |
|
|
- Stronger attention-based reasoning (6 attention layers) |
|
|
- Larger vocabulary coverage |
|
|
- Multi-domain text understanding (stories, chat, formal text) |
|
|
|
|
|
### Key Features |
|
|
|
|
|
1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention |
|
|
- Early layers use RWKV-Mamba hybrid for efficient sequence processing |
|
|
- Later layers use full multi-head attention for complex pattern recognition |
|
|
|
|
|
2. **Memory-Optimized Training**: |
|
|
- Streaming vocabulary building (no full text storage) |
|
|
- Vocabulary caching (build once, reuse) |
|
|
- Efficient chunk frequency counting |
|
|
- Automatic memory cleanup |
|
|
|
|
|
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding |
|
|
- TinyStories: Narrative and storytelling |
|
|
- TinyChat: Conversational dynamics |
|
|
- High-Quality English Sentences: Linguistic diversity |
|
|
|
|
|
4. **Smart Tokenization**: Variable-length chunking (2-3 chars) with common trigram optimization |
|
|
- Total tokens processed: **3,000,000+** |
|
|
   - Handles unknown tokens gracefully with an `<UNK>` token
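
As a rough illustration of the chunking scheme described above (streaming frequency counting, 2-3 character chunks, trigram preference, `<UNK>` fallback), here is a minimal sketch. It is a reconstruction for explanation only, not the tokenizer shipped with this model; the exact chunking rules and vocabulary format may differ.

```python
from collections import Counter
from typing import Iterable, Iterator

def iter_chunks(text: str, common_trigrams: set) -> Iterator[str]:
    """Split text into variable-length chunks, preferring known 3-char trigrams."""
    i = 0
    while i < len(text):
        tri = text[i:i + 3]
        if len(tri) == 3 and tri in common_trigrams:
            yield tri             # 3-char chunk for common trigrams
            i += 3
        else:
            yield text[i:i + 2]   # otherwise fall back to a 2-char chunk
            i += 2

def build_vocab(texts: Iterable[str], common_trigrams: set, max_size: int = 35_560) -> dict:
    """Stream over the corpus and count chunk frequencies without storing the full text."""
    counts = Counter()
    for text in texts:
        counts.update(iter_chunks(text, common_trigrams))
    vocab = {"<UNK>": 0}
    for chunk, _ in counts.most_common(max_size - 1):
        vocab[chunk] = len(vocab)
    return vocab

def encode(text: str, vocab: dict, common_trigrams: set) -> list:
    """Map chunks to ids, falling back to <UNK> for out-of-vocabulary chunks."""
    return [vocab.get(c, vocab["<UNK>"]) for c in iter_chunks(text, common_trigrams)]

# toy usage
trigrams = {"the", "ing", "and"}
vocab = build_vocab(["the cat and the dog sing"], trigrams)
print(encode("the dog was singing", vocab, trigrams))
```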
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Datasets**: |
|
|
- `agentlans/high-quality-english-sentences` |
|
|
- `roneneldan/TinyStories` |
|
|
- `starhopp3r/TinyChat` |
|
|
- **Training Steps**: 5,000 iterations |
|
|
- **Batch Size**: 4 (with gradient accumulation support) |
|
|
- **Learning Rate**: 3e-4 (with warmup and cosine decay) |
|
|
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0) |
|
|
- **Hardware**: NVIDIA P100 (16GB VRAM) |
|
|
- **Training Time**: ~2-4 hours |
|
|
- **Framework**: PyTorch |
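
A generic sketch of this optimizer setup (AdamW, linear warmup into cosine decay, gradient clipping at max norm 1.0) is shown below. It is an outline only, not the actual training script; `model`, `get_batch`, and the warmup length are placeholders.

```python
import math
import torch
import torch.nn.functional as F

lr, max_steps, warmup_steps, batch_size = 3e-4, 5_000, 200, 4   # warmup length is assumed
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)        # `model` is a placeholder

def lr_lambda(step):
    # linear warmup followed by cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(max_steps):
    input_ids, targets = get_batch(batch_size)                  # placeholder data loader
    logits = model(input_ids)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```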
|
|
|
|
|
### Training Dynamics |
|
|
|
|
|
- **GPU Utilization**: Stable at ~15-20% during training |
|
|
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
|
|
- **Power Usage**: ~40W average |
|
|
- **Throughput**: ~100-550 tokens/sec |
|
|
|
|
|
### Performance Metrics |
|
|
|
|
|
| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | ~1.7 |
| Perplexity | ~4000+ | ~6 |
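
For reference, the perplexity column appears to be the usual exp of the mean cross-entropy loss; as a quick check, the final values are consistent with that relationship:

```python
import math

final_loss = 1.7
print(math.exp(final_loss))   # ~5.5, in line with the reported final perplexity of ~6
```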
|
|
|
|
|
 |
|
|
> [!NOTE] |
|
|
> I don't know why the logging starts at step 4.6k.
|
|
|
|
|
How do **i3-22m** and **i3-80m** compare?
|
|
|
|
|
 |
|
|
|
|
|
The model shows strong convergence with stable training dynamics and efficient GPU utilization. |
|
|
|
|
|
## Usage |
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
# Load model and tokenizer |
|
|
# the i3 architecture ships custom modeling code, so trust_remote_code is required
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
|
|
|
|
|
# Generate text |
|
|
prompt = "hello" |
|
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
|
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # temperature/top_k only take effect when sampling is enabled
    temperature=0.8,
    top_k=40
)
|
|
generated_text = tokenizer.decode(outputs[0]) |
|
|
print(generated_text) |
|
|
``` |
|
|
|
|
|
|
|
|
## Technical Innovations |
|
|
|
|
|
1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics |
|
|
- Linear complexity for long sequences |
|
|
- Efficient recurrent processing |
|
|
   - State-space modeling for temporal dependencies (a simplified sketch follows this list)
|
|
|
|
|
2. **Hierarchical Processing**: |
|
|
- Lower layers focus on local patterns (conv/recurrent) |
|
|
- Upper layers capture global dependencies (attention) |
|
|
|
|
|
3. **Memory Efficiency**: |
|
|
- Streaming tokenization during vocab building |
|
|
- No full dataset storage in RAM |
|
|
- Automatic cleanup of intermediate data |
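
To make item 1 more concrete, below is a heavily simplified, self-contained sketch of a recurrence that blends RWKV-style time-mixing (interpolating each token with the previous one) with a Mamba-style diagonal state-space update. It is illustrative only and is not the `RWKVMambaHybrid` module used in this repository; in practice the Python loop would be replaced by a parallel scan or chunked recurrence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRWKVMambaHybrid(nn.Module):
    """Simplified illustration: RWKV-style token shift feeding a diagonal
    state-space recurrence with an input-dependent (selective) decay gate."""
    def __init__(self, d_model=512, d_state=32):
        super().__init__()
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))   # time-mixing weights
        self.in_proj = nn.Linear(d_model, d_state)
        self.gate = nn.Linear(d_model, d_state)                # input-dependent decay
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        # RWKV-style time-mixing: blend each token with the previous token
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]
        mixed = self.mix * x + (1 - self.mix) * prev

        # Mamba-style recurrence: h_t = a_t * h_{t-1} + u_t (linear in seq length)
        u = self.in_proj(mixed)
        a = torch.sigmoid(self.gate(mixed))      # per-step decay in (0, 1)
        h = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outputs = []
        for t in range(x.size(1)):
            h = a[:, t] * h + u[:, t]
            outputs.append(h)
        return x + self.out_proj(torch.stack(outputs, dim=1))  # residual connection

layer = TinyRWKVMambaHybrid()
print(layer(torch.randn(2, 256, 512)).shape)     # torch.Size([2, 256, 512])
```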
|
|
|
|
|
## Model Files |
|
|
|
|
|
- `pytorch_model.bin`: Model weights |
|
|
- `config.json`: Model configuration |
|
|
- `chunk_vocab_combined.json`: Tokenizer vocabulary |
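
If you want to inspect these files directly (for example the chunk vocabulary), they can be downloaded with `huggingface_hub`. This is a generic sketch that only assumes the filenames listed above; the structure of the vocabulary JSON is not documented here.

```python
import json
from huggingface_hub import hf_hub_download

# Fetch the tokenizer vocabulary and model configuration from the Hub
vocab_path = hf_hub_download("FlameF0X/i3-80m", "chunk_vocab_combined.json")
config_path = hf_hub_download("FlameF0X/i3-80m", "config.json")

with open(vocab_path) as f:
    vocab = json.load(f)
print(f"{len(vocab)} entries in the chunk vocabulary")

with open(config_path) as f:
    print(json.load(f))
```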
|
|
|
|
|
## Training Tracking |
|
|
|
|
|
This model was tracked using Weights & Biases (WandB) with comprehensive metrics: |
|
|
- Real-time loss and perplexity tracking |
|
|
- Gradient norm monitoring |
|
|
- Learning rate scheduling visualization |
|
|
- Generation samples logged to tables |
|
|
- Model checkpoints as artifacts |
|
|
- System resource monitoring |
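
The tracking above maps onto standard Weights & Biases calls. The sketch below is generic and not the exact training script; `loss`, `ppl`, `grad_norm`, `current_lr`, `step`, and `generated_text` are placeholders.

```python
import wandb

run = wandb.init(project="i3-80m", config={"lr": 3e-4, "batch_size": 4, "steps": 5000})

# inside the training loop (all logged values are placeholders here)
wandb.log({"train/loss": loss, "train/perplexity": ppl,
           "train/grad_norm": grad_norm, "train/lr": current_lr}, step=step)

# periodically log generation samples to a table
samples = wandb.Table(columns=["step", "prompt", "generation"])
samples.add_data(step, "hello", generated_text)
wandb.log({"samples": samples})

# save a checkpoint as a versioned artifact
artifact = wandb.Artifact("i3-80m-checkpoint", type="model")
artifact.add_file("pytorch_model.bin")
run.log_artifact(artifact)
run.finish()
```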
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained on English text only |
|
|
- Limited to a 256-token context window
|
|
- May require fine-tuning for specific downstream tasks |
|
|
- Conversational style influenced by TinyChat dataset |
|
|
|
|
|
## Model Series |
|
|
|
|
|
- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with pure hybrid architecture |
|
|
- **i3-80M** (This model) - Scaled version with attention layers and multi-dataset training |
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{i3-80m, |
|
|
author = {FlameF0X}, |
|
|
title = {i3-80M: Hybrid Architecture Language Model}, |
|
|
year = {2025}, |
|
|
publisher = {HuggingFace}, |
|
|
howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}} |
|
|
} |
|
|
``` |
|
|
```bibtex |
|
|
@article{mamba, |
|
|
title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces}, |
|
|
author={Gu, Albert and Dao, Tri}, |
|
|
journal={arXiv preprint arXiv:2312.00752}, |
|
|
year={2023} |
|
|
} |
|
|
@article{RWKV, |
|
|
title={RWKV: Reinventing RNNs for the Transformer Era}, |
|
|
author={Peng, Bo and others}, |
|
|
journal={arXiv preprint arXiv:2305.13048}, |
|
|
year={2023} |
|
|
}
```