# Nano Kimi K2 Style MoE
A scaled-down MoE model inspired by Kimi K2's shared expert architecture.
## Model Details

- Parameters: 234.8M total
- Active parameters per token: ~90M (about 40% of total)
- Architecture (see the layer-layout sketch after this list):
  - 12 transformer layers
  - Layers 4-9: MoE with 4 routed experts + 1 shared expert
  - Layers 0-3 and 10-11: dense transformer
- Routing: Top-2 expert selection
- Training: 100 steps
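The layer layout can be summarized in a small configuration sketch. This is illustrative only: the key names (`moe_layer_indices`, `n_routed_experts`, etc.) are assumptions for readability, not the model's actual config schema.

```python
# Illustrative layer-layout sketch; key names are hypothetical, not the
# model's real config schema.
moe_layers = list(range(4, 10))          # layers 4-9 are MoE blocks

config = {
    "n_layers": 12,
    "moe_layer_indices": moe_layers,     # all other layers are dense
    "n_routed_experts": 4,               # routed experts per MoE layer
    "n_shared_experts": 1,               # always-active shared expert
    "top_k": 2,                          # top-2 routing
}

dense_layers = [i for i in range(config["n_layers"]) if i not in moe_layers]
print("MoE layers:", moe_layers)         # [4, 5, 6, 7, 8, 9]
print("Dense layers:", dense_layers)     # [0, 1, 2, 3, 10, 11]
```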
## Architecture Highlights
Shared Expert Mechanism (DeepSeek-V3 style):
- 1 always-active shared expert (baseline knowledge)
- 4 routed experts (specialization)
- Top-2 routing with load balancing
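Below is a minimal PyTorch sketch of how a shared-expert MoE block like this can be wired, assuming standard feed-forward experts and softmax top-2 routing. Module names, hidden sizes, and the dense (non-dispatched) expert evaluation are illustrative, not the model's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(d_model, d_ff):
    # A plain feed-forward expert (sizes are placeholders)
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_routed=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.shared_expert = ffn(d_model, d_ff)                    # always active
        self.experts = nn.ModuleList(ffn(d_model, d_ff) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)                 # routing scores

    def forward(self, x):                                          # x: (batch, seq, d_model)
        out = self.shared_expert(x)                                # baseline path for every token
        probs = F.softmax(self.router(x), dim=-1)                  # (batch, seq, n_routed)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)            # top-2 experts per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)            # renormalize the two weights
        # For clarity every expert runs on every token and is masked afterwards;
        # a real implementation would dispatch only the selected tokens.
        for e, expert in enumerate(self.experts):
            weight = (top_p * (top_idx == e)).sum(dim=-1, keepdim=True)  # 0 if expert e not chosen
            out = out + weight * expert(x)
        return out

x = torch.randn(2, 8, 512)
print(SharedExpertMoE()(x).shape)                                  # torch.Size([2, 8, 512])
```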
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Custom architecture, so trust_remote_code=True is required
model = AutoModelForCausalLM.from_pretrained("Abduali/nano-moe-mla", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 tokenizer

prompt = "The future of AI"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
## Training Details
- Optimizer: AdamW (lr=3e-4, β1=0.9, β2=0.95)
- Load balancing loss coefficient: 0.01
- Max sequence length: 2048
- Gradient clipping: 1.0
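The following is a hedged sketch of the training pieces listed above. The auxiliary loss is the common Switch-Transformer-style load-balancing term; whether this model uses exactly that formulation is an assumption, and `model`, `lm_loss`, and the router outputs are placeholders.

```python
import torch

def load_balancing_loss(router_probs, top_idx, n_experts=4):
    """router_probs: (tokens, n_experts) softmax router outputs;
    top_idx: (tokens, top_k) indices of the selected experts."""
    dispatch = torch.zeros_like(router_probs).scatter_(1, top_idx, 1.0)
    frac_tokens = dispatch.mean(dim=0)        # fraction of tokens sent to each expert
    frac_probs = router_probs.mean(dim=0)     # mean router probability per expert
    return n_experts * torch.sum(frac_tokens * frac_probs)

# Optimizer and loss with the hyperparameters listed above (placeholders commented out):
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
# total_loss = lm_loss + 0.01 * load_balancing_loss(router_probs, top_idx)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```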
## Citation

If you use this model, please cite:

```bibtex
@misc{nano-moe-2024,
  title={Nano Kimi K2 Style MoE},
  author={Your Name},
  year={2024},
  howpublished={\url{Abduali/nano-moe-mla}}
}
```
## License
Apache 2.0