Nano Kimi K2 Style MoE

A scaled-down MoE model inspired by Kimi K2's shared expert architecture.

Model Details

  • Parameters: 234.8M total
  • Active Parameters per Token: ~90M (40% of total)
  • Architecture:
    • 12 transformer layers
    • Layers 4-9: MoE with 4 routed experts + 1 shared expert
    • Layers 0-3, 10-11: Dense transformer
  • Routing: Top-2 expert selection
  • Training: 100 steps (see the configuration sketch below)
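
The listed hyperparameters can be gathered into a small configuration object. This is a minimal sketch for orientation only; the class and field names are assumptions, not the repository's actual config:

from dataclasses import dataclass

@dataclass
class NanoMoEConfig:
    # Illustrative config mirroring the specs above; names are assumptions.
    n_layers: int = 12
    moe_layer_ids: tuple = tuple(range(4, 10))  # layers 4-9 use MoE, all others are dense
    n_routed_experts: int = 4                   # routed experts per MoE layer
    n_shared_experts: int = 1                   # always-active shared expert
    top_k: int = 2                              # top-2 expert selection
    max_seq_len: int = 2048
    train_steps: int = 100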

Architecture Highlights

Shared Expert Mechanism (DeepSeek-V3 style):

  • 1 always-active shared expert (baseline knowledge)
  • 4 routed experts (specialization)
  • Top-2 routing with load balancing (illustrated in the sketch below)
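
A minimal PyTorch sketch of this block, assuming a standard softmax router with top-2 selection; module names, hidden sizes, and the exact gating math are illustrative, not the repository's implementation:

import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    # Illustrative block: 1 always-active shared expert + top-2 of 4 routed experts.
    def __init__(self, d_model=512, d_ff=2048, n_experts=4, top_k=2):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared_expert = ffn()                                       # applied to every token
        self.experts = nn.ModuleList([ffn() for _ in range(n_experts)])  # specialized routed experts
        self.router = nn.Linear(d_model, n_experts, bias=False)          # routing logits
        self.top_k = top_k

    def forward(self, x):                                        # x: (batch, seq, d_model)
        out = self.shared_expert(x)                              # shared-expert contribution
        probs = self.router(x).softmax(dim=-1)                   # (batch, seq, n_experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)     # top-2 selection
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize the selected weights
        for e, expert in enumerate(self.experts):                # dense loop for clarity, not efficiency
            gate = (weights * (idx == e)).sum(dim=-1, keepdim=True)  # gate for tokens routed to expert e
            out = out + gate * expert(x)
        return out

During training, a load-balancing term (see Training Details) is added on top of the router probabilities to keep expert usage even.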

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the custom MoE model from the Hub (trust_remote_code is needed for the custom architecture)
model = AutoModelForCausalLM.from_pretrained("Abduali/nano-moe-mla", trust_remote_code=True)
# The model reuses the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate a short continuation of a prompt
prompt = "The future of AI"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))

Training Details

  • Optimizer: AdamW (lr=3e-4, β1=0.9, β2=0.95)
  • Load balancing loss coefficient: 0.01
  • Max sequence length: 2048
  • Gradient clipping: 1.0 (see the training-step sketch below)
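
The sketch below shows how these settings fit together in one training step; the model's return signature (LM loss plus auxiliary load-balancing loss) and the helper names are assumptions, not the actual training script:

import torch

def build_optimizer(model):
    # AdamW with the hyperparameters listed above
    return torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

def train_step(model, optimizer, batch, aux_loss_coef=0.01):
    # Assumed return signature: (language-modeling loss, router load-balancing loss)
    lm_loss, aux_loss = model(batch["input_ids"], labels=batch["labels"])
    loss = lm_loss + aux_loss_coef * aux_loss                  # add the weighted load-balancing term
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping at norm 1.0
    optimizer.step()
    return loss.item()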

Citation

If you use this model, please cite:

@misc{nano-moe-2024,
  title={Nano Kimi K2 Style MoE},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/Abduali/nano-moe-mla}}
}

License

Apache 2.0
