Nano Kimi K2 Style MoE

A scaled-down MoE model inspired by Kimi K2's shared expert architecture.

Model Details

  • Parameters: 234.8M total
  • Active Parameters per Token: ~90M (40% of total)
  • Architecture:
    • 12 transformer layers
    • Layers 4-9: MoE with 4 routed experts + 1 shared expert
    • Layers 0-3, 10-11: Dense transformer
  • Routing: Top-2 expert selection
  • Training: 100 steps (see the configuration sketch below)
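
The listed hyperparameters can be gathered into a small configuration object. This is a minimal sketch for orientation only; the class and field names are assumptions, not the repository's actual config:

from dataclasses import dataclass

@dataclass
class NanoMoEConfig:
    # Illustrative config mirroring the specs above; names are assumptions.
    n_layers: int = 12
    moe_layer_ids: tuple = tuple(range(4, 10))  # layers 4-9 use MoE, all others are dense
    n_routed_experts: int = 4                   # routed experts per MoE layer
    n_shared_experts: int = 1                   # always-active shared expert
    top_k: int = 2                              # top-2 expert selection
    max_seq_len: int = 2048
    train_steps: int = 100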

Architecture Highlights

Shared Expert Mechanism (DeepSeek-V3 style):

  • 1 always-active shared expert (baseline knowledge)
  • 4 routed experts (specialization)
  • Top-2 routing with load balancing (illustrated in the sketch below)
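
A minimal PyTorch sketch of this block, assuming a standard softmax router with top-2 selection; module names, hidden sizes, and the exact gating math are illustrative, not the repository's implementation:

import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    # Illustrative block: 1 always-active shared expert + top-2 of 4 routed experts.
    def __init__(self, d_model=512, d_ff=2048, n_experts=4, top_k=2):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared_expert = ffn()                                       # applied to every token
        self.experts = nn.ModuleList([ffn() for _ in range(n_experts)])  # specialized routed experts
        self.router = nn.Linear(d_model, n_experts, bias=False)          # routing logits
        self.top_k = top_k

    def forward(self, x):                                        # x: (batch, seq, d_model)
        out = self.shared_expert(x)                              # shared-expert contribution
        probs = self.router(x).softmax(dim=-1)                   # (batch, seq, n_experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)     # top-2 selection
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize the selected weights
        for e, expert in enumerate(self.experts):                # dense loop for clarity, not efficiency
            gate = (weights * (idx == e)).sum(dim=-1, keepdim=True)  # gate for tokens routed to expert e
            out = out + gate * expert(x)
        return out

During training, a load-balancing term (see Training Details) is added on top of the router probabilities to keep expert usage even.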

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the custom MoE model from the Hub (trust_remote_code is needed for the custom architecture)
model = AutoModelForCausalLM.from_pretrained("Abduali/nano-moe-mla", trust_remote_code=True)
# The model reuses the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate a short continuation of a prompt
prompt = "The future of AI"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))

Training Details

  • Optimizer: AdamW (lr=3e-4, β1=0.9, β2=0.95)
  • Load balancing loss coefficient: 0.01
  • Max sequence length: 2048
  • Gradient clipping: 1.0 (see the training-step sketch below)
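
The sketch below shows how these settings fit together in one training step; the model's return signature (LM loss plus auxiliary load-balancing loss) and the helper names are assumptions, not the actual training script:

import torch

def build_optimizer(model):
    # AdamW with the hyperparameters listed above
    return torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

def train_step(model, optimizer, batch, aux_loss_coef=0.01):
    # Assumed return signature: (language-modeling loss, router load-balancing loss)
    lm_loss, aux_loss = model(batch["input_ids"], labels=batch["labels"])
    loss = lm_loss + aux_loss_coef * aux_loss                  # add the weighted load-balancing term
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping at norm 1.0
    optimizer.step()
    return loss.item()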

Citation

If you use this model, please cite:

@misc{nano-moe-2024,
  title={Nano Kimi K2 Style MoE},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/Abduali/nano-moe-mla}}
}

License

Apache 2.0
