Kimi-K2-Mini

#1 by PSM24

Are there any plans to create a miniature model like Llama-4 Scout (100-200B params)?

A 100B MoE would be awesome; I can run that pretty decently on 32 GB VRAM + 60 GB DDR5 RAM.

100B with 1B active for CPU-bound inference and MLX

30B with maybe ~3B or ~6B active

I find it really frustrating that so many focus on making the models bigger and bigger. Many have asked for a smaller DeepSeek that can be run locally.

A smaller MoE on the DeepSeek architecture that fits on consumer hardware (8 GB VRAM, 32 GB RAM), say 30-50B total parameters with around 5-12B activated, would be appreciated.
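
For a rough sense of whether those sizes actually fit, here's some back-of-the-envelope math (just a sketch assuming ~4-bit quantized weights; real requirements also depend on the quant format, context length and KV cache):

```python
# Back-of-the-envelope memory estimate for the MoE sizes discussed above.
# Assumes ~4.5 bits/weight on average for a Q4-ish quant; treat as rough.

def approx_weight_gb(total_params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate memory needed just for the weights, in GB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

for total_b in (30, 50, 100):
    print(f"{total_b}B total @ ~Q4: ~{approx_weight_gb(total_b):.0f} GB of weights")

# 30B  -> ~17 GB (tight on 8 GB VRAM + 32 GB RAM, needs CPU offload)
# 50B  -> ~28 GB
# 100B -> ~56 GB (roughly what the 32 GB VRAM + 60 GB DDR5 setup above could hold)
```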

Hunyuan?

Moonshot AI org

@PSM24 We have a much smaller model: https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct , but it may be too small for you.

We don't have plans to provide a medium-sized model right now.

I hope you'll consider this in the future; relatively few people have the resources to run Kimi-K2, whereas models around 100B are significantly more accessible for home computer setups.

Thank you for answering our enquiries. That's unfortunate to hear. People have been wanting a model on the full DeepSeek architecture that can run on their home computers for so long, but R1 and now this one are just way too big. Most computers have around 32 GB RAM, so an MoE with 30-60B total parameters would be great, if you ever reconsider this.

That is unfortunate. Great job on Kimi-K2!

Moonshot AI org

Pruning or distilling a mid-sized model from a large one is far easier—and more efficient—than training from scratch. Now that Kimi-K2 is open-source, we're delighted to see users and organizations already taking on this challenge. Let's look forward to their achievements together.
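
For anyone curious what "distilling" means concretely here: the classic version is just a KL term pulling the student's token distribution toward the teacher's. A minimal PyTorch-style sketch (illustrative only, not Moonshot's actual pipeline; shapes and values are made up):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target KL loss: push the student's distribution toward the teacher's."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scaling by t**2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Illustrative usage with random logits (batch of 4, vocab size 1024):
student_logits = torch.randn(4, 1024, requires_grad=True)
teacher_logits = torch.randn(4, 1024)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```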

The community is already working on it: https://reddit.com/r/LocalLLaMA/comments/1ly9iqw/k2mini_successfully_compressed_kimik2_from_107t/

From what I can tell, it's mostly vibe-coded, and I don't have much hope that it'll perform well.

I think what will be more interesting is what the a-m-team does with the model. They've previously created quite a few large-scale, high-quality datasets, e.g. distilled from R1-0528 and Qwen3 235B, so the distillation side should be pretty interesting.

I find it really frustrating that so many focus on making the models bigger and bigger. Many have asked for a smaller DeepSeek that can be run locally.

A smaller MoE on the DeepSeek architecture that fits on consumer hardware (8 GB VRAM, 32 GB RAM), say 30-50B total parameters with around 5-12B activated, would be appreciated.

You don't get it: this is fantastic, and it's just the first step. It's extremely difficult to get models to do this, so labs go bigger to compensate for the training and tool calling. This model is not the end model. DeepSeek R1 was a breakthrough for open source; you don't run it locally, but because of it we have 8B distilled versions of Qwen3 that perform almost as well or even better. This is an R1 moment, not for home users but for distillation: having a model that can teach our small 8B or 32B models how to tool call as well as Anthropic's. Think about it... Anthropic doesn't have the smartest models, but theirs are the most wanted and used. Why? Tool calls! I know we won't be able to use this yet, but we need a ton of computed data to "shrink" the capabilities into smaller useful models.
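
To make the "computed data" point concrete: the usual recipe is to collect tool-call traces from the big model and fine-tune a small one on them. A rough sketch, assuming an OpenAI-compatible endpoint serving K2 (the base URL, model name, and example tool below are placeholders, not an official setup):

```python
import json
from openai import OpenAI

# Placeholder endpoint/model; point this at whatever serves Kimi-K2 for you.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def collect_trace(user_prompt: str) -> dict:
    """Ask the teacher model to (possibly) call a tool, and keep the trace for SFT."""
    resp = client.chat.completions.create(
        model="kimi-k2",  # placeholder model name
        messages=[{"role": "user", "content": user_prompt}],
        tools=tools,
    )
    msg = resp.choices[0].message
    return {
        "prompt": user_prompt,
        "tool_calls": [tc.model_dump() for tc in (msg.tool_calls or [])],
        "content": msg.content,
    }

# Append one training example per prompt to a JSONL file for later fine-tuning.
with open("toolcall_traces.jsonl", "a") as f:
    f.write(json.dumps(collect_trace("What's the weather in Berlin right now?")) + "\n")
```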

That is false; the R1 distills don't come close to performing like R1. They are based on a completely different architecture (Qwen), not the DeepSeek architecture, and they were a huge disappointment. I don't get why they don't just downscale the architecture instead of making such a mess.

I don't get why they don't just downscale the architecture instead of making such a mess.

Well, two reasons:

  1. It takes way, way more money to pretrain a smaller, "proper" model than to distill the big model into existing ones like Qwen. In other words, it would cost a lot more to pretrain a DeepSeek R1 14B from scratch than to just fine-tune Qwen2.5 14B on R1's outputs (rough numbers below).
  2. DeepSeek is somewhat limited in compute and (making an inference here) they'd much rather keep iterating on their big 685B R1.
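
To put rough numbers on point 1, using the common C ≈ 6·N·D compute rule of thumb (the token counts are illustrative guesses, not DeepSeek's actual figures):

```python
# Very rough training-compute comparison using C ≈ 6 * N * D
# (N = parameters, D = training tokens). Token counts are illustrative.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

pretrain = train_flops(14e9, 15e12)  # pretraining a 14B model on ~15T tokens
distill = train_flops(14e9, 1e9)     # fine-tuning the same 14B on ~1B tokens of R1 outputs

print(f"pretrain: {pretrain:.2e} FLOPs")        # ~1.26e+24
print(f"distill:  {distill:.2e} FLOPs")         # ~8.40e+19
print(f"ratio:    ~{pretrain / distill:,.0f}x")  # ~15,000x less compute to distill
```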
