Kimi-K2-Mini

#1 by PSM24

Are there any plans to create a miniature model like Llama-4 Scout (100-200B params)?

A 100B MoE would be awesome; I can run that pretty decently on 32 GB VRAM + 60 GB DDR5 RAM.

100B with 1B active for CPU-bound inference and MLX

30B with maybe ~3B or ~6B active

I find it really frustrating that so many focus on making the models bigger and bigger. Many have asked for a smaller DeepSeek that can be run locally.

A smaller MoE on the DeepSeek architecture that fits on consumer hardware (8 GB VRAM, 32 GB RAM), say 30-50B total parameters with around 5-12B activated, would be appreciated.
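
For a rough sense of whether those sizes actually fit, here's some back-of-the-envelope math (just a sketch assuming ~4-bit quantized weights; real requirements also depend on the quant format, context length and KV cache):

```python
# Back-of-the-envelope memory estimate for the MoE sizes discussed above.
# Assumes ~4.5 bits/weight on average for a Q4-ish quant; treat as rough.

def approx_weight_gb(total_params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate memory needed just for the weights, in GB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

for total_b in (30, 50, 100):
    print(f"{total_b}B total @ ~Q4: ~{approx_weight_gb(total_b):.0f} GB of weights")

# 30B  -> ~17 GB (tight on 8 GB VRAM + 32 GB RAM, needs CPU offload)
# 50B  -> ~28 GB
# 100B -> ~56 GB (roughly what the 32 GB VRAM + 60 GB DDR5 setup above could hold)
```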

Hunyuan?

Moonshot AI org

@PSM24 We have a much smaller model: https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct , but it may be too small for you.

We don't have plans to provide a medium-sized model right now.

I hope you'll consider this in the future; relatively few people have the resources to run Kimi-K2, whereas models around 100B are significantly more accessible for home computer setups.

Thank you for answering our enquiries. That's unfortunate to hear. People have been wanting a model on the full DeepSeek architecture that can run on their home computers for so long, but R1 and now this one are just way too big. Most computers have around 32 GB RAM, so an MoE with 30-60B total parameters would be great, if you ever reconsider this.

That is unfortunate. Great job on Kimi-K2!

Moonshot AI org

Pruning or distilling a mid-sized model from a large one is far easier—and more efficient—than training from scratch. Now that Kimi-K2 is open-source, we're delighted to see users and organizations already taking on this challenge. Let's look forward to their achievements together.
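
For anyone curious what "distilling" means concretely here: the classic version is just a KL term pulling the student's token distribution toward the teacher's. A minimal PyTorch-style sketch (illustrative only, not Moonshot's actual pipeline; shapes and values are made up):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target KL loss: push the student's distribution toward the teacher's."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scaling by t**2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Illustrative usage with random logits (batch of 4, vocab size 1024):
student_logits = torch.randn(4, 1024, requires_grad=True)
teacher_logits = torch.randn(4, 1024)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```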

The community is already working on it: https://reddit.com/r/LocalLLaMA/comments/1ly9iqw/k2mini_successfully_compressed_kimik2_from_107t/

From what I can tell, it's mostly vibe-coded, and I don't have much hope that it'll perform well.

I think what will be more interesting is what the a-m-team does with the model. They've previously created quite a few large-scale, high-quality datasets, e.g. distilled from R1-0528 and Qwen3 235B, so the distillation side should be pretty interesting.

I find it really frustrating that so many focus on making the models bigger and bigger. Many have asked for a smaller DeepSeek that can be run locally.

A smaller MoE on the DeepSeek architecture that fits on consumer hardware (8 GB VRAM, 32 GB RAM), say 30-50B total parameters with around 5-12B activated, would be appreciated.

You don't get it: this is fantastic, and it's just the first step. It's extremely difficult to get models to do this, so labs go bigger to compensate for the training and tool calling. This model is not the end model. DeepSeek R1 was a breakthrough for open source; you don't run it locally, but because of it we have 8B distilled versions of Qwen3 that perform almost as well or even better. This is an R1 moment, not for home users but for distillation: having a model that can teach our small 8B or 32B models how to tool call as well as Anthropic's. Think about it... Anthropic doesn't have the smartest models, but theirs are the most wanted and used. Why? Tool calls! I know we won't be able to use this yet, but we need a ton of computed data to "shrink" the capabilities into smaller useful models.
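
To make the "computed data" point concrete: the usual recipe is to collect tool-call traces from the big model and fine-tune a small one on them. A rough sketch, assuming an OpenAI-compatible endpoint serving K2 (the base URL, model name, and example tool below are placeholders, not an official setup):

```python
import json
from openai import OpenAI

# Placeholder endpoint/model; point this at whatever serves Kimi-K2 for you.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def collect_trace(user_prompt: str) -> dict:
    """Ask the teacher model to (possibly) call a tool, and keep the trace for SFT."""
    resp = client.chat.completions.create(
        model="kimi-k2",  # placeholder model name
        messages=[{"role": "user", "content": user_prompt}],
        tools=tools,
    )
    msg = resp.choices[0].message
    return {
        "prompt": user_prompt,
        "tool_calls": [tc.model_dump() for tc in (msg.tool_calls or [])],
        "content": msg.content,
    }

# Append one training example per prompt to a JSONL file for later fine-tuning.
with open("toolcall_traces.jsonl", "a") as f:
    f.write(json.dumps(collect_trace("What's the weather in Berlin right now?")) + "\n")
```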

That is false; the R1 distills don't come close to performing like R1. They are based on a completely different architecture (Qwen), not the DeepSeek architecture, and they were a huge disappointment. I don't get why they don't just downscale the architecture instead of making such a mess.

I don't get why they don't just downscale the architecture instead of making such a mess.

Well, two reasons:

  1. It takes way, way more money to pretrain a smaller, "proper" model than to distill the big model into existing ones like Qwen. In other words, it would cost a lot more to pretrain a DeepSeek R1 14B from scratch than to just fine-tune Qwen2.5 14B on R1's outputs (rough numbers below).
  2. DeepSeek is somewhat limited in compute and (making an inference here) they'd much rather keep iterating on their big 685B R1.
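
To put rough numbers on point 1, using the common C ≈ 6·N·D compute rule of thumb (the token counts are illustrative guesses, not DeepSeek's actual figures):

```python
# Very rough training-compute comparison using C ≈ 6 * N * D
# (N = parameters, D = training tokens). Token counts are illustrative.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

pretrain = train_flops(14e9, 15e12)  # pretraining a 14B model on ~15T tokens
distill = train_flops(14e9, 1e9)     # fine-tuning the same 14B on ~1B tokens of R1 outputs

print(f"pretrain: {pretrain:.2e} FLOPs")        # ~1.26e+24
print(f"distill:  {distill:.2e} FLOPs")         # ~8.40e+19
print(f"ratio:    ~{pretrain / distill:,.0f}x")  # ~15,000x less compute to distill
```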
