Feedback

#100
by ChuckMcSneed - opened

Dear Team DeepSeek,

I've been using the deepseek-r1-0528 model for a while and wanted to share some thoughts on how it compares to the original deepseek-r1. There are some noticeable differences in performance and behavior that I think are worth discussing.

First of all, the style: it is more coherent, less schizo/ADHD. I also noticed that you've changed from GPT to Gemini distillation, so that’s another improvement.

Another area of improvement is SFW roleplay. Here the new R1 mogs the old one, no debate. The integrated in-character thinking that the new R1 does must be helping it a lot.

Now, the things that got worse. First, while it is more coherent, it is less creative. OG R1 could get wild; this one is tame. A positivity bias has also been introduced. The new R1 is less likely to righteously kill a character when that character chooses to act like a dumbass and poke the demon lord. It has also become worse at analyzing situations; old R1 could talk shit if everything was fucked up, whereas this one is more likely to stay positive, which I really don't like.

The default assistant's personality got worse too: it totally can't take banter. The new one is a really pissy, oversensitive snowflake that will give me suicide hotlines like Gemma (🤮) if I insult it during a heated programming session for making the same mistake 3 fucking times. The unreasonable refusals also got worse, especially the Western-style ones ("I cannot and will not fulfill this request" followed by a moral lecture); please filter them out of your dataset. I don't need a fucking robot moralizing at me; a simple "I can't do that" would be fine, thank you.

The thinking process has also changed, but not always for the better. For programming, it thinks more, which is okay, but it treats complex non-programming requests (which OG R1 could solve by "But wait"-ing a couple of times) as simple problems and completely fails them. It is also far more likely to completely ignore its own thinking process (where it had successfully solved the problem) and then write something else. This makes it completely useless for my obscure non-programming use case; I'm glad I can load up the old R1 at any moment.

What I've also noticed is a loss of trivia knowledge; the new one is noticeably worse at it. The SimpleQA drop from 30.1 to 27.8 understates how bad it actually is.

Some issues from the previous version remain. The "Oh, I've noticed the user liked the thing I said! Let me repeat it to the point where he hates it, that'd be cool!" behavior and the incredible stubbornness are still very, very annoying.

I like it overall, but it is not a full upgrade for me.

Some more feedback:

Trivia knowledge seems to be gone only from the default assistant persona. When I switch to a different one, it suddenly recognizes the trivia again. Strange, but it once again proves that the default persona is suboptimal.

What I also don't like is how inflexible the new one is in style (again, stubbornness). It simply wants to write in its default style no matter what (its pet words and paragraph structure leak through, along with that annoyingly inappropriate closing sentence). It even chooses to ignore what I ask it to do (in its thoughts it says things like "user asked me not to use X, but I really need to use it here"). The old one at least tried to comply in its thoughts. I really hate how it's too methed-up to write boring serious stuff (reports), yet too inflexible to write fun stuff the way I want it.

Still one of the best models, but needs wrestling.
