There's got to be a better way.
Can someone please explain to me why huge LLMs that are now over 20 GiB at 4 bit precision, including this one and Qwen3.5, hallucinate like mad when it comes to humanity's most popular knowledge?
This makes no sense to me since a relational database a small fraction of this size can store humanity's core popular knowledge from across the globe and history with PERFECT accuracy, such as actors linked to movies, singers to albums, and authors to books.
And even tiny little Llama 3.2 3b correctly answered far more of my popular music, movie, TV, video game... questions than this model and Qwen3.5 34b, so even without fusing with a relational database there's got to be a better way to keep these huge models from constantly spewing nonsense about very popular things.
And I'm not talking about RAG, which breaks the organic flow, greatly increases latency, relies on external and unstructured piles of data, and so on. I'm talking about next token prediction that's somehow fused with a skeleton structure of immutable factuality so that huge AI models that are 10s, even 100s, of gigabytes are not constantly getting basic facts wrong about the world's most popular information that can fit into a relational database that's far smaller.
Why the hell is a >80 GiB general purpose AI model at 16-bit precision constantly telling me that some of the most popular songs ever released were sung by people who weren't even alive at the time and never released a song with a remotely similar title, when a 100% accurate relational database of the same most popular songs, albums, and singers from around the globe and through time is only 10s of megabytes? There's got to be a better way.
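For a sense of scale, here's a minimal sketch of what I mean (the schema, names, and example row are just hypothetical illustrations using Python's sqlite3); a few million rows like this still fits in tens of megabytes, and every lookup is exact:

```python
import sqlite3

# Hypothetical minimal schema for linking artists to songs.
# Table, column, and row values are illustrative, not from any real dataset.
conn = sqlite3.connect("core_music_facts.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS artists (
    artist_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS songs (
    song_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    year      INTEGER,
    artist_id INTEGER NOT NULL REFERENCES artists(artist_id)
);
""")

# One row of ground truth: the link between a singer and a hit song.
conn.execute("INSERT INTO artists (artist_id, name) VALUES (1, 'Taylor Swift')")
conn.execute("INSERT INTO songs (title, year, artist_id) VALUES ('Shake It Off', 2014, 1)")
conn.commit()

# Lookups like this are exact, not probabilistic.
row = conn.execute("""
    SELECT a.name FROM songs s JOIN artists a ON a.artist_id = s.artist_id
    WHERE s.title = 'Shake It Off'
""").fetchone()
print(row[0])  # Taylor Swift
```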
It is natural that knowledge of popular culture is relatively pushed to a lower priority compared to other types of knowledge, because its range of application is limited. Consider objectively which model would have greater value in terms of productivity and usability: an AI specialized in tool usage and reasoning, or an encyclopedia-style AI with extensive knowledge of popular culture.
If you're looking for a model that only knows basic facts like that, then your best bet is to find a capable small model (let's say Qwen3.5 4B), gather all of the information you would want it to know, and then apply Doc-2-Lora. This will convert all of your facts into LoRA adapters which load as model weights at inference time.
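I haven't used Doc-2-Lora myself, but assuming it spits out a standard PEFT-style LoRA adapter directory, wiring it up at inference would look roughly like this (the model id and adapter path are placeholders, not verified names):

```python
# Rough sketch, assuming the fact-tuning step produces a standard PEFT LoRA adapter.
# Model id and adapter path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B")  # placeholder model id
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-4B")

# Attach the LoRA adapter holding the distilled facts; its weights apply at inference.
model = PeftModel.from_pretrained(base, "./my_facts_lora_adapter")
model = model.merge_and_unload()  # optionally fold adapter weights into the base model

prompt = "Who sang 'Shake It Off'?"
inputs = tok(prompt, return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```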
4 bit precision
also keep in mind, models just perform very differently on some domains when they are quantized! it's one thing to say it lost 20% accuracy, but it's something else when you see 15% of it comes from general knowledge for instance! i would test the 16bit first
@MaziyarPanahi Yeah, it loses a bit of accuracy but surprisingly little when comparing the full float and Q4_K_M versions. I've never seen it perform more than ~1 or 2 points lower on full recall tests like mine and SimpleQA. Even the hallucinations are usually the same.
And @cocorang , I simply refuse to accept that one of the top 5 largest industries globally, which packs huge stadiums, makes people like Taylor Swift super rich, and is used by most people while walking or in the gym, is back burner low priority shit. I'm STEM to the core and don't even watch TV and movies, but I find the information elitism in the AI community unbearable. This isn't niche, and the same goes for the other wildly popular pop culture information like movies and video games. It's being discarded because companies like Nvidia and Alibaba are trading it for tiny gains on silly multiple choice math and coding tests. It's low integrity and cringe.
The core information from the most popular domains of knowledge at the heart of humanity consumes a tiny fraction of the space of these AI models. It simply must be in them. They aren't AI models if it isn't. They're just tools.
The transformer architecture has reached its limits. All people are doing recently is trading this for that. It's time for people to find something better, or a way to improve the architecture. Factual hallucinations about very common knowledge are giving AI a really bad name across the general population. It needs to be addressed. I'm just an observer who doesn't know squat about such things, so all I can offer is my bitchy reminders.
Perhaps there's a way to put humanity's immutable core facts into static layers that influence the weights during training, and token generation during inference, but which are never modified themselves, so that the models can largely function like they currently do without constantly falling off the tracks and becoming little more than hallucination generators of humanity's most cherished information.
At present, creating a fully developed AGI is an extremely difficult task. The process will require enormous capital, time, and effort. Therefore, it is only natural to prioritize strengthening data specialized in productive tasks that can sustain capital and allow the work to continue. Enhancing knowledge of popular culture is something to be addressed once learning techniques and agentic capabilities have become sufficiently stable.
I would like to ask the reverse. Can a model that knows popular culture make money?
@cocorang Yes, I honestly think they can make money. In fact, it may be the only way to make real money with AI since there are hundreds of millions of potential customers out there.
By far the most used models by the general population are ChatGPT and Gemini. That wouldn't be the case if they constantly hallucinated about things the users were interested in. Those models, thanks largely to their immense sizes and broad training, rarely hallucinate about popular things in chat mode with thinking and web search off. And Gemini has a low hallucination rate even when it comes to pretty esoteric pockets of knowledge.
People seem to think AI is about coding, business applications and agentic use cases, but as someone who's tested AI models rather extensively that's unlikely to ever be the case, at least with the current transformer architecture. The precision such tasks require to achieve a net positive effect is nowhere near becoming a reality. You'll just end up compromising your security, cryptocurrency... or even end up bombing a school and killing hundreds of kids.
Any economic growth thus far is about the potential of AI, which is why people like the Goldman Sachs chief economist recently said there hasn't been a net positive economic benefit yet, and even if coding, science, business... productivity increases exist they are too small to be easily seen. AI models are just way too error prone, no matter what the task. For example, even in the handful of DLSS 5 examples released there are tons of random errors that make it a much worse gaming experience, such as a shadow being turned into a large nostril and human eyes blinking like a lizard's.
In short, there are hundreds of millions of potential customers out there for personal assistants, sages, eyes (for the blind), elderly care, emotional support..., and no AI model that hallucinates at anywhere near the rate of this one and Qwen3.5 could possibly function in any of those capacities. Tools are great, but label them appropriately, such as putting code, math, and STEM in their names. Don't try to pass them off as general purpose AI models.
Ah, so that was the point? I understand.
However, from my perspective, I don't think that model is packaged as a general-purpose AI.
I see it more as a model whose overall dataset is heavily focused on problem-solving and logical reasoning.
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16#public-dataset
Trivia has no place in the core models. I'm serious. We have RAG, we have fuzzy and semantic search, we have extremely low barrier web search MCPs. We are post-trivia-embedded-in-weights. If you want your model to know who starred in TV shows from 30 years ago, hook up an MCP.
The model needs core reasoning, and testing it (without tools like the above) on trivia is just silly today. Fill that space with expertise about statistical tests and logical fallacies instead.
Einstein said never to memorize what you can look up in a book. This is how models should work now, today.
Their target audience is folks with hilariously high token prefill and throughput, not homegamers. The recent trend in courts regarding authors' works punishes the rampant data harvesting they've done so far. So it's easier to just lobotomize that stuff out and rely on the end user to RAG out to all the harvested data they need.
@Tibbnak True, but I'm talking about a small amount of widely available perfectly legal information linking singers to songs, authors to books, actors to movies, and so on. Just to keep AI models from completely going off the rails at every turn about very popular subjects known to countless millions of people.
There are only a handful of major pop culture domains, with the big ones being movies, TV shows, music, video games, and books. The core information of which from around the globe and throughout recorded history could fit in a relational database that's less than 1 gigabyte. So there's simply no excuse for such massive AI models to be so profoundly ignorant about humanity's most popular knowledge.
The contentious stuff like news articles from the internet archive and the pages of a novel produce little to no unique core information. And any valuable information they do include, such as a notable news story, becomes part of history and is recorded in numerous places like Wikipedia, which again, is 100% legal to use when training AI models.
And RAG isn't a viable solution. It's not just about Q&A. Complex tasks like writing stories, jokes, metaphors... require all the information to be tied together in an organic way. Breaking the organic flow and adding tons of latency to check an external unstructured database makes it unusable for >99% of the world's population, and it certainly isn't the path to AGI.
My simple answer: It's not a Google replacement at this size.
@urtuuuu I fully agree that it can never be a Google replacement at this size, nor should it attempt to be.
My point is it shouldn't be flooding hallucinations about humanity's core information when it's 20 GB at 4 bits and 80 GB at 16 bits. That's orders of magnitude larger than a relational database storing humanity's core information at 100% accuracy.
Nobody expects it to have esoteric information like dialogue from movies and books, full song lyrics and so on. But attributing hit songs to artists who weren't even alive at the time, and about half the time, aren't even singers, is simply unacceptable for even much smaller AI models.
Something is way, way off. Swiss cheese AI models filled with gaping holes in popular knowledge and abilities are simply not OK, and that's not my personal opinion. This must be addressed ASAP or AI is dead in the water.
And based on my limited perception of how AI models work, having all the weights change during training is unacceptable. Imagine if, while humans grew up and learned, their vision and audio perceptions changed as they took in new information. We would start perceiving visual and audio hallucinations and descend into incoherent madness. Ground truths must remain immutable during training. We must do something like start with a core AI model filled with irrefutable truths, then splice in new layers and begin training, with only the new layer weights being allowed to change. One way or another, AI models need both adaptability and retention of irrefutable truths.
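I don't know whether this is viable at scale, but the mechanical part of the idea (train only newly spliced layers while the core stays immutable) is trivial in today's frameworks; here's a toy PyTorch sketch with made-up module names:

```python
import torch
import torch.nn as nn

# Hypothetical toy model: a frozen "core" block plus a newly spliced trainable block.
class SplicedModel(nn.Module):
    def __init__(self, core: nn.Module, hidden: int):
        super().__init__()
        self.core = core                      # pretrained on curated ground truths
        self.new_block = nn.Sequential(       # newly added capacity for broad training
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )

    def forward(self, x):
        return self.new_block(self.core(x))

core = nn.Linear(512, 512)                    # stand-in for the frozen core layers
for p in core.parameters():
    p.requires_grad = False                   # core weights never change

model = SplicedModel(core, 512)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # optimizer only sees the new layers

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()                 # dummy loss just to show the update path
loss.backward()
optimizer.step()                              # only new_block's weights move
```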
It's not for Nemotron-Cascade-2-30B-A3B, but you can find useful tests of different quant levels (mainly GGUFs) of Qwen3.5-35B-A3B and Qwen3.5-27B here: https://huggingface.co/ubergarm/Qwen3.5-27B-GGUF/discussions/3#69c0936f37524d793c2d5f9c
I know it's not the same architecture as Cascade-2, but I think it's still worth having a look.
Really great work by @espen96 and @cmh
LLMs do not learn facts. For factual accuracy it is "compression by probability". The better the model is, the better it is at getting to that answer.
So Nemotron is not trained on absolutely everything evenly, and is trying to focus on what Nvidia thinks it needs to know to do the tasks it is made for. They can only optimize for so much.
They are swiss cheese! Do not get fooled into thinking the frontier models are any better! They are trained on largely the same data.
They need massive scale to be able to access most of the data not optimized for! And a lot of what you are noticing is truly something they are horrific at overall, because the training material is horrendous, and it is not a core training goal.
It is NOT anything like looking up facts.
The raw 15T token fineweb is 58.4-108 TB depending on the format.
Boil that down to 5T tokens, like finefineweb did for their analysis, and we still see 23 terabytes of data.
Of that, very little is not English, and little of it is less serious topics. A lot of it is noisy and hard to learn from; we are missing social media and context clues.
The rest of the model is books, papers, documents, synthetic, code, math.
Even a 1T model cannot do this properly. It is all probability of getting it right.

Take this from my WIP factual accuracy benchmark.
Here Qwen 3.5 35B demonstrates it well.
Trying to finish "Regarding the spectral classification of stars, the Sun is technically classified as a": it KNOWS the answer, but it has options and many ways to get there. The raw probability mass is muddied, but it would know; it has to get to the answer.
Same here for "Freddie Mercury was the lead vocalist of the rock band Queen. His birth name was".
But note how it is not a simple thing to check for.
Here it does not know it well; it cannot be said to have a good chance at finishing the sentence "Before her solo career, the Japanese artist LiSA was the vocalist for the fictional band Girls Dead Monster in the anime series titled". But it might! It is more likely to accidentally pick a path incompatible with the probability of the truthful answer.
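For anyone who wants to poke at this themselves, here's a rough sketch of the kind of probability check I'm describing (the model id is a placeholder; swap in whatever you're testing, and note it assumes the prompt tokenization is a prefix of the joint tokenization):

```python
# Rough sketch of probing factual recall via the log-probability of a target completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"  # placeholder; use whatever model you are testing
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for each next token
    targets = full_ids[:, 1:]
    start = prompt_ids.shape[1] - 1                         # only score the completion tokens
    per_token = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)[0, start:]
    return per_token.sum().item()

print(completion_logprob(
    "Regarding the spectral classification of stars, the Sun is technically classified as a",
    " G-type main-sequence star (G2V)"))
```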
As for RAG: do you find it easy to search and do research when you don't know what you are after? LLMs do not have a good understanding of their knowledge and their limits.
They think they know things they do not. They will research falsely, confirm based on outdated data, follow the wrong leads, dismiss information that seems off.
35B results: see how they retain knowledge in certain areas but quickly erode in others?
As much as I want LLMs to know as much as possible, they are NOT relational databases. They ingest unfathomable amounts of data and spit out token after token, and you have to hope they are able to spit out the correct factual sequence. Recalling facts is hard! Very hard!
A lot of our popular knowledge is not that well documented, and the data where it is found and documented is not ideal for training.
LLMs appear to learn the shape of the facts first, they understand how to predict the text, but it takes so much more to consistently get the critical facts right.
It is NEVER a Google replacement! They are NOT encyclopedic machines, and they keep falling for "veneers" where the topic is known but the contents are not well documented.
LLMs know mostly English facts, and that means the shape of this data is sadly mainly English, mostly American, and academic. If the western academic sphere does not care, you better pray it is covered extensively by the general public outside of social media. Pop culture coverage is horrendous!
You're right not to trust LLMs to know things, but at some point reasoning involves understanding the topic. It is hard to research and work with material that one does not understand. LLMs need to understand what the user is after, what they need, what the topic requires, and how to approach it.
It has to consider techniques, references... If I asked a model to help me with... say... planning an event for my Minecraft server, which is Zelda themed, based on the Zelda game series. If the model has a poor understanding of Zelda and Minecraft, it will have to be force fed an ungodly amount of data to understand what Minecraft is and all kinds of Zelda references and info to understand how to assist me. The more it already knows, the less it has to rely completely on retrieval and grounding. It still needs it, but it does not have to start from scratch. It can go look up and verify what it knows rather than do the research from scratch.
If I ask a model to help me debug GLSL performance, then without any technical understanding already, it would have to look everywhere for data on how GPUs work, on how shaders work, driver quirks; it has to understand that GLSL is not like C or C++, it has to understand memory limitations. Or should it have to learn in context how computers work and how the language operates? The more it knows already, the easier it is for everyone.
@espen96 Thanks for the long and detailed response. I plan on reading it again when sitting at my desk.
I did notice that the latest Gemini and the now defunct GPT 4.5 get all my most difficult factual questions right without web search and thinking (didn't run the full test, but they would have gotten the easier ones right), plus they scored the highest by a large margin on the English & Chinese SimpleQA tests. They still get some very esoteric questions wrong, but relying on RAG for such questions is perfectly understandable.
But AI models shouldn't have to be so massive to correctly answer simple broadly known factual questions, plus the other massive models like Opus, Grok and ChatGPT (not 4.5) get some wrong.
I also noticed the older open source models like Llama 3 and the original Mistral Small can answer far more broad knowledge questions than similarly sized current models. So it's more than just a limitation of the architecture. There's been a huge regression.
You mentioned Qwen3.5, and while its broad skills (e.g. poem writing) and broad knowledge have improved with each release, they're still beaten by older models (e.g. Gemma 3 27b in poem writing, and even tiny little Llama 3.2 3b in broad knowledge). The original Qwen3 ~30b-3ba and ~27b dense only scored 42.3 vs 62.1/100 (Llama 3.2 3b) on my broad knowledge test and barely got a few points on the English SimpleQA. And even the massive 235b Qwen3 only scored ~10-12 on the English SimpleQA, compared to >20 for the much smaller Llama 3 70b. And now the latest Qwen3.5 35b-3ba & 27b dense score nearly 60 on my test, which is far better than the original 42.3, but still lower than tiny little Llama 3.2 3b. For comparison, the original and smaller 22b Mistral Small scored 75.4/100, and Llama 3 70b scored 88.5/100.
Point being, while I agree that the entirely probabilistic nature of the current AI models makes them perform horribly when it comes to factual accuracy, when they're massive enough and broadly trained (e.g. Gemini and GPT 4.5) they effectively overcome this weakness sans RAG. Additionally, the low factual accuracy of smaller AI models has gotten FAR worse since Llama 3, Mistral Small, and Gemma 3 27b were released, so while AI models have an innate factual accuracy weakness, it's being made far worse by extremely lopsided training (primarily increasing the time and tokens used for training coding and math by more than an order of magnitude). Responding to what is arguably the greatest weakness of current AI models (factual hallucinations) by making it far worse makes no sense to me. Precise tasks like coding and math require specialized AI models to perform reasonably well because a single mistake in the code block will prevent it from compiling, or in the case of math, will produce the wrong result. You simply can't train a general purpose AI model to be proficient at said high precision tasks because by the time you're done overtraining trillions of math and coding tokens the model's weights have been scrambled too much, greatly reducing its broad knowledge.
right. I totally get that. It is something I have been focusing on myself.
There is a concept called the "long tail", which is all the niche data the models fail on. I have this concept I call the "medium tail problem": it is the gap between what we expect the models to know and what they do know.
It is all the stuff that seems to be too easy or too important, surely the llm knows this? But it does not. They simply do not work that way.
Think of it like this. A model saw in total 15-30T tokens. About 5T of those were from the internet, post cleanup.
Most of it is fairly serious topics, academic or structured.
The rest is casual. If we say it's 50/50 serious and casual topics, then 2.5T tokens are dedicated to gossip, celebrities, trivia, sports, travel, games.... If we say that 50% of that again is structured well, that leaves 1.25T tokens that are hard to learn from.
And out of all of this, 54% is English; rounding to 50% again, we have 0.625T tokens in English, and the rest is about 2-5% for each major language.
The rest? The remaining 10-25T tokens? Math, physics, reasoning training, books, papers... STEM, science, code, mainly English, maybe dedicated language tokens to help make it all up.
That means for anything that is not "important" or structured... it is a tiny signal overall.
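To make the back-of-envelope explicit (the split percentages are my rough assumptions, not measured numbers):

```python
# Back-of-envelope token budget, using the rough splits assumed above.
web_tokens = 5.0e12            # ~5T web tokens after cleanup

casual = web_tokens * 0.5      # assume 50/50 serious vs. casual topics -> 2.5T
hard_to_learn = casual * 0.5   # assume half of the casual half is poorly structured -> 1.25T
english = hard_to_learn * 0.5  # ~54% English, rounded to 50% -> 0.625T

print(f"casual: {casual/1e12:.2f}T, poorly structured: {hard_to_learn/1e12:.2f}T, "
      f"English share of that: {english/1e12:.3f}T")
# -> casual: 2.50T, poorly structured: 1.25T, English share of that: 0.625T
```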
Smaller models and models in general are focusing hard on reasoning, math and all the things we are benchmarking. Smaller models actually struggle to generalize well; LLMs in general struggle to have the fidelity to contain all the data thrown at them, and yet it is not enough data. The model has to have trained well enough to never mistake a name for another, to never mistake the birthday of one person for another's, to never mix up lyrics. It is NOT a database, it is just patterns. So you are asking it to have the fidelity and depth to always find patterns that end up being correct, for so much. It is hard. And we are instead optimizing for solving problems over knowing the details.
It is like reading a dictionary and accurately answering what the third word of the third paragraph on page 245 is. These things see patterns, not individual pieces of data. It has to see these dictionaries enough to understand that it is likely one of these words, and then get even better to narrow it all down based on all surrounding details.
You are asking a pattern identification system that is gigabytes in size to always find the correct answer and never be wrong, when it has ingested terabytes of data, or petabytes, and you want to figure out something specific from that? It is.... an unfathomable request. But I agree they should do better than this. But it is also a bit unreasonable.
I think it requires a trillion-scale model to have a chance of being encyclopedic, but one then has to sacrifice most of its ability to perform tasks. It will be a model that knows the theory, but doesn't know how to do anything well.
Also, at this scale the priority game is not as easy to balance. You can make a 30B model that scores a lot better on factual recall, but what are you using it for? I personally wouldn't mind a chat/knowledge focused LLM. However, these are trying to be specialized, and they're still not good enough at coding and math. So they're trying to make the models better at the tasks we are currently focusing on, to get to a point where they are unbeatable "flash" models that can out-speed the big ones, but they have to lose fidelity in all other domains to be able to hopefully reach a useful level where they can do real work in those fields.
@espen96 But like I said, previous smaller models did a far better job retaining broad knowledge. So yes, LLMs have a crippling factual accuracy problem, but the "medium tail problem" is becoming far worse than it should be at all model sizes, not because of the innate accuracy limitation of LLMs, but because of our design choices and obsession with select domains like coding.
Also, the immense size of a web rip is VERY VERY deceiving. The unique information contained within it is orders of magnitude less. Web rips are filled with tons of non-factual and irrelevant opinions, countless thousands of repeats, most tokens don't hold unique information (e.g. sentence structure), and so on. Plus I'm not talking about things like fully retrieving song lyrics or book passages. I'm talking about correctly linking a popular singer to a hit song title so the LLM doesn't completely go off the rails. Once it gets a core fact completely wrong what follows is almost always a flood of nonsense.
In my opinion AI is dead in the water until we start making more well-rounded AI models, at least as gatekeepers to specialized AI models like coders, and those models somehow prioritize core irrefutable truths more than esoteric knowledge, opinions, and everything else. Perhaps by training a smaller model on key irrefutable dry non-esoteric facts, then freezing its layers before expanding and broadly training on web rips, books, etc. A lot of bad opinions (e.g. the world is flat) can still override the static weights to produce a next token that says the earth is flat, but at least the odds of this happening would be greatly reduced. Not sure if something like this is feasible, but it's essential that we find a way to make AI models be more respectful of core irrefutable truths when generating each token.
Right, but they are small. You're noticing they are not capable of both optimizing for high fidelity of this knowledge, which they were just about able to maintain, and the goal of coding, reasoning, math and STEM.
@phil111
They're too small to retain the latent manifolds required to access all of this properly, and also get better at coding and reasoning and so on. Something has to give. Nanbeige is an interesting case. It is 3B, scores at 30B level on some tasks, and even higher in others. They found a way to keep on improving it. It knows nothing.
You can have an agentic model that can do research, can solve math and code problems and all that.... or you can have a model that, like the older Mistral, Gemini and older Qwen models, knew more "trivia". It seems we can't have our cake and eat it too. I wish we could, but not at 30B. Larger models however? Yeah, they need to dedicate more to be well rounded.
For recognizing who wrote a song etc., it's a surprisingly hard task in general when it's not a training goal. Don't give up however; keep digging. You've noticed something I think is important. Start asking why it's like this, why LLMs are losing that knowledge, and why they can't do both.
Yep. That certainly is one option






