Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling
Abstract
We present memory-based language modeling as an efficient, eco-friendly alternative to deep neural network-based language modeling. It offers log-linearly scalable next-token prediction performance and strong memorization capabilities. Implementing fast approximations of k-nearest neighbor classification, memory-based language modeling leaves a relatively small ecological footprint both in training and in inference mode, as it relies fully on CPUs and attains low token latencies. Its internal workings are simple and fully transparent. We compare our implementation of memory-based language modeling, OLIFANT, with GPT-2 and GPT-Neo on next-token prediction accuracy, estimated emissions and speeds, and offer some deeper analyses of the model.
Community
Do eco-friendly alternatives to Transformer-based LLMs exist? Sure they do - the decades-long history of LMs is littered with ideas and innovations that do not rely on GPUs.
To demonstrate this point, we have been dusting off some old code that implements memory-based language models. We are releasing the code under the name Olifant (the Dutch word for elephant); check it out on GitHub: https://github.com/antalvdb/olifant
The graph below shows the estimated CO2 emissions (based on electricity usage monitored by CodeCarbon) of having our Olifant models predict the next tokens in a 500,000-token validation text. The graph includes emission estimates for the old GPT-2 models and the slightly newer GPT-Neo. The x-axis is logarithmic and shows the number of training tokens each model was trained on.
The graph also shows the amount of CO2 emitted by a washing machine run, a family car driving for 10 minutes, a tumble dryer run, producing 1 kg of steel, and producing 1 liter of cow's milk. Two of our models stay well below the washing machine run, while the larger GPT systems consume increasingly more electricity.
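For readers curious how such estimates can be collected, the sketch below wraps a toy prediction loop in CodeCarbon's EmissionsTracker. The dummy predictor and the tiny token list are placeholders for illustration only; this is not the Olifant evaluation code.

```python
# Minimal sketch: estimating CO2 emissions of a prediction loop with CodeCarbon.
# The dummy predictor and token list are placeholders, not the Olifant evaluation code.
from codecarbon import EmissionsTracker

def dummy_predict_next_token(context):
    """Placeholder predictor: just repeats the most recent token."""
    return context[-1] if context else "<unk>"

# Tiny stand-in for the 500,000-token validation text.
validation_tokens = "the cat sat on the mat".split()

tracker = EmissionsTracker(project_name="olifant-validation")
tracker.start()
try:
    predictions = [dummy_predict_next_token(validation_tokens[:i])
                   for i in range(1, len(validation_tokens))]
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent

print(f"{len(predictions)} predictions, ~{emissions_kg:.6f} kg CO2eq")
```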
We totally expected our third model (the blue line) to be wasteful; it implements naive k-nearest neighbor classification. The other two show really good efficiency because they rely on prefix tries, a classic computer-science optimization described by Don Knuth in 1973.
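To give a flavor of why a prefix trie makes memory-based prediction cheap, here is a minimal sketch of next-token prediction over fixed-length contexts stored in a trie. It illustrates the general data structure only; it is not the Olifant implementation, and the class and parameter names are invented for the example.

```python
# Minimal sketch of prefix-trie next-token prediction over fixed-length contexts.
# Illustrative only; not the Olifant implementation.
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}                    # context token -> child node
        self.next_counts = defaultdict(int)   # next token -> frequency

class PrefixTrieLM:
    def __init__(self, context_size=3):
        self.root = TrieNode()
        self.context_size = context_size

    def train(self, tokens):
        for i in range(len(tokens) - 1):
            context = tokens[max(0, i - self.context_size + 1):i + 1]
            node = self.root
            # Store contexts right-to-left so the most recent token is matched first.
            for tok in reversed(context):
                node = node.children.setdefault(tok, TrieNode())
                node.next_counts[tokens[i + 1]] += 1

    def predict(self, context):
        node, best = self.root, None
        # Walk down as far as the context matches; deeper nodes are more specific.
        for tok in reversed(context[-self.context_size:]):
            if tok not in node.children:
                break
            node = node.children[tok]
            best = node
        if best is None or not best.next_counts:
            return None
        return max(best.next_counts, key=best.next_counts.get)

lm = PrefixTrieLM(context_size=3)
lm.train("the cat sat on the mat and the cat sat down".split())
print(lm.predict("the cat".split()))  # -> 'sat'
```

Lookup cost depends only on the context length, not on the number of stored training instances, which is why the trie-based variants stay so cheap at inference time compared to naive k-nearest neighbor search.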
Note that this graph does not show the heavy energy consumption of GPT systems during training. Our models are estimated to consume about a factor of 1,000 less electricity during training. And they scale well: their predictions get log-linearly better with more data (as many LMs do).
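As a small illustration of what "log-linearly better" means, the sketch below fits accuracy as a straight line in the base-10 logarithm of the number of training tokens. All numbers here are synthetic, generated purely to demonstrate the fit; they are not results from the paper.

```python
# Sketch of a log-linear scaling fit: accuracy ~ a + b * log10(training tokens).
# The data is synthetic and for illustration only; not results from the paper.
import numpy as np

rng = np.random.default_rng(0)
train_tokens = np.array([1e6, 1e7, 1e8, 1e9])
true_a, true_b = 0.10, 0.05  # arbitrary illustrative coefficients
accuracy = true_a + true_b * np.log10(train_tokens) + rng.normal(0, 0.005, size=4)

# Fit a straight line in log-space: every 10x more training data adds roughly b to accuracy.
b, a = np.polyfit(np.log10(train_tokens), accuracy, deg=1)
print(f"fitted accuracy ~ {a:.3f} + {b:.3f} * log10(N)")
```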
Read about these models in the paper on arXiv and check out the software!