Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling
Abstract
We present memory-based language modeling as an efficient, eco-friendly alternative to deep neural network-based language modeling. It offers log-linearly scalable next-token prediction performance and strong memorization capabilities. Implementing fast approximations of k-nearest neighbor classification, memory-based language modeling leaves a relatively small ecological footprint both in training and in inference mode, as it relies fully on CPUs and attains low token latencies. Its internal workings are simple and fully transparent. We compare our implementation of memory-based language modeling, OLIFANT, with GPT-2 and GPT-Neo on next-token prediction accuracy, estimated emissions and speeds, and offer some deeper analyses of the model.
Community
Do eco-friendly alternatives to Transformer-based LLMs exist? Sure they do - the decades-long history of LMs is littered with ideas and innovations that do not rely on GPUs.
To demonstrate this point, we have been dusting off some old code that implements memory-based language models. We are releasing the code under the name Olifant (the Dutch word for elephant); check it out on GitHub: https://github.com/antalvdb/olifant
The graph below shows the estimated CO2 emissions (based on electricity usage monitored by CodeCarbon) of having our Olifant models predict the next tokens in a 500,000-token validation text. The graph includes emission estimates for the old GPT-2 models and the slightly newer GPT-Neo. The x-axis is logarithmic and shows the number of training tokens each model was trained on.
The graph also shows the amount of CO2 emitted by a washing machine run, a family car driving for 10 minutes, a tumble dryer run, producing 1 kg of steel, and producing 1 liter of cow's milk. Two of our models stay well below the washing machine run, while the larger GPT systems consume increasingly more electricity.
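For readers curious how such estimates can be collected, the sketch below wraps a toy prediction loop in CodeCarbon's EmissionsTracker. The dummy predictor and the tiny token list are placeholders for illustration only; this is not the Olifant evaluation code.

```python
# Minimal sketch: estimating CO2 emissions of a prediction loop with CodeCarbon.
# The dummy predictor and token list are placeholders, not the Olifant evaluation code.
from codecarbon import EmissionsTracker

def dummy_predict_next_token(context):
    """Placeholder predictor: just repeats the most recent token."""
    return context[-1] if context else "<unk>"

# Tiny stand-in for the 500,000-token validation text.
validation_tokens = "the cat sat on the mat".split()

tracker = EmissionsTracker(project_name="olifant-validation")
tracker.start()
try:
    predictions = [dummy_predict_next_token(validation_tokens[:i])
                   for i in range(1, len(validation_tokens))]
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent

print(f"{len(predictions)} predictions, ~{emissions_kg:.6f} kg CO2eq")
```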
We totally expected our third model (the blue line) to be wasteful; it implements naive k-nearest neighbor classification. The other two show really good efficiency because they rely on prefix tries, a classic computer-science optimization described by Don Knuth in 1973.
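To give a flavor of why a prefix trie makes memory-based prediction cheap, here is a minimal sketch of next-token prediction over fixed-length contexts stored in a trie. It illustrates the general data structure only; it is not the Olifant implementation, and the class and parameter names are invented for the example.

```python
# Minimal sketch of prefix-trie next-token prediction over fixed-length contexts.
# Illustrative only; not the Olifant implementation.
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}                    # context token -> child node
        self.next_counts = defaultdict(int)   # next token -> frequency

class PrefixTrieLM:
    def __init__(self, context_size=3):
        self.root = TrieNode()
        self.context_size = context_size

    def train(self, tokens):
        for i in range(len(tokens) - 1):
            context = tokens[max(0, i - self.context_size + 1):i + 1]
            node = self.root
            # Store contexts right-to-left so the most recent token is matched first.
            for tok in reversed(context):
                node = node.children.setdefault(tok, TrieNode())
                node.next_counts[tokens[i + 1]] += 1

    def predict(self, context):
        node, best = self.root, None
        # Walk down as far as the context matches; deeper nodes are more specific.
        for tok in reversed(context[-self.context_size:]):
            if tok not in node.children:
                break
            node = node.children[tok]
            best = node
        if best is None or not best.next_counts:
            return None
        return max(best.next_counts, key=best.next_counts.get)

lm = PrefixTrieLM(context_size=3)
lm.train("the cat sat on the mat and the cat sat down".split())
print(lm.predict("the cat".split()))  # -> 'sat'
```

Lookup cost depends only on the context length, not on the number of stored training instances, which is why the trie-based variants stay so cheap at inference time compared to naive k-nearest neighbor search.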
Note that this graph does not show the heavy energy consumption of GPT systems during training. Our models are estimated to consume about a factor of 1,000 less electricity during training. And they scale well: their predictions get log-linearly better with more data (as many LMs do).
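As a small illustration of what "log-linearly better" means, the sketch below fits accuracy as a straight line in the base-10 logarithm of the number of training tokens. All numbers here are synthetic, generated purely to demonstrate the fit; they are not results from the paper.

```python
# Sketch of a log-linear scaling fit: accuracy ~ a + b * log10(training tokens).
# The data is synthetic and for illustration only; not results from the paper.
import numpy as np

rng = np.random.default_rng(0)
train_tokens = np.array([1e6, 1e7, 1e8, 1e9])
true_a, true_b = 0.10, 0.05  # arbitrary illustrative coefficients
accuracy = true_a + true_b * np.log10(train_tokens) + rng.normal(0, 0.005, size=4)

# Fit a straight line in log-space: every 10x more training data adds roughly b to accuracy.
b, a = np.polyfit(np.log10(train_tokens), accuracy, deg=1)
print(f"fitted accuracy ~ {a:.3f} + {b:.3f} * log10(N)")
```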
Read about these models in the paper on arXiv and check out the software!