Abstract
Text-to-text Self-conditioned Simplex Diffusion (TESS) achieves strong performance on natural language tasks by using a fully non-autoregressive approach that applies diffusion in logit simplex space.
Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various domains with continuous-valued inputs. Despite the promises of fully non-autoregressive text generation, applying diffusion models to natural language remains challenging due to its discrete nature. In this work, we propose Text-to-text Self-conditioned Simplex Diffusion (TESS), a text diffusion model that is fully non-autoregressive, employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the typical learned embedding space. Through extensive experiments on natural language understanding and generation tasks including summarization, text simplification, paraphrase generation, and question generation, we demonstrate that TESS outperforms state-of-the-art non-autoregressive models and is competitive with pretrained autoregressive sequence-to-sequence models.
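For readers unfamiliar with simplex-based diffusion, here is a minimal sketch of the general idea the abstract describes: each token is mapped to a near-one-hot point in logit space, Gaussian noise is added in the forward process, and the denoiser is conditioned on the softmax of the noisy logits (a soft mixture over the vocabulary). The value of k, the noise schedule, and the toy token ids below are illustrative assumptions, not the paper's exact settings.

```python
import math
import torch
import torch.nn.functional as F

def tokens_to_simplex_logits(token_ids: torch.Tensor, vocab_size: int, k: float = 5.0) -> torch.Tensor:
    """Map each token id to an 'almost one-hot' point in logit space:
    +k at the true token and -k everywhere else."""
    one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
    return k * (2.0 * one_hot - 1.0)

def add_noise(clean_logits: torch.Tensor, t: int, num_steps: int, k: float = 5.0) -> torch.Tensor:
    """Corrupt the clean logits with Gaussian noise; the cosine schedule and
    noise scaling here are illustrative choices, not the paper's exact ones."""
    alpha_bar = torch.cos(torch.tensor(t / num_steps) * math.pi / 2) ** 2
    noise = torch.randn_like(clean_logits)
    return torch.sqrt(alpha_bar) * clean_logits + torch.sqrt(1.0 - alpha_bar) * k * noise

# The denoiser is fed soft token mixtures (softmax over the noisy logits) and
# predicts clean logits; with self-conditioning it also sees its own previous prediction.
token_ids = torch.tensor([[101, 2023, 2003, 102]])  # toy ids, batch of 1
noisy = add_noise(tokens_to_simplex_logits(token_ids, vocab_size=30522), t=800, num_steps=1000)
soft_inputs = F.softmax(noisy, dim=-1)  # "blurred" tokens on the probability simplex
```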
Community
I was looking into the possibility of blurring tokens/probabilities rather than fully masking tokens like the recent text diffusion model that was released. I found your paper through a random search for "text to text diffusion" and it's extremely similar to what I was envisioning. Changing an LLM to map whole inputs to whole outputs instead of doing next-token prediction is very appealing as a developer. I'd expect the speed increase to be immense, since we can add conditioning to keep past conversations in context without having to feed the entire context into the model every time. It can also refine the entire response in parallel and "reason" in latent space, because it has the entire context represented as output instead of only the next token.
I'm just a self-taught hobbyist, but I'd love to see your code to experiment with. I'm also curious about your training process. I assume it would need to be trained on question-and-answer pairs? I didn't see many details in the paper. I'd also like to know whether you ran into any challenges with this method. The only issue I see is that you need to wait for the entire text output to be generated. It seems like it would be cheaper to train than existing LLMs as well.
Why has no company jumped on this yet? I must be missing something. Thanks for your work! I think you are on to something here.
Thanks for your interest in our paper!
I assumed it would need to be trained on question and answer pairs?
We train on (a) a general pretraining dataset, which consists of lots of text scraped from the web (https://huggingface.co/datasets/allenai/dolma), and then (b) instruction-tuning data, which consists of user questions and answers (https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture). This is pretty standard for training LMs. Our codebase has commands for the training.
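If you just want to poke at the data itself, something like the snippet below pulls both datasets with the `datasets` library; the config and field names here are what I'd expect from the hub cards, so double-check against the training commands in our codebase.

```python
from datasets import load_dataset

# Continued-pretraining corpus: web-scale text, streamed because it is huge.
# (The config/version name may need adjusting depending on the Dolma release.)
pretrain = load_dataset("allenai/dolma", split="train", streaming=True)

# Instruction-tuning mixture: chat-style question/answer conversations.
sft = load_dataset("allenai/tulu-v2-sft-mixture", split="train")

for example in pretrain.take(1):
    print(example.keys())      # plain-text records used for the diffusion objective

print(sft[0]["messages"][:2])  # user/assistant turns used for instruction tuning
```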
I'd also like to know if you ran into any challenges with this method?
Yeah! We initially wanted to just run diffusion adaptation and instruction tuning at the same time, but it turns out you really need to continue pretraining on high-quality pretraining data with the diffusion objective for a while to improve the diffusion model quality. This is costly and took a week or two to run on the compute we had available. We also had to be careful about the choice of base model to adapt, as we note in the paper. I don't think this is cheaper to train than current LMs; on the contrary, the diffusion objective means that we always train on max-length samples, which can be a bit costly.
Why has no company jumped on this yet?
Actually, a week after we posted this, a company announced their own diffusion LM offering: https://www.inceptionlabs.ai/. They claim much faster generation than AR LMs, which is cool. I also think there is still a performance gap between AR and diffusion models (which we observed), so further research to close this gap is needed to make these models more popular.
I read some articles saying that the inceptionlabs.ai model fully masks tokens rather than adding noise on the probability simplex as your paper does. This possible shortcoming is how I found your paper. I emailed them asking for details.
I tested Mercury-Coder and it seems to have issues with local minima causing repeated words and missing punctuation as it diffuses itself into a corner. Did you see this same issue with your method?
It's definitely the fastest LLM I've ever seen. Since current models don't scale well with compute, perhaps this approach could leverage it for a massive performance advantage? Thanks for sharing your insight!