fine-tuning a 14B model with TRL + SFT on a free Colab (T4 GPU)? thanks to the latest TRL optimizations, you actually can! sharing a new notebook showing how to do it
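A minimal sketch of what such a run looks like. The model id, dataset, and hyperparameters below are illustrative assumptions, not the notebook's exact values; the key idea is combining LoRA with memory-saving trainer settings so a large model fits on a 16 GB T4.

```python
# Hedged sketch: SFT fine-tuning with TRL on a single T4.
# Model, dataset, and hyperparameters are illustrative assumptions.

# Memory-saving settings that help a large model fit on a 16 GB T4.
train_kwargs = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    gradient_checkpointing=True,     # recompute activations to save memory
    fp16=True,                       # T4 supports fp16, not bf16
)

def main():
    # Heavy imports live here so the sketch can be inspected without TRL installed.
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-14B-Instruct",  # assumed: any ~14B causal LM id
        train_dataset=load_dataset("trl-lib/Capybara", split="train"),
        args=SFTConfig(output_dir="sft-14b", **train_kwargs),
        # LoRA keeps trainable parameters (and optimizer state) tiny.
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    )
    trainer.train()

if __name__ == "__main__":
    main()
```

Gradient checkpointing plus LoRA is what makes the memory budget work: only the small adapter weights need optimizer state, and activations are recomputed instead of stored.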
For the past year I have been trying to get diffusion models to work for language generation, without having to retrain an LLM from scratch. And recently, we finally succeeded:
We introduce "LAD: LoRA-Adapted Denoiser", a method to convert a LLaMA model into a text diffusion model using LoRA finetuning and structured input corruption.
Unlike autoregressive (word-for-word) models like ChatGPT, diffusion models iteratively refine a noised sequence. However, most current diffusion approaches rely on all-parameter retraining and repeatedly remasking tokens, which is costly and slow during both training and inference!
With LAD:
- We can finetune an autoregressive model for diffusive generation in just 10 hours on a single GPU.
- Test-time compute is fully adjustable: fewer steps mean faster outputs, more steps improve output quality.
- Due to our unique noising schedule, remasking is not always needed during inference: all tokens are attended to in each iteration!
LAD is built using:
- A frozen LLaMA-8B backbone
- Structured noising: token swaps, duplications, replacements, span shifts
- Modified attention masks for bidirectional decoding
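The structured-noising operations in that list can each be sketched in a few lines. This is a toy illustration on word tokens; LAD's actual corruption operates on model token ids under a noising schedule, so names and details here are assumptions.

```python
import random

def swap_tokens(tokens, rng):
    # Swap two random positions.
    t = list(tokens)
    i, j = rng.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t

def duplicate_token(tokens, rng):
    # Insert a copy of a random token right after itself.
    i = rng.randrange(len(tokens))
    return tokens[:i] + [tokens[i]] + tokens[i:]

def replace_token(tokens, rng, vocab):
    # Overwrite a random position with a random vocabulary item.
    t = list(tokens)
    t[rng.randrange(len(t))] = rng.choice(vocab)
    return t

def shift_span(tokens, rng, span=2):
    # Cut out a short span and reinsert it at a random position.
    i = rng.randrange(len(tokens) - span)
    cut = tokens[i:i + span]
    rest = tokens[:i] + tokens[i + span:]
    j = rng.randrange(len(rest) + 1)
    return rest[:j] + cut + rest[j:]

rng = random.Random(0)
toks = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = shift_span(replace_token(swap_tokens(toks, rng), rng, ["dog"]), rng)
```

The denoiser is then trained to map such corrupted sequences back to the original text.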
We show that even small, fast-trained models can perform diffusive generation, with competitive benchmark performance and perplexity, and more flexible test-time behavior than traditional autoregressive transformers.
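"Modified attention masks for bidirectional decoding" just means dropping the causal constraint: instead of each position seeing only its predecessors, every position sees every other. A minimal illustration (boolean masks, not the model's actual tensors):

```python
def causal_mask(n):
    # Autoregressive decoding: position i attends only to positions <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Diffusion-style decoding: every token attends to every token.
    return [[True] * n for _ in range(n)]

n = 4
causal = causal_mask(n)
full = bidirectional_mask(n)
assert causal[0][3] is False  # first token can't see the last token...
assert full[0][3] is True     # ...unless the mask is bidirectional
```

This is why all tokens can be refined in each iteration: the denoiser gets full context in both directions at every step.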
Thanks! I'm trying to get it some attention, as I think the leap from pretraining diffusion models (10,000+ hours) to mere finetuning (10 hours) for adaptation is a big one, and could really help this method gain some traction.