fine-tuning a 14B model with TRL + SFT on a free Colab (T4 GPU)? thanks to the latest TRL optimizations, you actually can! sharing a new notebook showing how to do it
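A minimal sketch of what such a run looks like. The model id, dataset, and hyperparameters below are illustrative assumptions, not the notebook's exact values; the key idea is combining LoRA with memory-saving trainer settings so a large model fits on a 16 GB T4.

```python
# Hedged sketch: SFT fine-tuning with TRL on a single T4.
# Model, dataset, and hyperparameters are illustrative assumptions.

# Memory-saving settings that help a large model fit on a 16 GB T4.
train_kwargs = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    gradient_checkpointing=True,     # recompute activations to save memory
    fp16=True,                       # T4 supports fp16, not bf16
)

def main():
    # Heavy imports live here so the sketch can be inspected without TRL installed.
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-14B-Instruct",  # assumed: any ~14B causal LM id
        train_dataset=load_dataset("trl-lib/Capybara", split="train"),
        args=SFTConfig(output_dir="sft-14b", **train_kwargs),
        # LoRA keeps trainable parameters (and optimizer state) tiny.
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    )
    trainer.train()

if __name__ == "__main__":
    main()
```

Gradient checkpointing plus LoRA is what makes the memory budget work: only the small adapter weights need optimizer state, and activations are recomputed instead of stored.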
For the past year I have been trying to get diffusion models to work for language generation, without having to retrain an LLM from scratch. And recently, we finally succeeded:
We introduce "LAD: LoRA-Adapted Denoiser", a method to convert a LLaMA model into a text diffusion model using LoRA finetuning and structured input corruption.
Unlike autoregressive (word-for-word) models like ChatGPT, diffusion models iteratively refine a noised sequence. However, most current diffusion approaches rely on all-parameter retraining and repeatedly remasking tokens, which is costly and slow during both training and inference!
With LAD:
- We can finetune an autoregressive model for diffusive generation in just 10 hours on a single GPU.
- Test-time compute is fully adjustable: fewer steps mean faster outputs, more steps improve output quality.
- Due to our unique noising schedule, remasking is not always needed during inference: all tokens are attended to in each iteration!
LAD is built using:
- A frozen LLaMA-8B backbone
- Structured noising: token swaps, duplications, replacements, span shifts
- Modified attention masks for bidirectional decoding
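The structured-noising operations in that list can each be sketched in a few lines. This is a toy illustration on word tokens; LAD's actual corruption operates on model token ids under a noising schedule, so names and details here are assumptions.

```python
import random

def swap_tokens(tokens, rng):
    # Swap two random positions.
    t = list(tokens)
    i, j = rng.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t

def duplicate_token(tokens, rng):
    # Insert a copy of a random token right after itself.
    i = rng.randrange(len(tokens))
    return tokens[:i] + [tokens[i]] + tokens[i:]

def replace_token(tokens, rng, vocab):
    # Overwrite a random position with a random vocabulary item.
    t = list(tokens)
    t[rng.randrange(len(t))] = rng.choice(vocab)
    return t

def shift_span(tokens, rng, span=2):
    # Cut out a short span and reinsert it at a random position.
    i = rng.randrange(len(tokens) - span)
    cut = tokens[i:i + span]
    rest = tokens[:i] + tokens[i + span:]
    j = rng.randrange(len(rest) + 1)
    return rest[:j] + cut + rest[j:]

rng = random.Random(0)
toks = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = shift_span(replace_token(swap_tokens(toks, rng), rng, ["dog"]), rng)
```

The denoiser is then trained to map such corrupted sequences back to the original text.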
We show that even small, fast-trained models can perform diffusive generation, with competitive benchmark performance and perplexity, and more flexible test-time behavior than traditional autoregressive transformers.
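"Modified attention masks for bidirectional decoding" just means dropping the causal constraint: instead of each position seeing only its predecessors, every position sees every other. A minimal illustration (boolean masks, not the model's actual tensors):

```python
def causal_mask(n):
    # Autoregressive decoding: position i attends only to positions <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Diffusion-style decoding: every token attends to every token.
    return [[True] * n for _ in range(n)]

n = 4
causal = causal_mask(n)
full = bidirectional_mask(n)
assert causal[0][3] is False  # first token can't see the last token...
assert full[0][3] is True     # ...unless the mask is bidirectional
```

This is why all tokens can be refined in each iteration: the denoiser gets full context in both directions at every step.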
Thanks! I'm trying to get it some attention, as I think the leap from pretraining diffusion models (10,000+ hours) to mere finetuning (10 hours) for adaptation is a big one, and could really help this method gain some traction.