Add pipeline tag and library name #1
by nielsr (HF Staff) - opened

README.md CHANGED

@@ -1,9 +1,11 @@
 ---
-license: mit
 datasets:
 - Skylion007/openwebtext
+license: mit
 tags:
 - diffusion
+pipeline_tag: text-generation
+library_name: transformers
 ---

 # Generalized Interpolating Discrete Diffusion

@@ -36,7 +38,7 @@ Our trained checkpoints are available under the following links. All of them hav
 |-------|-------|------|
 | GIDD+ (p_u = 0.0) | [dvruette/gidd-small-p_unif-0.0](https://huggingface.co/dvruette/gidd-small-p_unif-0.0) | [dvruette/gidd-base-p_unif-0.0](https://huggingface.co/dvruette/gidd-base-p_unif-0.0) |
 | GIDD+ (p_u = 0.1) | [dvruette/gidd-small-p_unif-0.1](https://huggingface.co/dvruette/gidd-small-p_unif-0.1) | [dvruette/gidd-base-p_unif-0.1](https://huggingface.co/dvruette/gidd-base-p_unif-0.1) |
-| GIDD+ (p_u = 0.2) | dvruette/gidd-small-p_unif-0.2 | [dvruette/gidd-base-p_unif-0.2](https://huggingface.co/dvruette/gidd-base-p_unif-0.2) |
+| GIDD+ (p_u = 0.2) | [dvruette/gidd-small-p_unif-0.2](https://huggingface.co/dvruette/gidd-small-p_unif-0.2) | [dvruette/gidd-base-p_unif-0.2](https://huggingface.co/dvruette/gidd-base-p_unif-0.2) |


 ## Use the Model

@@ -63,3 +65,59 @@ corrected_texts = pipe.self_correction(texts, num_inference_steps=128, early_sto
 print(corrected_texts)
 ```
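A side note on the metadata added in the first hunk: `pipeline_tag: text-generation` files the checkpoints under the Hub's text-generation filter, and `library_name: transformers` tells the Hub to show a transformers loading snippet. The sketch below illustrates what that implies; the Auto classes and `trust_remote_code` usage are an assumption on my part, and the README's own "Use the Model" section remains the authoritative example.

```python
from transformers import AutoModel, AutoTokenizer

# Assumption: the GIDD checkpoints expose their custom architecture via
# trust_remote_code, as implied by `library_name: transformers`.
repo = "dvruette/gidd-base-p_unif-0.0"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
print(model.config)
```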

The rest of the third hunk (@@ -63,3 +65,59 @@) adds the following new section to the README:

## Reproducing Experiments

### Training

To reproduce the training runs from the paper, you can use the following commands.
In this example we train on a single node with 8 GPUs; adjust the `--nnodes` and `--nproc_per_node` arguments to match your setup (see the multi-node sketch after the command block below).
The checkpoints will be saved under `./outputs/{YYYY-MM-DD}/{HH-MM-SS}/checkpoints/` by default.

(Optional) Log into W&B with `wandb login` for experiment tracking, or disable it via `wandb disabled` if you don't need/want it.

```bash
# GIDD+ (p_u = 0.0)
torchrun --nnodes 1 --nproc_per_node 8 gidd/train.py --config-name gidd logging.run_name="'small-gidd+-owt-pu=0.0'"

# GIDD+ (p_u > 0.0)
torchrun --nnodes 1 --nproc_per_node 8 gidd/train.py --config-name gidd model.p_uniform=0.1 logging.run_name="'small-gidd+-owt-pu=0.1'"

# MDLM baseline
torchrun --nnodes 1 --nproc_per_node 8 gidd/train.py --config-name mdlm logging.run_name="'small-mdlm-owt'"

# AR baseline
torchrun --nnodes 1 --nproc_per_node 8 gidd/train.py --config-name ar logging.run_name="'small-ar-owt'"
```
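For multi-node training, the same entry point should work with torchrun's rendezvous flags. This is a sketch rather than a command from the paper: the endpoint and node count are placeholders for your own cluster.

```bash
# Hypothetical 2-node x 8-GPU launch; run on every node with its own --node_rank.
# $MASTER_ADDR and the port are placeholders for your cluster's head node.
torchrun --nnodes 2 --nproc_per_node 8 --node_rank 0 \
  --rdzv_backend c10d --rdzv_endpoint "$MASTER_ADDR:29500" \
  gidd/train.py --config-name gidd logging.run_name="'small-gidd+-owt-pu=0.0'"
```
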

### Evaluation

There are also a couple of scripts to run inference and evaluate the trained models.
Note that these scripts expect the checkpoint format saved by the training script, so the checkpoints from Hugging Face are not directly compatible.
You can download our original training checkpoints from here: https://polybox.ethz.ch/index.php/s/BbxZcYDSoXf8aL4

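If you prefer to fetch the archive from the command line, ownCloud-style shares usually expose a direct download endpoint. This is an assumption about the share above; fall back to downloading through the browser if it fails.

```bash
# Assumption: the polybox share supports the standard ownCloud "/download" endpoint.
wget "https://polybox.ethz.ch/index.php/s/BbxZcYDSoXf8aL4/download" -O gidd_checkpoints.zip
unzip gidd_checkpoints.zip -d ./checkpoints/
```
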
#### Generate samples

The following command will generate `num_samples=16` samples in `num_denoising_steps=128` denoising steps from the model checkpoint located at `path` and save them to `samples_path=samples.pt`.

```bash
python gidd/eval/generate_samples.py path=./outputs/path/to/checkpoint/ samples_path=samples.pt num_samples=16 num_denoising_steps=128 batch_size=16
```
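The exact structure of `samples.pt` is defined by `generate_samples.py`; assuming it is a regular `torch.save` file, you can peek at it like this before feeding it to the evaluation scripts.

```python
import torch

# weights_only=False because the file may contain arbitrary Python objects,
# not just tensors -- only do this for files you generated yourself.
samples = torch.load("samples.pt", map_location="cpu", weights_only=False)
print(type(samples), len(samples))
```
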

#### Generative PPL

Given a file containing samples generated with the `generate_samples.py` script, the following command will compute the generative PPL.
Here we assume that the diffusion model used to generate the samples located at `samples.pt` uses the `gpt2` tokenizer, and we compute generative PPL using `google/gemma-2-9b` as a reference model (note that `gemma-2-9b` requires you to log into your HF account using `huggingface-cli login`).
The results will be saved to `metrics_path=metrics.json`.

```bash
python gidd/eval/generative_ppl.py samples_path=samples.pt model_tokenizer=gpt2 pretrained_model=google/gemma-2-9b batch_size=4 metrics_path=metrics.json
```
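For intuition, generative PPL scores the generated texts under a strong reference model: decode the samples with the generator's tokenizer, then measure the reference model's perplexity on those texts. The sketch below is illustrative only, not the repository's script, and assumes `samples.pt` holds gpt2 token-id sequences.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: samples.pt contains a tensor/list of gpt2 token-id sequences.
gen_tok = AutoTokenizer.from_pretrained("gpt2")
ref_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
ref_lm = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b", torch_dtype=torch.bfloat16, device_map="auto"
)

samples = torch.load("samples.pt", map_location="cpu", weights_only=False)
texts = [gen_tok.decode(ids, skip_special_tokens=True) for ids in samples]

total_nll, total_tokens = 0.0, 0
for text in texts:
    enc = ref_tok(text, return_tensors="pt").to(ref_lm.device)
    with torch.no_grad():
        loss = ref_lm(**enc, labels=enc["input_ids"]).loss  # mean NLL per predicted token
    n_pred = enc["input_ids"].size(1) - 1  # a causal LM predicts all but the first token
    total_nll += loss.item() * n_pred
    total_tokens += n_pred

print("Generative PPL:", torch.exp(torch.tensor(total_nll / total_tokens)).item())
```
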

#### Validation loss

A simple helper script to compute the loss of a trained model on the entire validation split.

```bash
python gidd/eval/loss.py path=./outputs/path/to/checkpoint/ batch_size=32
```

#### Self-correction

This script will run the self-correction step on the samples contained in `samples.pt` (e.g. generated with the `generate_samples.py` script) and save the corrected samples to `corrected_samples.pt`.
The `temp` argument controls the temperature used when resampling tokens from the model (see paper for more details).

```bash
python gidd/eval/self_correction.py path=./outputs/path/to/checkpoint/ samples_path=samples.pt corrected_samples_path=corrected_samples.pt batch_size=16 num_denoising_steps=128 temp=0.1
```
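
As a rough intuition for `temp`: the lower the temperature, the more the resampling distribution concentrates on the model's top candidates. A small, self-contained illustration of temperature sampling (not taken from the repository):

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temp: float) -> torch.Tensor:
    """Sample one token id per position from (..., vocab_size) logits."""
    probs = torch.softmax(logits / temp, dim=-1)
    flat = probs.reshape(-1, probs.size(-1))
    return torch.multinomial(flat, num_samples=1).reshape(probs.shape[:-1])

logits = torch.randn(2, 8, 50257)              # dummy (batch, seq, vocab) logits
tokens = sample_with_temperature(logits, 0.1)  # low temp ~ near-greedy resampling
print(tokens.shape)                            # torch.Size([2, 8])
```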