Update README.md (#2)
- Update README.md (789a1c780444af9a54cec6f3e3ac0e1e4cfb982d)
Co-authored-by: Arthur Zucker <[email protected]>
README.md
CHANGED
@@ -39,6 +39,10 @@ widget:
      It's not certain how many lessons you'll learn by your thirties. Does the
      premise entail the hypothesis?
    example_title: Premise and hypothesis
  - text: >-
      Answer the following question by reasoning step by step.
      The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?
    example_title: Chain of thought
tags:
- text2text-generation
datasets:

@@ -56,17 +60,21 @@ datasets:
license: apache-2.0
---

# TL;DR Flan-UL2
Flan-UL2 is an encoder-decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released earlier last year. It was fine-tuned using the "Flan" prompt tuning and dataset collection.

According to the original [blog](), here are the notable improvements:
- The original UL2 model was only trained with a receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
- The Flan-UL2 checkpoint uses a receptive field of 2048, which makes it more usable for few-shot in-context learning.
- The original UL2 model also had mode switch tokens that were rather mandatory to get good performance. However, they were a little cumbersome, as this often requires some changes during inference or finetuning. In this update, we continue training UL2 20B for an additional 100k steps (with a small batch) to forget the “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.

## Converting from T5X to Hugging Face
You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict = False`. The final layer norm is missing from the original state dictionary, which is why we pass the `strict=False` argument.
```bash
python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path ~/code/ul2/flan-ul220b-v3/ --config_file config.json --pytorch_dump_path ~/code/ul2/flan-ul2
```
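As a quick sanity check after running the script, the converted weights saved at `--pytorch_dump_path` should load back with the standard Transformers API. The snippet below is only a sketch; the local path simply mirrors the command above and is illustrative:

```python
import os

from transformers import T5ForConditionalGeneration

# Illustrative path: the --pytorch_dump_path used in the conversion command above.
dump_path = os.path.expanduser("~/code/ul2/flan-ul2")

# Loading fails loudly if the converted weights do not match the T5 architecture
# described in config.json, which is a cheap way to validate the conversion.
model = T5ForConditionalGeneration.from_pretrained(dump_path)
print(f"Loaded checkpoint with {model.num_parameters() / 1e9:.1f}B parameters")
```
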
## Performance improvement

The reported results are the following:
| | MMLU | BBH | MMLU-CoT | BBH-CoT | Avg |
@@ -76,8 +84,26 @@ The reported results are the following :
| FLAN-T5-XXL 11B | 55.1 | 45.3 | 48.6 | 41.4 | 47.6 |
| FLAN-UL2 20B | 55.7(+1.1%) | 45.9(+1.3%) | 52.2(+7.4%) | 42.7(+3.1%) | 49.1(+3.2%) |

# Using the model

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# device_map="auto" requires the `accelerate` package; load_in_8bit=True requires `bitsandbytes`.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?"

inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=200)

print(tokenizer.decode(outputs[0]))
# <pad> They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.</s>
```
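If 8-bit loading is not an option (it needs `bitsandbytes`), a half-precision variant along the following lines should also work; `torch_dtype=torch.bfloat16` is a standard `from_pretrained` argument, but the hardware assumption (enough GPU memory for the ~20B parameters in bfloat16) is ours, not something stated in this card:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Assumes a GPU setup with enough memory to hold the ~20B parameters in bfloat16.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2", device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
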
# Introduction to UL2

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
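To make the mode-switching idea concrete, here is a small illustrative sketch. The mode-token strings and the helper below are assumptions based on the original UL2 release, not part of this card, and Flan-UL2 itself no longer needs any of this:

```python
# Illustrative sketch of UL2-style mode switching (not needed for Flan-UL2).
# The mode-token strings below are an assumption based on the original UL2 release.
UL2_MODE_TOKENS = ("[NLU]", "[NLG]", "[S2S]")

def to_ul2_prompt(text: str, mode_token: str = "[S2S]") -> str:
    """Prepend a paradigm/mode token, as the original UL2 checkpoint expected."""
    assert mode_token in UL2_MODE_TOKENS
    return f"{mode_token} {text}"

question = "Answer the following question: where is the Eiffel Tower located?"
print(to_ul2_prompt(question))  # original UL2-style input: "[S2S] Answer the following question: ..."
print(question)                 # Flan-UL2 takes the bare instruction, no mode token
```
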
@@ -95,9 +121,12 @@ Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal

# Training

## Flan UL2
The Flan-UL2 model was initialized using the `UL2` checkpoints, and was then trained additionally using Flan Prompting. This means that the original training corpus is `C4`.
In “Scaling Instruction-Finetuned Language Models” (Chung et al.), also sometimes referred to as the Flan2 paper, the key idea is to train a large language model on a collection of datasets. These datasets are phrased as instructions, which enables generalization across diverse tasks. Flan has been primarily trained on academic tasks. In Flan2, we released a series of T5 models ranging from 200M to 11B parameters that have been instruction tuned with Flan.

The Flan datasets have also been open sourced in “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning” (Longpre et al.). See the Google AI blog post: “The Flan Collection: Advancing Open Source Methods for Instruction Tuning”.
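For intuition, instruction tuning simply rephrases supervised examples as natural-language instructions paired with target answers. The record below is a toy illustration built from the chain-of-thought prompt and output shown elsewhere in this card, not an actual entry from the Flan collection:

```python
# Toy illustration of an instruction-phrased (input, target) pair; not a real Flan record.
flan_style_example = {
    "inputs": (
        "Answer the following question by reasoning step by step. "
        "The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?"
    ),
    "targets": "They have 23 - 20 = 3 apples left. They have 3 + 6 = 9 apples. Therefore, the answer is 9.",
}
print(flan_style_example["inputs"])
print(flan_style_example["targets"])
```
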
## UL2 PreTraining

@@ -113,7 +142,7 @@ UL-20B was trained using the [Jax](https://github.com/google/jax) and [T5X](http

The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:
### Mixture of Denoisers

To quote the paper:
> We conjecture that a strong universal model has to be exposed to solving diverse set of problems
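As a rough, framework-free sketch of the span-corruption denoising these strategies build on (a toy illustration only; the corruption rates and span lengths of the actual R-, S- and X-denoisers follow the paper and are not reproduced here):

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, span_len=3, seed=0):
    """Toy span-corruption denoiser: mask random spans with sentinel tokens and
    build the corresponding target sequence, T5/UL2 style. Illustrative only."""
    rng = random.Random(seed)
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if rng.random() < corruption_rate:
            span = min(span_len, len(tokens) - i)
            inputs.append(f"<extra_id_{sentinel}>")      # masked span in the input
            targets.append(f"<extra_id_{sentinel}>")     # sentinel followed by the original span
            targets.extend(tokens[i:i + span])
            sentinel += 1
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

corrupted, target = span_corrupt("Flan UL2 is an encoder decoder model based on the T5 architecture".split())
print(corrupted)
print(target)
```
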
@@ -164,7 +193,7 @@ In total, the model was trained for 2.65 million steps.

## Contribution

This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) & [Arthur Zucker](https://huggingface.co/ArthurZ).

## Examples