BART-large-quotes
BART [1] fine-tuned for extractive summarization on a dataset of movie and book quotes. Fine-tuning was continued from the BART-large-cnn checkpoint (facebook/bart-large-cnn), which was itself fine-tuned on the CNN/Daily Mail dataset, whose summaries are more extractive than abstractive.
Compare: the smaller model BART-base-quotes achieves slightly lower ROUGE scores but favors shorter quotes (roughly 1/4 of the length on average).
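A minimal inference sketch (assumptions: the checkpoint is loaded from the Hugging Face Hub as ChrisBridges/bart-large-quotes and used through the standard transformers summarization pipeline; the passage below is a placeholder):

```python
# Minimal usage sketch: treat quote extraction as summarization.
from transformers import pipeline

quote_extractor = pipeline("summarization", model="ChrisBridges/bart-large-quotes")

# Placeholder passage; in practice, pass the sentences of context
# surrounding a potentially memorable line.
context = (
    "He stood at the door for a long time before speaking. "
    "Then he said that the only thing we have to fear is fear itself. "
    "Nobody in the room moved."
)

result = quote_extractor(
    context,
    max_length=128,   # matches the training max output length
    truncation=True,  # inputs beyond 1024 tokens are truncated
)
print(result[0]["summary_text"])
```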
Training Description
Dataset
The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset [2] and 5015 book quotes from the T50 dataset [3]. As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences on each side, while 10 sentences are used for book quotes. Training/development/test splits with proportions 7:1:2 were created with stratified sampling. The tables below report the sample sizes in each split and the length statistics of the contexts and quotes.
Split | Total | Movie | Book |
---|---|---|---|
Train | 7906 | 4396 | 3510 |
Dev | 1130 | 628 | 502 |
Test | 2259 | 1256 | 1003 |
Data | min | median | max | mean ± std |
---|---|---|---|---|
Movie Context | 38 | 148 | 3358 | 167.13 ± 102.57 |
Movie Quote | 5 | 20 | 592 | 28.22 ± 27.79 |
T50 Context | 86 | 628 | 6497 | 659.14 ± 310.49 |
T50 Quote | 6 | 41 | 877 | 61.87 ± 63.89 |
Total Context | 38 | 246 | 6497 | 385.58 ± 329.26 |
Total Quote | 5 | 26 | 877 | 43.16 ± 50.21 |
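The exact splitting code and random seed were not published; the sketch below illustrates one way to produce a stratified 7:1:2 split of this kind with scikit-learn, using hypothetical column names and toy data:

```python
# Illustrative stratified 7:1:2 train/dev/test split over the combined quotes.
# Column names ("context", "quote", "source") and the toy data are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "context": [f"context {i}" for i in range(100)],
    "quote":   [f"quote {i}" for i in range(100)],
    "source":  ["movie"] * 56 + ["book"] * 44,  # roughly the real movie/book ratio
})

# 20% test first, then 12.5% of the remainder as dev (10% overall), both
# stratified on the quote source so each split keeps the movie/book ratio.
train_dev, test = train_test_split(df, test_size=0.20,
                                   stratify=df["source"], random_state=42)
train, dev = train_test_split(train_dev, test_size=0.125,
                              stratify=train_dev["source"], random_state=42)
print(len(train), len(dev), len(test))  # 70 10 20
```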
Parameters
Each experiment uses a maximum input length of 1024 tokens and a maximum output length of 128 tokens to account for the short average length of quotes. Although quote length varies considerably, short, poignant statements are of most interest.
Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with a weight decay of 0.01 and a linearly annealed learning rate peaking at 5e-5. The first 5% of steps, i.e., 1.5 epochs, are used for linear warmup. The model is evaluated every 500 steps on ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum. After training, the checkpoint with the best eval_rougeL is loaded in order to prefer extractive over abstractive summarization. FP16 mixed precision is used.
In addition, T5-base [4] is evaluated. It is trained with a batch size of 8 (29670 steps) due to its larger memory footprint, and with a peak learning rate of 3e-4.
The learning rates were chosen empirically on shorter training runs of 5 epochs.
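The hyperparameters above roughly correspond to the following Seq2SeqTrainingArguments; this is a reconstruction from the description, not the exact training script:

```python
# Reconstruction of the BART training configuration described above
# (Hugging Face Trainer API); not the exact script used for the released model.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-large-quotes",
    per_device_train_batch_size=32,       # 8 for the T5-base run
    num_train_epochs=30,                  # 7440 steps for the BART runs
    learning_rate=5e-5,                   # 3e-4 for the T5-base run
    weight_decay=0.01,                    # AdamW
    lr_scheduler_type="linear",
    warmup_ratio=0.05,                    # first 5% of steps, i.e., 1.5 epochs
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",  # prefer extractive outputs
    fp16=True,
    predict_with_generate=True,
    generation_max_length=128,            # inputs are tokenized to at most 1024 tokens
)
```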
Evaluation
Since no data splits were published with the T50 paper [3], the results are not fully reproducible; models are evaluated on the test split described above. Rather than using the whole test set at once, it is split into 3 equally sized, disjoint random samples of 753 examples each. Each model is evaluated on all 3 samples, and the mean and 95% confidence interval of each score are reported below. Additionally, the table includes the average predicted quote length, the epoch at which the best checkpoint was saved, and the total training time.
Checkpoint | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Avg Quote Length | Best Epoch | Time (h:mm:ss) |
---|---|---|---|---|---|---|---|
T5-base | 0.3758 ± 0.0175 | 0.2990 ± 0.0128 | 0.3628 ± 0.0189 | 0.3684 ± 0.0201 | 18.1576 ± 0.1084 | 1.01 | 3:39:14 |
BART-base | 0.4236 ± 0.0133 | 0.3498 ± 0.0116 | 0.4112 ± 0.0135 | 0.4165 ± 0.0107 | 19.1027 ± 0.1755 | 12.10 | 0:44:48 |
BART-large | 0.4252 ± 0.0240 | 0.3456 ± 0.0204 | 0.4115 ± 0.0206 | 0.4171 ± 0.0209 | 19.2877 ± 0.1819 | 6.05 | 2:43:56 |
BART-large-cnn | 0.4384 ± 0.0225 | 0.3693 ± 0.0197 | 0.4165 ± 0.0239 | 0.4317 ± 0.0234 | 81.8623 ± 1.5324 | 28.23 | 3:48:24 |
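One common way to aggregate the per-sample scores into the mean ± 95% confidence interval reported above is via the t-distribution over the 3 samples; the snippet below is an illustrative sketch, not the exact aggregation code used:

```python
# Mean and 95% confidence interval over the 3 disjoint test samples (n = 3),
# using the t-distribution; the example scores are hypothetical.
import numpy as np
from scipy import stats

def mean_with_ci(scores, confidence=0.95):
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    mean = scores.mean()
    # Half-width of the interval: t_{(1+c)/2, n-1} * s / sqrt(n)
    half_width = stats.t.ppf((1 + confidence) / 2, df=n - 1) * scores.std(ddof=1) / np.sqrt(n)
    return mean, half_width

rouge_l_samples = [0.409, 0.415, 0.411]  # hypothetical per-sample ROUGE-L scores
mean, ci = mean_with_ci(rouge_l_samples)
print(f"ROUGE-L: {mean:.4f} ± {ci:.4f}")
```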
References
[1] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[2] You Had Me at Hello: How Phrasing Affects Memorability
[3] Quote Detection: A New Task and Dataset for NLP
[4] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer