BART-large-quotes

BART [1] fine-tuned for extractive summarization on a dataset of movie and book quotes. Training continued from the BART-large-cnn checkpoint, which was itself fine-tuned on CNN/Daily Mail, a dataset whose summaries are more extractive than abstractive.

Compare: the smaller model BART-base-quotes achieves slightly lower ROUGE scores but favors shorter quotes (about a quarter of the length on average).
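
A minimal usage sketch with the Transformers summarization pipeline is shown below. The example context is a hypothetical placeholder; only the model identifier and the 128-token output limit come from this card.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a summarization pipeline.
summarizer = pipeline("summarization", model="ChrisBridges/bart-large-quotes")

# Hypothetical context passage; at training time the input is the sentences
# surrounding a movie or book quote.
context = (
    "He paused at the door and looked back at the empty room. "
    "Nothing he owned would fit in the car, and none of it mattered anyway. "
    "We are all just renting the lives we think we own. "
    "The landlord smiled and held out his hand for the keys."
)

# max_length matches the 128-token output limit used during training.
result = summarizer(context, max_length=128, truncation=True)
print(result[0]["summary_text"])
```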

Training Description

Dataset

The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset [2] and 5015 book quotes from the T50 dataset [3]. As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences on each side, while 10 sentences are used for book quotes. Training/development/test splits with proportions of 7:1:2 were created with stratified sampling. The tables below report the sample sizes of each split and the length statistics of the contexts and quotes.

| Split | Total | Movie | Book |
|-------|------:|------:|-----:|
| Train | 7906  | 4396  | 3510 |
| Dev   | 1130  | 628   | 502  |
| Test  | 2259  | 1256  | 1003 |
| Data          | min | median | max  | mean ± std      |
|---------------|----:|-------:|-----:|-----------------|
| Movie Context | 38  | 148    | 3358 | 167.13 ± 102.57 |
| Movie Quote   | 5   | 20     | 592  | 28.22 ± 27.79   |
| T50 Context   | 86  | 628    | 6497 | 659.14 ± 310.49 |
| T50 Quote     | 6   | 41     | 877  | 61.87 ± 63.89   |
| Total Context | 38  | 246    | 6497 | 385.58 ± 329.26 |
| Total Quote   | 5   | 26     | 877  | 43.16 ± 50.21   |
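
The 7:1:2 stratified split and the length statistics above could be reproduced along these lines. The combined quotes file, its column names (source, context, quote), and whitespace tokenization are assumptions for illustration, not the exact preprocessing used here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical combined file with one row per quote: a `source` column
# ("movie" or "book") plus `context` and `quote` text columns.
df = pd.read_csv("quotes.csv")

# 7:1:2 train/dev/test split, stratified by source so the movie/book
# ratio is preserved in every split.
train, rest = train_test_split(df, test_size=0.3, stratify=df["source"], random_state=42)
dev, test = train_test_split(rest, test_size=2 / 3, stratify=rest["source"], random_state=42)

def length_stats(texts):
    """min, median, max, mean, std of whitespace-token lengths."""
    lengths = np.array([len(t.split()) for t in texts])
    return lengths.min(), np.median(lengths), lengths.max(), lengths.mean(), lengths.std()

for name, split in [("Train", train), ("Dev", dev), ("Test", test)]:
    counts = split["source"].value_counts()
    print(name, len(split), counts.get("movie", 0), counts.get("book", 0))

print("Context:", length_stats(df["context"]))
print("Quote:", length_stats(df["quote"]))
```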

Parameters

Each experiment uses a maximum input length of 1024 tokens and a maximum output length of 128 tokens to account for the short average length of quotes. Although quote lengths vary considerably, short, poignant statements are of the most interest.

Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with 0.01 weight decay and a peak learning rate of 5e-5 that is annealed linearly. The first 5% of steps, i.e., 1.5 epochs, are used for a linear warmup. The model is evaluated every 500 steps on ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum. After training, the checkpoint with the best eval_rougeL is loaded to favor extractive over abstractive summaries. FP16 mixed precision is used.

In addition, T5-base [4] is trained and evaluated as a baseline with a batch size of 8 (29670 steps), due to its larger memory footprint, and a peak learning rate of 3e-4.

The learning rates were chosen empirically on shorter training runs of 5 epochs.
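
A sketch of the corresponding Hugging Face Seq2SeqTrainingArguments is given below; it only carries over the hyperparameters stated above, while dataset loading, tokenization (1024-token inputs, 128-token targets), and the ROUGE compute_metrics function are assumed to be defined elsewhere.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters from the description above; pass these arguments to a
# Seq2SeqTrainer together with the tokenized train/dev sets and a ROUGE
# compute_metrics function (both omitted here).
args = Seq2SeqTrainingArguments(
    output_dir="bart-large-quotes",
    per_device_train_batch_size=32,   # 8 for the T5-base baseline
    per_device_eval_batch_size=32,
    num_train_epochs=30,
    learning_rate=5e-5,               # 3e-4 for the T5-base baseline
    weight_decay=0.01,
    lr_scheduler_type="linear",       # linear annealing after warmup
    warmup_ratio=0.05,                # linear warmup over the first 5% of steps
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,      # reload the best checkpoint after training
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
    predict_with_generate=True,
    generation_max_length=128,
    fp16=True,
)
```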

Evaluation

Since no data splits were published with the T50 paper [3], the results are not fully reproducible; models are evaluated on the test split of the data described above. Rather than using the whole test set at once, it is split into 3 equally sized, disjoint random samples of 753 examples each. Every model is evaluated on all 3 samples, and the mean and 95% confidence interval of each score are reported below. The table also includes the average predicted quote length, the epoch of the best training checkpoint, and the total training time.

| Checkpoint     | ROUGE-1         | ROUGE-2         | ROUGE-L         | ROUGE-Lsum      | Avg Quote Length | Epochs | Time    |
|----------------|-----------------|-----------------|-----------------|-----------------|------------------|-------:|---------|
| T5-base        | 0.3758 ± 0.0175 | 0.2990 ± 0.0128 | 0.3628 ± 0.0189 | 0.3684 ± 0.0201 | 18.1576 ± 0.1084 | 1.01   | 3:39:14 |
| BART-base      | 0.4236 ± 0.0133 | 0.3498 ± 0.0116 | 0.4112 ± 0.0135 | 0.4165 ± 0.0107 | 19.1027 ± 0.1755 | 12.10  | 0:44:48 |
| BART-large     | 0.4252 ± 0.0240 | 0.3456 ± 0.0204 | 0.4115 ± 0.0206 | 0.4171 ± 0.0209 | 19.2877 ± 0.1819 | 6.05   | 2:43:56 |
| BART-large-cnn | 0.4384 ± 0.0225 | 0.3693 ± 0.0197 | 0.4165 ± 0.0239 | 0.4317 ± 0.0234 | 81.8623 ± 1.5324 | 28.23  | 3:48:24 |
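
A sketch of the evaluation protocol (three disjoint test samples, mean ± 95% confidence interval per metric) using the Hugging Face evaluate library follows. The placeholder predictions and references stand in for the generated and gold quotes of the 2259 test examples, and the t-based interval is an assumption; the card does not state how the interval was computed.

```python
import numpy as np
import evaluate
from scipy import stats

rouge = evaluate.load("rouge")

# Placeholders: in practice these are the generated quotes and the gold
# quotes for the 2259 test examples, in the same order.
predictions = ["generated quote"] * 2259
references = ["gold quote"] * 2259

def mean_ci(values, confidence=0.95):
    """Mean and half-width of a t-based confidence interval over the samples."""
    values = np.asarray(values)
    half = stats.t.ppf((1 + confidence) / 2, df=len(values) - 1) * stats.sem(values)
    return values.mean(), half

# Split the test set into 3 disjoint samples of 753 and score each one.
metrics = ("rouge1", "rouge2", "rougeL", "rougeLsum")
scores = {m: [] for m in metrics}
for idx in np.array_split(np.arange(len(predictions)), 3):
    result = rouge.compute(
        predictions=[predictions[i] for i in idx],
        references=[references[i] for i in idx],
    )
    for m in metrics:
        scores[m].append(result[m])

for m in metrics:
    mean, ci = mean_ci(scores[m])
    print(f"{m}: {mean:.4f} ± {ci:.4f}")
```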

References

[1] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[2] You Had Me at Hello: How Phrasing Affects Memorability
[3] Quote Detection: A New Task and Dataset for NLP
[4] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
