BART-large-quotes

BART [1] fine-tuned for extractive summarization on a dataset of movie and book quotes. Training continued from the BART-large-cnn checkpoint, which was itself fine-tuned on CNN/Daily Mail, a dataset whose summaries are more extractive than abstractive.

Compare: the smaller model BART-base-quotes achieves slightly lower ROUGE scores but favors shorter quotes (about a quarter of the length on average).
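
A minimal usage sketch with the Transformers summarization pipeline is shown below. The example context is a hypothetical placeholder; only the model identifier and the 128-token output limit come from this card.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a summarization pipeline.
summarizer = pipeline("summarization", model="ChrisBridges/bart-large-quotes")

# Hypothetical context passage; at training time the input is the sentences
# surrounding a movie or book quote.
context = (
    "He paused at the door and looked back at the empty room. "
    "Nothing he owned would fit in the car, and none of it mattered anyway. "
    "We are all just renting the lives we think we own. "
    "The landlord smiled and held out his hand for the keys."
)

# max_length matches the 128-token output limit used during training.
result = summarizer(context, max_length=128, truncation=True)
print(result[0]["summary_text"])
```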

Training Description

Dataset

The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset [2] and 5015 book quotes from the T50 dataset [3]. As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences on each side, while 10 sentences are used for book quotes. Training/development/test splits with proportions of 7:1:2 were created with stratified sampling. The tables below report the sample sizes of each split and the length statistics of the contexts and quotes.

| Split | Total | Movie | Book |
|-------|------:|------:|-----:|
| Train | 7906  | 4396  | 3510 |
| Dev   | 1130  | 628   | 502  |
| Test  | 2259  | 1256  | 1003 |
| Data          | min | median | max  | mean ± std      |
|---------------|----:|-------:|-----:|-----------------|
| Movie Context | 38  | 148    | 3358 | 167.13 ± 102.57 |
| Movie Quote   | 5   | 20     | 592  | 28.22 ± 27.79   |
| T50 Context   | 86  | 628    | 6497 | 659.14 ± 310.49 |
| T50 Quote     | 6   | 41     | 877  | 61.87 ± 63.89   |
| Total Context | 38  | 246    | 6497 | 385.58 ± 329.26 |
| Total Quote   | 5   | 26     | 877  | 43.16 ± 50.21   |
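
The 7:1:2 stratified split and the length statistics above could be reproduced along these lines. The combined quotes file, its column names (source, context, quote), and whitespace tokenization are assumptions for illustration, not the exact preprocessing used here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical combined file with one row per quote: a `source` column
# ("movie" or "book") plus `context` and `quote` text columns.
df = pd.read_csv("quotes.csv")

# 7:1:2 train/dev/test split, stratified by source so the movie/book
# ratio is preserved in every split.
train, rest = train_test_split(df, test_size=0.3, stratify=df["source"], random_state=42)
dev, test = train_test_split(rest, test_size=2 / 3, stratify=rest["source"], random_state=42)

def length_stats(texts):
    """min, median, max, mean, std of whitespace-token lengths."""
    lengths = np.array([len(t.split()) for t in texts])
    return lengths.min(), np.median(lengths), lengths.max(), lengths.mean(), lengths.std()

for name, split in [("Train", train), ("Dev", dev), ("Test", test)]:
    counts = split["source"].value_counts()
    print(name, len(split), counts.get("movie", 0), counts.get("book", 0))

print("Context:", length_stats(df["context"]))
print("Quote:", length_stats(df["quote"]))
```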

Parameters

Each experiment uses a maximum input length of 1024 tokens and a maximum output length of 128 tokens to account for the short average length of quotes. Although quote lengths vary considerably, short, poignant statements are of the most interest.

Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with 0.01 weight decay and a peak learning rate of 5e-5 that is annealed linearly. The first 5% of steps, i.e., 1.5 epochs, are used for a linear warmup. The model is evaluated every 500 steps on ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum. After training, the checkpoint with the best eval_rougeL is loaded to favor extractive over abstractive summaries. FP16 mixed precision is used.

In addition, T5-base [4] is trained and evaluated as a baseline with a batch size of 8 (29670 steps), due to its larger memory footprint, and a peak learning rate of 3e-4.

The learning rates were chosen empirically on shorter training runs of 5 epochs.
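
A sketch of the corresponding Hugging Face Seq2SeqTrainingArguments is given below; it only carries over the hyperparameters stated above, while dataset loading, tokenization (1024-token inputs, 128-token targets), and the ROUGE compute_metrics function are assumed to be defined elsewhere.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters from the description above; pass these arguments to a
# Seq2SeqTrainer together with the tokenized train/dev sets and a ROUGE
# compute_metrics function (both omitted here).
args = Seq2SeqTrainingArguments(
    output_dir="bart-large-quotes",
    per_device_train_batch_size=32,   # 8 for the T5-base baseline
    per_device_eval_batch_size=32,
    num_train_epochs=30,
    learning_rate=5e-5,               # 3e-4 for the T5-base baseline
    weight_decay=0.01,
    lr_scheduler_type="linear",       # linear annealing after warmup
    warmup_ratio=0.05,                # linear warmup over the first 5% of steps
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,      # reload the best checkpoint after training
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
    predict_with_generate=True,
    generation_max_length=128,
    fp16=True,
)
```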

Evaluation

Since no data splits were published with the T50 paper [3], the results are not fully reproducible; models are evaluated on the test split of the data described above. Rather than using the whole test set at once, it is split into 3 equally sized, disjoint random samples of 753 examples each. Every model is evaluated on all 3 samples, and the mean and 95% confidence interval of each score are reported below. The table also includes the average predicted quote length, the epoch of the best training checkpoint, and the total training time.

| Checkpoint     | ROUGE-1         | ROUGE-2         | ROUGE-L         | ROUGE-Lsum      | Avg Quote Length | Epochs | Time    |
|----------------|-----------------|-----------------|-----------------|-----------------|------------------|-------:|---------|
| T5-base        | 0.3758 ± 0.0175 | 0.2990 ± 0.0128 | 0.3628 ± 0.0189 | 0.3684 ± 0.0201 | 18.1576 ± 0.1084 | 1.01   | 3:39:14 |
| BART-base      | 0.4236 ± 0.0133 | 0.3498 ± 0.0116 | 0.4112 ± 0.0135 | 0.4165 ± 0.0107 | 19.1027 ± 0.1755 | 12.10  | 0:44:48 |
| BART-large     | 0.4252 ± 0.0240 | 0.3456 ± 0.0204 | 0.4115 ± 0.0206 | 0.4171 ± 0.0209 | 19.2877 ± 0.1819 | 6.05   | 2:43:56 |
| BART-large-cnn | 0.4384 ± 0.0225 | 0.3693 ± 0.0197 | 0.4165 ± 0.0239 | 0.4317 ± 0.0234 | 81.8623 ± 1.5324 | 28.23  | 3:48:24 |
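
A sketch of the evaluation protocol (three disjoint test samples, mean ± 95% confidence interval per metric) using the Hugging Face evaluate library follows. The placeholder predictions and references stand in for the generated and gold quotes of the 2259 test examples, and the t-based interval is an assumption; the card does not state how the interval was computed.

```python
import numpy as np
import evaluate
from scipy import stats

rouge = evaluate.load("rouge")

# Placeholders: in practice these are the generated quotes and the gold
# quotes for the 2259 test examples, in the same order.
predictions = ["generated quote"] * 2259
references = ["gold quote"] * 2259

def mean_ci(values, confidence=0.95):
    """Mean and half-width of a t-based confidence interval over the samples."""
    values = np.asarray(values)
    half = stats.t.ppf((1 + confidence) / 2, df=len(values) - 1) * stats.sem(values)
    return values.mean(), half

# Split the test set into 3 disjoint samples of 753 and score each one.
metrics = ("rouge1", "rouge2", "rougeL", "rougeLsum")
scores = {m: [] for m in metrics}
for idx in np.array_split(np.arange(len(predictions)), 3):
    result = rouge.compute(
        predictions=[predictions[i] for i in idx],
        references=[references[i] for i in idx],
    )
    for m in metrics:
        scores[m].append(result[m])

for m in metrics:
    mean, ci = mean_ci(scores[m])
    print(f"{m}: {mean:.4f} ± {ci:.4f}")
```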

References

[1] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[2] You Had Me at Hello: How Phrasing Affects Memorability
[3] Quote Detection: A New Task and Dataset for NLP
[4] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
