Dataset for the EMNLP 2020 paper, Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles.
Authors: Yao Lu, Yue Dong, Laurent Charlin
Appendix: model implementation and evaluation details.
Word-level statistics:
train/val/test examples | average document length | average summary length | average number of references |
---|---|---|---|
30,369/5,066/5,093 | 778.08 | 116.44 | 4.42 |
We also compare the percentage of novel n-grams in the target summaries with that of previous datasets, three of which (marked "single") are single-document summarization datasets. Multi-XScience has the highest abstractiveness among the multi-document summarization datasets listed.
Dataset | % of novel unigrams | % of novel bigrams | % of novel trigrams | % of novel 4-grams |
---|---|---|---|---|
CNN-DailyMail (single) | 17.00 | 53.91 | 71.98 | 80.29 |
NY Times (single) | 22.64 | 55.59 | 71.93 | 80.16 |
XSum (single) | 35.76 | 83.45 | 95.50 | 98.49 |
WikiSum | 18.20 | 51.88 | 69.82 | 78.16 |
Multi-News | 17.76 | 57.10 | 75.71 | 82.30 |
Multi-XScience | 42.33 | 81.75 | 94.57 | 97.62 |
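
To make the abstractiveness measure above concrete, here is a minimal sketch of one common way to compute a novel n-gram percentage: the share of distinct n-grams in the summary that never appear in the source text. The lowercase whitespace tokenization, the use of distinct n-grams, and the function names `ngrams` and `novel_ngram_percentage` are assumptions made for illustration; the exact procedure behind the table may differ.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_percentage(source_text, summary_text, n):
    """Percentage of distinct summary n-grams that never occur in the source."""
    # Assumption: lowercase whitespace tokenization, chosen only for illustration.
    source = set(ngrams(source_text.lower().split(), n))
    summary = set(ngrams(summary_text.lower().split(), n))
    if not summary:
        return 0.0
    return 100.0 * len(summary - source) / len(summary)


# Toy texts, not taken from the dataset.
source = "the cat sat on the mat"
summary = "the cat lay on the rug"
unigram_pct = novel_ngram_percentage(source, summary, 1)
bigram_pct = novel_ngram_percentage(source, summary, 2)
print(unigram_pct, bigram_pct)
```

For a multi-document example, `source_text` would presumably be the concatenation of all source documents, which is also an assumption here.
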
key | description |
---|---|
aid | arXiv ID (e.g. 2010.14235) |
mid | Microsoft Academic Graph ID |
abstract | text of paper abstract |
ref_abstract | meta-information of reference papers |
ref_abstract.cite_N | meta-information of reference paper cite_N (special cite symbol) |
ref_abstract.cite_N.mid | reference paper's (cite_N) Microsoft Academic Graph ID |
ref_abstract.cite_N.abstract | text of reference paper (cite_N) abstract |
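
The nesting described by the keys above can be sketched as a Python dictionary. The record below is hypothetical: all values are placeholders rather than real dataset entries, only the keys listed in the table are shown, and the helper `source_documents` is an illustrative name, not part of any released code.

```python
# Hypothetical record; values are placeholders, not real dataset entries.
record = {
    "aid": "2010.14235",
    "mid": "0000000001",
    "abstract": "Abstract text of the query paper.",
    "ref_abstract": {
        "cite_1": {"mid": "0000000002", "abstract": "Abstract text of the first reference."},
        "cite_2": {"mid": "0000000003", "abstract": "Abstract text of the second reference."},
    },
}


def source_documents(record):
    """Return the paper abstract followed by each reference abstract."""
    docs = [record["abstract"]]
    # Keys are sorted as strings, so cite_10 would come before cite_2.
    for cite_key in sorted(record["ref_abstract"]):
        docs.append(record["ref_abstract"][cite_key]["abstract"])
    return docs


docs = source_documents(record)
num_references = len(record["ref_abstract"])
print(num_references, docs)
```
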
Our dataset is aligned with Microsoft Academic Graph. Anyone interested in the intersection of graph and summarization can use our dataset for exploration.
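
As one possible starting point for such exploration, the sketch below collects directed citation edges between Microsoft Academic Graph IDs, using only the `mid` and `ref_abstract` keys. The toy records, the function name `citation_edges`, and the choice to skip references with an empty `mid` are assumptions for illustration.

```python
def citation_edges(records):
    """Collect (citing mid, cited mid) pairs from a list of records."""
    edges = set()
    for record in records:
        for ref in record["ref_abstract"].values():
            # Assumption: a reference without a usable ID is skipped.
            if ref["mid"]:
                edges.add((record["mid"], ref["mid"]))
    return edges


# Toy records with placeholder IDs, not real dataset entries.
records = [
    {"mid": "A", "ref_abstract": {"cite_1": {"mid": "B", "abstract": "..."},
                                  "cite_2": {"mid": "C", "abstract": "..."}}},
    {"mid": "D", "ref_abstract": {"cite_1": {"mid": "B", "abstract": "..."},
                                  "cite_2": {"mid": "", "abstract": "..."}}},
]

edges = citation_edges(records)
# Papers "A" and "D" share the reference "B", which links them in the graph.
print(sorted(edges))
```
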