This repository contains the dataset, model weights, and generation code for our paper "TLDR: Extreme Summarization of Scientific Documents".
A running demo of our model can be found here.
We use Fairseq to train and evaluate our models. To install all requirements, run:
$ pip install -r requirements.txt
For evaluation, you will need files2rouge. Please install my fork of the repo.
To format the data for use with the Fairseq library, run:
$ cd SciTLDR-Data
$ export TASK=SciTLDR-A # Choose from {A, AIC, FullText}
$ python to_stories.py $TASK # Convert to story format
$ chmod +x make_datafiles.sh
$ ./make_datafiles.sh # BPE preprocess
The evaluation script, evaluate.py, takes a test.source file, in which each line is one input, and writes a test.hypo file with one prediction per line. It reads a test.jsonl file as the reference and stores the ROUGE score in test.hypo.score.
$ python evaluate.py SciTLDR-Data/SciTLDR-A /path/to/model/dir/ --checkpoint_file scitldr_ao_model.pt --beam 4 --lenpen 0.6
OR
$ python evaluate.py SciTLDR-Data/SciTLDR-AIC /path/to/model/dir/ --checkpoint_file scitldr_aic_model.pt --beam 2 --lenpen 0.2
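The line-by-line file contract described above (one input per line in test.source, one prediction per line in test.hypo) can be sketched in Python. This is a minimal illustration under stated assumptions, not the repository's evaluate.py: `write_hypotheses`, `first_words`, and `demo` are hypothetical names, and the placeholder summarizer stands in for the trained model. ROUGE scoring against test.jsonl is left out.

```python
import os
import tempfile


def write_hypotheses(source_path, hypo_path, summarize):
    """Read one input per line and write one prediction per line.

    `summarize` is any callable mapping an input string to a summary string.
    Returns the number of lines processed.
    """
    count = 0
    with open(source_path, encoding="utf-8") as src, \
            open(hypo_path, "w", encoding="utf-8") as hyp:
        for line in src:
            prediction = summarize(line.rstrip("\n"))
            # Collapse whitespace so each prediction stays on a single line
            # and line i of the output matches line i of the input.
            hyp.write(" ".join(prediction.split()) + "\n")
            count += 1
    return count


def first_words(text, n=5):
    """Hypothetical placeholder for the model: keep the first n words."""
    return " ".join(text.split()[:n])


def demo():
    """Run the sketch on a tiny temporary test.source file."""
    with tempfile.TemporaryDirectory() as tmp:
        source_path = os.path.join(tmp, "test.source")
        hypo_path = os.path.join(tmp, "test.hypo")
        with open(source_path, "w", encoding="utf-8") as f:
            f.write("We introduce a new task for summarizing papers .\n")
            f.write("Short input\n")
        n = write_hypotheses(source_path, hypo_path, first_words)
        with open(hypo_path, encoding="utf-8") as f:
            return n, f.read().splitlines()
```

Calling `demo()` writes a two-line test.source to a temporary directory and returns the line count together with the contents of the resulting test.hypo. Keeping predictions aligned with inputs by line number is what lets a scorer pair each hypothesis with its reference.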
If you use our code, dataset, or model weights in your research, please cite "TLDR: Extreme Summarization of Scientific Documents."
@article{cachola2020tldr,
title={{TLDR}: Extreme Summarization of Scientific Documents},
author={Isabel Cachola and Kyle Lo and Arman Cohan and Daniel S. Weld},
journal={arXiv:2004.15011},
year={2020},
}
SciTLDR is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.