Extractive sentence compression shortens a source sentence S into a compression C by deleting words from S.
For instance:
S: Gazprom, the Russian state gas giant, announced a 40 percent increase in the price of natural gas sold to Ukraine, which is heavily dependent on Russia for its gas supply.
C: Gazprom announced a 40 percent increase in the price of gas sold to Ukraine.
This repo presents our linear-time, query-focused sentence compression technique. Given a source sentence S and a set of query tokens Q, we produce a C that contains all of the words in Q and is shorter than some character budget b.
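Concretely, a valid output must satisfy three constraints: C is an ordered subsequence of S's tokens, C contains every query token, and C fits the character budget. A minimal sketch of that validity check (the function name and the simple whitespace-plus-punctuation-stripping tokenizer are illustrative choices, not the repo's actual code):

```python
def toks(text):
    # whitespace tokenization with trailing punctuation stripped;
    # an illustrative choice, not the repo's actual tokenizer
    return [t.strip(".,") for t in text.split()]

def is_valid_compression(source, compression, query, budget):
    """Check the three constraints of query-focused extractive compression:
    C is an ordered subsequence of S, C contains every token of Q,
    and C fits the character budget b."""
    s, c = toks(source), toks(compression)
    # (1) ordered subsequence: each token of C must be matched, in order,
    # against the remaining tokens of S (membership consumes the iterator)
    it = iter(s)
    is_subsequence = all(tok in it for tok in c)
    # (2) every query token appears in C
    covers_query = set(query).issubset(c)
    # (3) character budget
    within_budget = len(compression) <= budget
    return is_subsequence and covers_query and within_budget

S = ("Gazprom the Russian state gas giant announced a 40 percent increase "
     "in the price of natural gas sold to Ukraine which is heavily dependent "
     "on Russia for its gas supply.")
C = "Gazprom announced a 40 percent increase in the price of gas sold to Ukraine."
print(is_valid_compression(S, C, {"Gazprom", "Ukraine"}, budget=80))  # True
```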
Our method is much faster than ILP-based methods, another class of algorithms that can also perform query-focused compression. We describe our method in our companion paper.
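Our method builds compressions bottom-up by adding words ("vertex addition"). Purely to illustrate the problem setting, here is a toy greedy baseline of our own: start from the query tokens and keep adding the highest-scoring remaining words while the budget allows. This sketch is not the algorithm from the companion paper, and the scoring function is a placeholder:

```python
def greedy_compress(source_tokens, query, budget, score):
    """Toy greedy compressor: keep the query tokens, then add the
    highest-scoring remaining tokens while the compression (tokens
    joined by spaces, in source order) still fits the budget.

    NOT the vertex-addition method from the companion paper; an
    illustrative baseline only.
    """
    keep = {i for i, t in enumerate(source_tokens) if t in query}
    # remaining candidate positions, best score first
    rest = sorted((i for i in range(len(source_tokens)) if i not in keep),
                  key=lambda i: -score(source_tokens[i]))

    def render(indices):
        return " ".join(source_tokens[i] for i in sorted(indices))

    for i in rest:
        if len(render(keep | {i})) <= budget:
            keep.add(i)
    out = render(keep)
    if len(out) > budget:  # the query tokens alone may already exceed b
        raise ValueError("query does not fit the budget")
    return out

# demo with a placeholder score (token length as a crude content proxy)
tokens = "a bb ccc dd e".split()
print(greedy_compress(tokens, {"a"}, budget=8, score=len))  # 'a bb ccc'
```

Note that each candidate is checked against the budget before it is added, so the loop never has to backtrack, which is what keeps a greedy pass cheap compared to solving an ILP.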
| Directory | Contents |
|---|---|
| `bottom_up_clean` | code for vertex addition |
| `code` | utilities, such as printers, loggers and significance testers |
| `dead_code` | old code not in use |
| `ilp2013` | F & A implementation |
| `emnlp` | paper & writing |
| `klm` | some utilities for computing SLOR |
| `paperzip` | .tex for softconf, for the XML proceedings |
| `preproc` | preprocessing code |
| `scripts` | runs experiments |
| `snapshots` | ILP weights, learned from training; committed for replicability because ILP training takes days |
The timing experiments use:

- `scripts/test_timing_results.sh`
- `scripts/rollup_times.R`
- `scripts/latencies.R`
- The script `make_results_master.ipynb` gets the numbers for this table based on two files: `bottom_up_clean/results.csv` and `bottom_up_clean/all_times_rollup.csv`
- Note: this notebook also runs `scripts/latencies.R` to make figure 3
- Those results files are created via the script `scripts/test_results.sh`
- The plot `emnlp/times.pdf` comes from `scripts/latencies.R`
    - R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
    - Tidyverse version 1.2.1
- The neural net uses `models/125249540`
- The params of the network are stored in the AllenNLP config file `models/125249540/config.json`
The train/test data is packaged as `preproc/*.paths` files (for oracle paths). These files are created by the preprocessing scripts (`$ fab preproc`). They are actually jsonl, but renaming them is not a priority; they were once pickled.
Some of these files are too big to commit directly (even zipped), but split and zipped forms are included in the repo. To remake them from the split/zipped versions, run `./scripts/unzip_paths.sh`.
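Since jsonl stores one JSON object per line, the `.paths` files can be streamed record by record rather than loaded whole. A sketch of such a reader; the `sentence`/`compression` fields in the demo are made up for illustration, not the repo's real `.paths` schema:

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one decoded JSON object per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo with a throwaway file; these field names are hypothetical,
# not the actual layout of the repo's .paths records.
with tempfile.NamedTemporaryFile("w", suffix=".paths", delete=False) as tmp:
    tmp.write('{"sentence": "a b c", "compression": "a c"}\n')
    tmp.write('{"sentence": "x y", "compression": "x"}\n')
    demo = tmp.name

records = list(read_jsonl(demo))
os.remove(demo)
print(len(records))  # 2
```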