Causeway is a system for detecting explicit causal relations in text. It tags text using the BECAUSE 1.0 annotation scheme, described in Dunietz et al., 2015. The system itself is described in Dunietz et al., 2017.
Note that the repository includes some code for reading in data in an updated version of the annotation scheme (BECAUSE 2.x). This newer scheme is backwards-compatible with the original.
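For orientation, BECAUSE annotations are distributed as brat standoff (`.ann`) files paired with plain-text files. The snippet below is a hand-constructed schematic of ours, not taken from the corpus, showing roughly how a causal connective and its Cause/Effect arguments might be marked up for the sentence "Because it rained, we stayed home."; consult the BECAUSE repository for the authoritative scheme definition:

```
T1	Consequence 0 7	Because
T2	Cause 8 17	it rained
T3	Effect 19 33	we stayed home
E1	Consequence:T1 Cause:T2 Effect:T3
```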
The steps to reproduce the results from the 2017 Causeway paper are given below. If you have any difficulty doing so or have additional questions, please contact Jesse Dunietz, who will be happy to assist.
NOTE: You may also be interested in DeepCx, a neural network tagger that supersedes Causeway. DeepCx achieves substantially better performance on all versions of the BECAUSE dataset.
To reproduce the results from the Causeway paper:
1. You'll want to do this in Ubuntu, the only platform Causeway has been tested on. It may work on other *nix platforms, but you'll be on your own for getting it to do so.

   You'll need some standard Ubuntu packages, which you can install using `apt` if you don't have them:
   ```bash
   sudo apt install git python2 python-pip sed task-spooler default-jdk  # or any JDK
   ```
2. Install the external Python packages that Causeway depends on:
   ```bash
   sudo pip2 install bidict colorama nltk cython python-gflags numpy scipy scikit-learn python-crfsuite
   ```
   Also make sure that NLTK has access to WordNet:
   ```bash
   python2 -c "import nltk; nltk.download('wordnet')"
   ```
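   As a quick optional sanity check (this snippet is ours, not part of the original instructions), confirm that WordNet actually loads:
   ```bash
   python2 -c "from nltk.corpus import wordnet; print(wordnet.synsets('cause')[0])"
   ```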
3. Clone the Causeway repository, including the NLPypline framework for NLP pipelines (included as a Git submodule):
   ```bash
   git clone --recursive https://github.com/duncanka/Causeway.git
   ```
   We'll refer to the resulting `Causeway` directory as `$CAUSEWAY_DIR`.

4. Compile the one Cython file in the project:
   ```bash
   (cd $CAUSEWAY_DIR/NLPypline/src/nlpypline/util && cythonize -i streams.pyx)
   ```
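   To verify the compilation (an optional check of ours, assuming `NLPypline/src` is the Python package root, as the path above suggests):
   ```bash
   (cd $CAUSEWAY_DIR/NLPypline/src && python2 -c "from nlpypline.util import streams")
   ```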
5. Reconstitute the BECAUSE 1.0 corpus. (Of course, you can also use the latest version of BECAUSE if you are not trying to reproduce the Causeway paper results.)

   1. Clone the repository from whatever directory you'd like the data to live in:
      ```bash
      git clone https://github.com/duncanka/BECAUSE.git
      (cd BECAUSE && git checkout 1.0)  # skip if using latest BECAUSE version
      ```
      We'll refer to the resulting directory named `BECAUSE` as `$BECAUSE_DIR`.

   2. Extract the raw WSJ text corresponding to the PTB subset used in BECAUSE. Assuming you have the PTB2 files unpacked in `$PTB_DIR` (with the same directory structure as the official CD), run the following:
      ```bash
      for ANN_FILE in $BECAUSE_DIR/PTB/*.ann; do
          BASE_FILE=$(basename $ANN_FILE)
          DIGITS=$(echo $BASE_FILE | cut -d'_' -f2)
          tail -n +3 $PTB_DIR/raw/${DIGITS:0:2}/${BASE_FILE%.*} > $BECAUSE_DIR/PTB/${BASE_FILE%.*}.txt
      done
      ```
      You should end up with a bunch of `.txt` files alongside the `.ann` files in the `PTB` subdirectory.

   3. Run the NYT text extraction script on your LDC-licensed copy of the NYT corpus, which let's assume is stored in directory `$NYT_DIR`:
      ```bash
      python2 $BECAUSE_DIR/scripts/extract_nyt_txt.py $BECAUSE_DIR/NYT \
          $(for FNAME in $BECAUSE_DIR/NYT/*.ann; do find $NYT_DIR -name $(basename "${FNAME%.ann}.xml"); done)
      ```
      Again, you should end up with a bunch of `.txt` files alongside the `.ann` files in the `NYT` subdirectory.
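   One optional way to sanity-check both extractions (a snippet of ours, not from the original instructions) is to confirm that each subdirectory now has as many `.txt` files as `.ann` files:
   ```bash
   for DIR in $BECAUSE_DIR/PTB $BECAUSE_DIR/NYT; do
       echo "$DIR: $(ls $DIR/*.ann | wc -l) .ann, $(ls $DIR/*.txt | wc -l) .txt"  # counts should match
   done
   ```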
6. Set up version 3.5.2 of the Stanford parser.

   1. Download the full Stanford CoreNLP package. Unzip it somewhere, resulting in a folder called `stanford-corenlp-full-2015-04-20` (henceforth, `$STANFORD_DIR`).

   2. Unzip the pretrained PCFG and NER models:
      ```bash
      unzip $STANFORD_DIR/stanford-corenlp-3.5.2-models.jar edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz -d $STANFORD_DIR
      unzip -j $STANFORD_DIR/stanford-corenlp-3.5.2-models.jar edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz -d $STANFORD_DIR/classifiers
      ```
   3. Apply the Causeway-specific patches to the Stanford parser. The following hacky script should do the trick:
      ```bash
      mkdir /tmp/stanford-sources
      unzip $STANFORD_DIR/stanford-corenlp-3.5.2-sources.jar -d /tmp/stanford-sources
      cp $CAUSEWAY_DIR/stanford-patches/*.patch /tmp/stanford-sources
      (cd /tmp/stanford-sources && {
          for PATCH in *.patch; do
              patch -p 2 < $PATCH
          done
      })
      TO_RECOMPILE=$(grep '+++' /tmp/stanford-sources/*.patch | sed -e 's/.*\(edu.*\.java\).*/\1/' | sort | uniq)
      for SRC_FILE in $TO_RECOMPILE; do
          javac -cp /tmp/stanford-sources "/tmp/stanford-sources/$SRC_FILE"
          for CLASS_FILE in /tmp/stanford-sources/${SRC_FILE%.java}*.class; do
              jar uf $STANFORD_DIR/stanford-corenlp-3.5.2.jar -C /tmp/stanford-sources/ "${CLASS_FILE#/*/*/}"
          done
      done
      rm -R /tmp/stanford-sources
      ```
      You might see a bit of error output from the Java compiler. Don't worry about it.
   4. Create the TRegex/TSurgeon run scripts (adapted from the standalone TRegex download):
      ```bash
      printf '#!/bin/bash\nexport CLASSPATH=$(dirname $0)/stanford-corenlp-3.5.2.jar:$CLASSPATH\njava -mx100m edu.stanford.nlp.trees.tregex.TregexPattern "$@"\n' > $STANFORD_DIR/tregex.sh
      printf '#!/bin/bash\nexport CLASSPATH=$(dirname $0)/stanford-corenlp-3.5.2.jar:$CLASSPATH\njava -mx100m edu.stanford.nlp.trees.tregex.tsurgeon.Tsurgeon "$@"\n' > $STANFORD_DIR/tsurgeon.sh
      chmod ugo+x $STANFORD_DIR/tregex.sh $STANFORD_DIR/tsurgeon.sh
      ```
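      To confirm the scripts work (a hypothetical smoke test of ours, not part of the original setup), you can run TRegex on a toy parse tree:
      ```bash
      echo '(ROOT (S (NP (DT The) (NN rain)) (VP (VBD caused) (NP (NNS delays)))))' > /tmp/toy.mrg
      $STANFORD_DIR/tregex.sh 'VP < (VBD < caused)' /tmp/toy.mrg  # should print the matching VP subtree
      ```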
7. Run the Stanford parser on the data:
   ```bash
   for DATA_DIR in $BECAUSE_DIR/PTB $BECAUSE_DIR/NYT $BECAUSE_DIR/CongressionalHearings; do
       $CAUSEWAY_DIR/scripts/preprocess.sh $DATA_DIR $STANFORD_DIR
   done
   ```
8. For the PTB files, extract the gold-standard parse trees (to enable gold-standard parse experiments).

   1. Correct a silly PTB tokenization error in one of the `.mrg` files that breaks the system:
      ```bash
      (cd $PTB_DIR/combined/14/ && patch -p1 < $CAUSEWAY_DIR/wsj_1457.mrg.patch)
      ```
      (If you don't want to modify your main PTB copy, you can copy the PTB data over to a new directory and point `$PTB_DIR` to it.)

   2. Run the command to extract the trees:
      ```bash
      $CAUSEWAY_DIR/scripts/convert-mrg.sh $BECAUSE_DIR/PTB $PTB_DIR/combined $STANFORD_DIR
      ```
9. Run the system.

   1. Edit the `BECAUSE_DIR` and `STANFORD_DIR` variables in `run_all_pipelines.sh` to match your setup.
   2. Run the script from the root Causeway directory, as sketched below.
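   For example (a sketch of ours with placeholder paths; substitute your own locations):
   ```bash
   # In run_all_pipelines.sh, set the two variables to your actual paths, e.g.:
   #   BECAUSE_DIR=/home/you/data/BECAUSE
   #   STANFORD_DIR=/home/you/tools/stanford-corenlp-full-2015-04-20

   # Then run the pipeline from the Causeway root:
   cd $CAUSEWAY_DIR && bash run_all_pipelines.sh
   ```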
References:

Dunietz, Jesse, Lori Levin, and Jaime Carbonell. Automatically Tagging Constructions of Causation and Their Slot-Fillers. Transactions of the Association for Computational Linguistics 5 (2017): 117-133.
Dunietz, Jesse, Lori Levin, and Jaime Carbonell. Annotating Causal Language Using Corpus Lexicography of Constructions. Proceedings of LAW IX – The 9th Linguistic Annotation Workshop (2015): 188-196.