
UCCA preprocessing

Matthias Lindemann edited this page Jul 23, 2019 · 5 revisions

Producing UCCA training data for the AM parser

Step 1: Extract tokens from companion data

mkdir -p data_ucca/companion_tokens
python ucca/get_companion_tokenization.py data_ucca/companion/ucca/ data_ucca/companion_tokens/
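
To give a rough idea of what extracting tokens from the companion data involves, here is a minimal sketch that reads CoNLL-U text and collects the token forms per sentence. It is not the code of get_companion_tokenization.py. The function name read_conllu_tokens is hypothetical, and the sketch assumes that each sentence is introduced by a comment line carrying its ID (for example "#20001001") and that the word form is in the second tab-separated column.

```python
def read_conllu_tokens(text):
    """Return {sentence_id: [token forms]} for CoNLL-U formatted text.

    Assumption: the first comment line of each sentence holds its ID.
    """
    sentences = {}
    sent_id, tokens = None, []
    # The extra "" at the end flushes the last sentence if the text
    # does not end with a blank line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if not line:
            if tokens:
                # Fall back to a running index if no ID comment was seen.
                key = sent_id if sent_id is not None else str(len(sentences))
                sentences[key] = tokens
            sent_id, tokens = None, []
        elif line.startswith("#"):
            if sent_id is None:
                sent_id = line.lstrip("#").strip()
        else:
            cols = line.split("\t")
            # Keep plain token rows only; skip multiword ranges such as
            # "1-2" and empty nodes such as "1.1".
            if len(cols) > 1 and cols[0].isdigit():
                tokens.append(cols[1])
    return sentences


sample = "#100\n1\tThe\tthe\tDET\n2\tdog\tdog\tNOUN\n\n#101\n1\tHi\thi\tINTJ\n"
tokens_by_id = read_conllu_tokens(sample)
```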

Step 2: Create Alto corpora

mkdir -p data_ucca/alto_corpus
python ucca/convert_training_into_alto_corpus.py data_ucca/ucca/ data_ucca/companion_tokens/ data_ucca/alto_corpus/

This takes a few minutes and writes two files in Alto format, training.txt and dev.txt, to the alto_corpus directory. It also produces two MRP files that correspond to this training/dev split.

Step 3: Decompose and write AM-CoNLL file

Running CreateCorpusParallel from am-tools produces an AM-CoNLL file, which serves as input to the am-parser model.

mkdir -p data_ucca/amconll_corpus
java -cp ../am-tools/build/libs/am-tools-all.jar de.saar.coli.amrtagging.formalisms.ucca.tools.CreateCorpusParallel -c data_ucca/alto_corpus/training.txt -o data_ucca/amconll_corpus -p training --companion data_ucca/companion/all_ucca.conllu

This will take about an hour. Feel free to specify a timeout to speed this up.

The outcome will be two files in the amconll_corpus subdirectory: one AM-CoNLL file and one file with supertags.
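
A quick way to see how many sentences ended up in the AM-CoNLL file is to count its sentence blocks. The sketch below assumes the usual CoNLL-style layout: sentences separated by blank lines, with optional comment lines starting with "#". The function name count_sentences is hypothetical and not part of am-tools.

```python
def count_sentences(lines):
    """Count blank-line-separated blocks that contain at least one token row."""
    count = 0
    has_tokens = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current block.
            if has_tokens:
                count += 1
            has_tokens = False
        elif not stripped.startswith("#"):
            has_tokens = True
    # Count the final block if the input does not end with a blank line.
    if has_tokens:
        count += 1
    return count


# Usage on the file produced in this step:
# with open("data_ucca/amconll_corpus/training.amconll", encoding="utf-8") as f:
#     print(count_sentences(f))
```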

Contraction sanity check

To find out how accurately the contractions can be reversed, continue as follows:

Step 4: Evaluate AM-CoNLL file to MRP graphs

java -Xmx8G -cp ../am-tools/build/libs/am-tools-all.jar de.saar.coli.amrtagging.mrp.tools.EvaluateMRP --corpus data_ucca/amconll_corpus/training.amconll --out data_ucca/training.mrp

Step 5: Uncontract edges in MRP graphs

python ucca/decompress_mrp.py data_ucca/training.mrp data_ucca/training_uncontracted.mrp

Step 6: Remove labels

python ucca/remove_labels.py data_ucca/training_uncontracted.mrp data_ucca/training_uncontracted_no_labels.mrp
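
For orientation, this is a minimal sketch of what removing labels from an MRP file could look like. It is not the code of remove_labels.py. It assumes that "labels" means the "label" field of the entries in each graph's "nodes" list, and that edge labels stay untouched. MRP files contain one JSON object per line. The function name strip_node_labels is hypothetical.

```python
import json


def strip_node_labels(mrp_line):
    """Return the MRP graph on this line with all node labels removed."""
    graph = json.loads(mrp_line)
    for node in graph.get("nodes", []):
        # Drop the label if present; nodes without one are left as they are.
        node.pop("label", None)
    return json.dumps(graph)


example = (
    '{"id": "x", "nodes": [{"id": 0, "label": "a"}, {"id": 1}], '
    '"edges": [{"source": 1, "target": 0, "label": "A"}]}'
)
stripped = strip_node_labels(example)
```

Applied to a whole file, the function would be called once per non-empty line, and each result written to the output file on its own line.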

Step 7: Compare against original corpus

python ../mtool/main.py --read mrp --score mrp --gold data_ucca/ucca/ewt.mrp data_ucca/training_uncontracted_no_labels.mrp
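
Steps 4 to 7 can also be chained from a small driver script. The sketch below only assembles the commands exactly as they are given above and prints them. The helper names sanity_check_commands and run_all are hypothetical, and the default paths are the ones used on this page. With dry_run=False, each command is executed in order and the script stops at the first failure.

```python
import shlex
import subprocess


def sanity_check_commands(data="data_ucca",
                          jar="../am-tools/build/libs/am-tools-all.jar",
                          mtool="../mtool/main.py"):
    """Build the argument lists for steps 4 to 7 of the sanity check."""
    amconll = f"{data}/amconll_corpus/training.amconll"
    mrp = f"{data}/training.mrp"
    uncontracted = f"{data}/training_uncontracted.mrp"
    no_labels = f"{data}/training_uncontracted_no_labels.mrp"
    return [
        # Step 4: evaluate the AM-CoNLL file to MRP graphs
        ["java", "-Xmx8G", "-cp", jar,
         "de.saar.coli.amrtagging.mrp.tools.EvaluateMRP",
         "--corpus", amconll, "--out", mrp],
        # Step 5: uncontract edges
        ["python", "ucca/decompress_mrp.py", mrp, uncontracted],
        # Step 6: remove labels
        ["python", "ucca/remove_labels.py", uncontracted, no_labels],
        # Step 7: compare against the original corpus
        ["python", mtool, "--read", "mrp", "--score", "mrp",
         "--gold", f"{data}/ucca/ewt.mrp", no_labels],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)


commands = sanity_check_commands()
run_all(commands, dry_run=True)
```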