
Setup and file locations on the Saarland servers

jgroschwitz edited this page Nov 24, 2023 · 14 revisions

The conda environment

First-time setup

  • cd to your home directory.
  • Create a file `.condarc` with the following content:

```
envs_dirs:
  - /proj/irtg.shadow/conda/envs
```
  • Further create a file `run_conda.sh` with the following content:

```shell
# set UTF encoding right
export LC_ALL=en_US.UTF-8

# run conda
. /proj/contrib/anaconda3/etc/profile.d/conda.sh

# for comet_ml
export https_proxy="http://www-proxy.uni-saarland.de:3128"
```
  • Clone the repository.

Note that the prediction scripts mentioned in the quick guide of the main readme (as well as the training scripts) download large model files (~1.5 GB) and the large am-tools jar (~0.5 GB). It is therefore recommended to clone this repository to a /local/your_username/ directory and work there, rather than in your home directory.
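The first-time steps above can be scripted roughly as follows (a sketch only; the repository URL is omitted, see the main readme, and the file contents are exactly those given above):

```shell
# First-time setup sketch: create .condarc and run_conda.sh in $HOME.
cd "$HOME"

# conda should look for environments in the shared project directory
cat > .condarc <<'EOF'
envs_dirs:
  - /proj/irtg.shadow/conda/envs
EOF

# helper script to source on every login
cat > run_conda.sh <<'EOF'
# set UTF encoding right
export LC_ALL=en_US.UTF-8

# run conda
. /proj/contrib/anaconda3/etc/profile.d/conda.sh

# for comet_ml
export https_proxy="http://www-proxy.uni-saarland.de:3128"
EOF

# clone to /local/<your_username>, not $HOME (large model downloads)
mkdir -p "/local/$USER" 2>/dev/null || true
# cd "/local/$USER" && git clone <repository-url>
```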

Whenever you log into the server

  • cd to your home directory and run

```shell
. run_conda.sh
conda activate allennlp
```

This activates the environment at /proj/irtg.shadow/conda/envs/allennlp.

  • cd to the cloned repository. You should now be good to run this parser.
  • To leave the environment, use `conda deactivate`.

Java version

The server only has Java 8 installed. This works with the automatically downloaded am-tools.jar. However, a self-compiled am-tools.jar (e.g. from the new_decomposition branch) will not run, since current am-tools requires Java 11. In that case you will need, for example, a Docker setup in which you can choose the Java version yourself.

Note that am-tools has a branch called java_8, which can be compiled with Java 8. At the time of writing (November 2023), this branch is up to date with the master branch of am-tools, except that it depends on an older version of alto.
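To see at a glance which jar you can run on a given machine, a quick version check helps. A minimal sketch (the helper name `java_major_version` is made up; note that Java 8 reports its version as "1.8.x", while Java 9+ reports the major version first):

```shell
# Hypothetical helper: decide whether the Java on PATH can run a
# self-compiled am-tools.jar (which needs Java 11+).
java_major_version() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;  # old scheme: 1.8.0_292 -> 8
    *)   echo "$1" | cut -d. -f1 ;;  # new scheme: 11.0.19  -> 11
  esac
}

if command -v java >/dev/null 2>&1; then
  version=$(java -version 2>&1 | head -n1 | sed 's/.*"\([^"]*\)".*/\1/')
  if [ "$(java_major_version "$version")" -lt 11 ]; then
    echo "Java < 11: stick to the downloaded am-tools.jar or the java_8 branch"
  fi
fi
```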

File locations

Test sets for reproduction

The quick guide in the main readme documents how to reproduce our parsing results. The script requires as input the test set you want to parse (the -i option); these files exist on the Saarland servers (you may need to be a member of the irtg group to access them). The file locations are:

| Formalism | Test set | Path |
|---|---|---|
| DM | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.id/en.id.dm.sdp` |
| DM | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/test.ood/en.ood.dm.sdp` |
| PAS | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/test.id/en.id.pas.sdp` |
| PAS | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/test.ood/en.ood.pas.sdp` |
| PSD | in-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/test.id/en.id.psd.sdp` |
| PSD | out-of-domain test set | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/test.ood/en.ood.psd.sdp` |
| EDS | test set | `/proj/irtg/sempardata/ACL2019/EDS/original_data/test.amr` |
| AMR 2017 | test set | `/proj/irtg/amrtagging/amrtagging/corpora/abstract_meaning_representation_amr_2.0/data/amrs/split/test/` |
| AMR 2015 | test set* | `/proj/irtg/amrtagging/cleanAMRData_02-2019/AMRBank/LDC2015E86/test/` |

*Note that the current setup uses a model trained on AMR 2017, as well as lookup data from that corpus for postprocessing, so results on this test set are not comparable.

You can find training and dev sets in similar locations, except that there is apparently no dev set for EDS (EDIT: use data_preparation/EDS_split_train_dev.py to create one).
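For scripting predictions over several formalisms, the SDP and EDS test-set paths from the table above can be collected in a small lookup table, e.g. in Python (a sketch; the AMR sets live under different roots and are omitted here):

```python
# Test-set locations from the table above, keyed by (formalism, domain).
SEMEVAL = "/proj/irtg/sempardata/ACL2019/SemEval/2015"

TEST_SETS = {
    ("DM", "id"): f"{SEMEVAL}/DM/test.id/en.id.dm.sdp",
    ("DM", "ood"): f"{SEMEVAL}/DM/test.ood/en.ood.dm.sdp",
    ("PAS", "id"): f"{SEMEVAL}/PAS/test.id/en.id.pas.sdp",
    ("PAS", "ood"): f"{SEMEVAL}/PAS/test.ood/en.ood.pas.sdp",
    ("PSD", "id"): f"{SEMEVAL}/PSD/test.id/en.id.psd.sdp",
    ("PSD", "ood"): f"{SEMEVAL}/PSD/test.ood/en.ood.psd.sdp",
    ("EDS", "id"): "/proj/irtg/sempardata/ACL2019/EDS/original_data/test.amr",
}

# e.g. the path to pass as -i for out-of-domain DM:
print(TEST_SETS[("DM", "ood")])
```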

Raw corpora

To run the preprocessing you need the original corpora. These files exist on the Saarland servers; you may need to be a member of the irtg group to access them. The file locations are:

| Formalism | Path |
|---|---|
| DM | `/proj/irtg/sempardata/sdp/2015/` |
| PAS | `/proj/irtg/sempardata/sdp/2015/` |
| PSD | `/proj/irtg/sempardata/sdp/2015/` |
| EDS | `/proj/irtg/amrtagging/SDP/EDS/raw_data/` |
| AMR 2017 | `/proj/corpora/abstract_meaning_representation_amr_2.0_LDC2017T10/abstract_meaning_representation_amr_2.0/data/amrs/split/` |
| AMR 2015 | `/proj/irtg/amrtagging/cleanAMRData_02-2019/AMRBank/LDC2015E86` |

Preprocessed corpora

If you just want to train a new model and skip the preprocessing, you can find the preprocessed train and dev sets on the Saarland servers (you may need to be a member of the irtg group to access them). The train.amconll and dev.amconll files for each formalism are in the train and dev subdirectories of the following locations:

| Formalism | Path |
|---|---|
| DM | `/proj/irtg/sempardata/ACL2019/SemEval/2015/DM/` |
| PAS | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PAS/` |
| PSD | `/proj/irtg/sempardata/ACL2019/SemEval/2015/PSD/` |
| EDS | `/proj/irtg/sempardata/ACL2019/EDS/` |
| AMR 2017 | `/proj/irtg/sempardata/ACL2019/AMR/2017/` |
| AMR 2015 | `/proj/irtg/sempardata/ACL2019/AMR/2015/` |

MRP shared task file locations

The MRP data is all contained in /proj/irtg/sempardata/mrp.

  • The subdirectory LDC2019E45 contains the original data as distributed by the organizers, the companion data, and a few preprocessed files; see the README.txt in LDC2019E45.
  • The subdirectory eval contains the test files plus companion data.
  • The subdirectory amconll contains all the files you need to train the parser:
| Formalism | Name of decomposition | Training file | Comment |
|---|---|---|---|
| DM | - | `/proj/irtg/sempardata/mrp/amconll/DM/train/train.amconll` | |
| PSD | - | `/proj/irtg/sempardata/mrp/amconll/PSD/train/train.amconll` | |
| EDS | - | `/proj/irtg/sempardata/mrp/amconll/EDS/train/train.amconll` | |
| AMR | clean_decomp | `/proj/irtg/sempardata/mrp/amconll/AMR/clean_decomp/train/train.amconll` | data used in submitted version, no extensive WordNet use, no CoreNLP |
| AMR | after-mrp-wn-stanf | `/proj/irtg/sempardata/mrp/amconll/AMR/after-mrp-wn-stanf/train/train.amconll` | version used in paper, labeled "improved + WordNet/Stanford" |
| UCCA | very_first | `/proj/irtg/sempardata/mrp/amconll/UCCA/very_first/train/train.amconll` | submitted version |
| UCCA | af_no_remote | `/proj/irtg/sempardata/mrp/amconll/UCCA/af_no_remote/train/train.amconll` | "improved version" in paper, no remote edges |
| UCCA | af_remote | `/proj/irtg/sempardata/mrp/amconll/UCCA/af_remote/train/train.amconll` | not submitted; same as af_no_remote but with remote edges kept |

All folders for a graphbank share the same structure:

  • train contains the train.amconll file.
  • dev contains an empty amconll file (only sentences, no AM dependency trees) covering the entire dev set; the file dev.mrp contains the corresponding gold graphs.
  • gold-dev contains the AM dependency trees for the subset of the dev set that our heuristics could decompose, along with a corresponding file of gold graphs.
  • test contains an empty amconll file with the test sentences.

In the case of UCCA, there is also another folder containing the output of the (Python) preprocessing, which is used as input to the decomposition.
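The folder structure above can be summarized in a small helper that builds the expected paths for one graphbank (a sketch; the function name `graphbank_files` is made up):

```python
# Sketch of the per-graphbank folder layout described above.
from pathlib import Path

MRP_AMCONLL = Path("/proj/irtg/sempardata/mrp/amconll")

def graphbank_files(formalism, decomposition=None):
    """Expected files/folders for one graphbank, per the structure above."""
    root = MRP_AMCONLL / formalism
    if decomposition:  # AMR and UCCA have named decompositions
        root = root / decomposition
    return {
        "train": root / "train" / "train.amconll",
        "dev": root / "dev",            # empty amconll + dev.mrp gold graphs
        "gold-dev": root / "gold-dev",  # decomposable subset with AM dep. trees
        "test": root / "test",          # empty amconll with test sentences
    }

# e.g. the training file for the submitted AMR version:
print(graphbank_files("AMR", "clean_decomp")["train"])
```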

Files used in papers

  • Groschwitz et al. 2018: Original files are now at /local/jonasg/amrtagging/ on falken-3. @jgroschwitz also has a local backup copy.