Vine is a neural graph-based verbal MWE identification tool built on top of the backprop library.
First you will need to download and install the Haskell Tool Stack. Then run the following commands:
git clone https://github.com/kawu/vine.git
cd vine
stack install
The last of these commands builds the vine command-line tool and (on Linux) places it in
the ~/.local/bin/ directory by default.
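If the vine executable is not found after installation, make sure that this directory is on your PATH, e.g. (assuming a bash-like shell):
export PATH=$HOME/.local/bin:$PATH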
Vine works with the PARSEME .cupt (extended CoNLL-U) format.
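For orientation, a .cupt file consists of the 10 tab-separated CoNLL-U columns plus an additional PARSEME:MWE column. In the simplified, hand-crafted fragment below, token 2 opens VMWE number 1 of the category LVC.full and token 4 continues it:
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE
1   She        she       PRON   _  _  2  nsubj  _  _  *
2   made       make      VERB   _  _  0  root   _  _  1:LVC.full
3   a          a         DET    _  _  4  det    _  _  *
4   decision   decision  NOUN   _  _  2  obj    _  _  1
5   .          .         PUNCT  _  _  2  punct  _  _  *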
Before you can use the tool to identify verbal MWEs (VMWEs) in a given language, you will need to train the appropriate identification models. To this end, you will need:
- A training file with VMWE annotations in the .cupt format
- A file with the word embedding vectors corresponding to the individual words in the training file
You can find an example training set in the example directory.
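The exact layout of the embedding file can be inspected in the example directory as well; as a rough, non-authoritative sketch, you can think of it as one word per line followed by its vector components (in the spirit of fastText's textual .vec output), e.g.:
she 0.012 -0.153 0.087 ...
made 0.201 0.033 -0.140 ...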
To train a model, you need to specify the training configuration, which includes the SGD parameters and the type of the objective function to optimize. The default configuration, written in Dhall, can be found in config/config.dhall.
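For orientation only, a Dhall configuration file is basically a record of (possibly nested) values; the sketch below is purely illustrative and its field names are hypothetical, so refer to config/config.dhall for the actual schema:
{ sgd = { iterNum = 60, batchSize = 30, learningRate = 0.01 }
, objective = "global"
}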
Vine requires training one model per VMWE category (VID -- verbal idiom,
LVC.full -- light-verb construction, etc.). See the PARSEME annotation
guidelines for a description of the different
categories employed in the PARSEME annotation framework.
Given the following:
- A training file train.cupt in the .cupt format
- A file train.embs with the corresponding training word embeddings
- A training configuration config.dhall
You can use the following command to train a model for the identification of light-verb constructions:
vine train -i train.cupt -e train.embs --mwe LVC.full --sgd config.dhall -o models/LVC.full.bin
where the -o option specifies the output model file. The tool also saves the
intermediate model files (LVC.full.bin.N, where N is the iteration number) in
the target models directory, which you have to create manually before
training.
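For instance, before the first training run you can simply do:
mkdir -p models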
While training, the tool reports the value of the objective function, which
should reach a relatively low level at the end of the process. In the case of
the example training set, the final objective value should be very close to 0
(provided that you use the training configuration present in the example
directory), which represents a perfect fit of the model to the training data
(modulo VMWE encoding errors).
You can use the GHC runtime system options to speed up training.
For instance, you can make use of multiple cores using -N, or increase the
allocation area size using -A. For example, to train the model using four
threads and a 256M allocation area, run:
vine train -i train.cupt -e train.embs --mwe LVC.full --sgd config.dhall -o models/LVC.full.bin +RTS -A256M -N4
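Since one model per VMWE category is needed (see above), you may want to script the training over all the categories annotated in your data. The following bash sketch is purely illustrative; in particular, the category list should be adapted to your training file:
mkdir -p models
for cat in VID LVC.full LVC.cause IRV; do
  vine train -i train.cupt -e train.embs --mwe "$cat" --sgd config.dhall -o models/"$cat".bin
done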
Use the following command to tag a given test.blind.cupt file (blind is
used here to mark a file with no VMWE annotations) with light-verb
constructions, using an LVC.full.bin model file:
vine tag --constrained -t LVC.full -m LVC.full.bin -i test.blind.cupt -e test.embs
where test.embs is a file with the embedding vectors corresponding to the
words in test.blind.cupt, and the --constrained (-c) option specifies
that constrained prediction is to be used.
The tag command does not erase the VMWE annotations present in the input
file. If you want to tag a file already containing VMWE annotations, you can
clear it first:
vine clear < test.cupt > test.blind.cupt
The quality of the identification models can vary between training runs. To counteract this, you can train several models and use them together for tagging, for instance:
vine tag -c -t LVC.full -m LVC.full.1.bin -m LVC.full.2.bin -m LVC.full.3.bin -i test.blind.cupt -e test.embs
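The several model files used above can be obtained simply by repeating the training, e.g. (again, just a sketch):
for i in 1 2 3; do
  vine train -i train.cupt -e train.embs --mwe LVC.full --sgd config.dhall -o LVC.full."$i".bin
done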
Vine was evaluated on the PARSEME shared task 1.1 corpus, as described in [1]. We provide the resulting trained models as well as the model predictions for the three datasets we experimented with (DE, FR, PL). We also provide the fastText embeddings for the German dataset, in case you would like to try training German models yourself.
Vine outperforms the best systems submitted to the PARSEME shared task 1.1 on the three datasets we experimented with (when trained on both training and development data, as the other systems were).
Vine achieves overall performance similar to ATILF, although the two systems make different tradeoffs. In particular, Vine is better at handling discontinuous and not-seen-in-training VMWE instances, while ATILF is better with continuous and seen-in-training VMWE occurrences.
The results achieved by Vine are a bit lower than those reported in Bridging the Gap: Attending to Discontinuity in Identification of Multiword Expressions (see the poster for comparison). However, Vine is conceptually simpler (no self-attention, no contextualized embeddings, no BiLSTM).
[1] Jakub Waszczuk, Rafael Ehren, Regina Stodden, Laura Kallmeyer, A Neural Graph-based Approach to Verbal MWE Identification, Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), Florence, Italy, 2019 (https://www.aclweb.org/anthology/W19-5113, poster).