- Requirements
- GloVe
- Data Format
- Train a Parser
- Structure of the Code
- Jackknife POS Tagging
- Run a pre-trained TAG Parser
- Run demo
- Notes
## Requirements

TensorFlow needs to be installed before running the training script. TensorFlow 1.0.0 or higher is supported (1.3.0 and 1.7.0 have been tested).
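For example, one way to install a tested version and confirm it is visible to Python:

```
pip install tensorflow==1.7.0   # or 1.3.0; both have been tested
python -c "import tensorflow as tf; print(tf.__version__)"
```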
## GloVe

Our architecture utilizes pre-trained word embeddings, namely GloVe vectors. Run the following:

```
wget http://nlp.stanford.edu/data/glove.6B.zip
```

and save it to a sub-directory glovevector/.
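For instance (assuming the training script reads the unzipped `.txt` files from `glovevector/`; adjust the paths to whatever your config expects):

```
mkdir -p glovevector
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip -d glovevector/
```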
## Data Format

The biaffine parser takes as input a file in the CoNLL-U+supertag (conllustag) format, in which one column for supertags is appended to the original CoNLL-U columns. See a sample.
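For illustration, a conllustag line is a regular 10-column CoNLL-U line with an 11th supertag column at the end. The supertag IDs below are invented for this sketch; the actual inventory comes from your grammar:

```
1	The	the	DET	DT	_	2	det	_	_	t_81
2	dog	dog	NOUN	NN	_	3	nsubj	_	_	t_3
3	barks	bark	VERB	VBZ	_	0	root	_	_	t_27
4	.	.	PUNCT	.	_	3	punct	_	_	t_13
```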
## Train a Parser

All you need to do is create a new directory for your data in the conllustag format and a JSON file for the model configuration and data information. We provide a sample JSON file for the sample data directory. You can train a parser on the sample data with the following command:

```
python scripts/train_graph_parser.py sample_data/config_demo.json
```
After running this command, you should see the following files and directories in sample_data/:
| Directory/File | Description |
|---|---|
| checkpoint.txt | Contains information about the best model. |
| sents/ | Contains the words in the one-sentence-per-line format. |
| gold_pos/ | Contains the gold POS tags in the one-sentence-per-line format. |
| gold_stag/ | Contains the gold supertags in the one-sentence-per-line format. |
| arcs/ | Contains the gold arcs in the one-sentence-per-line format. |
| rels/ | Contains the gold rels in the one-sentence-per-line format. |
| predicted_arcs/ | Contains the predicted arcs in the one-sentence-per-line format. |
| predicted_rels/ | Contains the predicted rels in the one-sentence-per-line format. |
| Parsing_Models/ | Stores the best model. |
| conllu/sample.conllustag_stag | Contains the predicted supertags in the conllustag format. |
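To make the one-sentence-per-line format concrete: corresponding lines of the sents/ and gold_stag/ files align token by token. The filename placeholders and supertag IDs below are again invented for illustration:

```
# sents/<file>, line 1:
The dog barks .

# gold_stag/<file>, line 1:
t_81 t_3 t_27 t_13
```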
## Structure of the Code

| File | Description |
|---|---|
| utils/preprocessing.py | Contains tools for preprocessing, mainly for tokenizing and indexing words/tags. Gets imported into utils/data_process_secsplit.py. |
| utils/data_process_secsplit.py | Reads training and test data and tokenizes/indexes words, POS tags, supertags, and characters. |
| utils/parsing_model.py | Contains the Parsing_Model class that constructs our LSTM computation graph. The class has the necessary methods for training and testing. Gets imported into graph_parser_model.py. For more details, read the README for utils. |
| utils/lstm.py | Contains the TensorFlow LSTM equations. Gets imported into utils/parsing_model.py. |
| graph_parser_model.py | Contains functions that instantiate the Parsing_Model class and train/test a model. Gets imported into graph_parser_main.py. |
| graph_parser_main.py | Main file for running experiments. Reads model and data options. |
| scripts/train_graph_parser.py | Runs graph_parser_main.py in bash according to the JSON file that gets passed. |
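As a mental model only, the wrapper in scripts/train_graph_parser.py can be pictured as the hypothetical sketch below. The flag names and the way the JSON maps onto them are assumptions; the repository's script is authoritative:

```python
# Hypothetical sketch of a config-driven wrapper around graph_parser_main.py.
# It loads the JSON config and forwards each entry as a command-line flag;
# the real script may name its flags differently.
import json
import subprocess
import sys

with open(sys.argv[1]) as f:
    config = json.load(f)

cmd = ["python", "graph_parser_main.py"]
for key, value in config.items():
    cmd.extend(["--{}".format(key), str(value)])  # assumed flag convention

subprocess.check_call(cmd)
```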
## Run a pre-trained TAG Parser

We provide a pre-trained TAG parser. To run the pre-trained parser on your data, first download the model and place the Pretrained_Parser directory in the demo directory. Then, run:
- With tokenization:

  ```
  python demo/scripts/demo_model.py --infile demo/sents/test.txt --tokenize
  ```

- Without tokenization:

  ```
  python demo/scripts/demo_model.py --infile demo/sents/test.tokenized.txt
  ```
You can replace these files with your own data. The parser prints out predicted elementary trees (supertags), parses, and PTB-style fine-grained POS tags in the CoNLL-U format. Note that we put elementary trees in the UPOS column.
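For example, assuming the tokenized input format is one space-separated sentence per line (our reading of the test.tokenized.txt naming; check the provided sample files to confirm), you could parse your own sentence like this, where my_input.txt is just a filename we chose:

```
printf 'The dog barks .\n' > demo/sents/my_input.txt
python demo/scripts/demo_model.py --infile demo/sents/my_input.txt
```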
## Run demo

Enjoy playing around with our TAG parser demo online!
## Notes

If you use this tool for your research, please consider citing:
```bibtex
@InProceedings{Kasai&al.18,
  author    = {Jungo Kasai and Robert Frank and Pauli Xu and William Merrill and Owen Rambow},
  title     = {End-to-end Graph-based {TAG} Parsing with Neural Networks},
  year      = {2018},
  booktitle = {Proceedings of NAACL},
  publisher = {Association for Computational Linguistics},
}
```