# Constituency Parsing

The command for training a `crf` constituency parser is simple.
We follow the instructions of [Benepar](https://github.com/nikitakit/self-attentive-parser) to preprocess the data.

To train a BiLSTM-based model:
```sh
$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-en -p model -f char --mbr \
    --train ptb/train.pid \
    --dev ptb/dev.pid \
    --test ptb/test.pid \
    --embed glove-6b-100
```

To finetune [`roberta-large`](https://huggingface.co/roberta-large):
```sh
$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-roberta-en -p model \
    --train ptb/train.pid \
    --dev ptb/dev.pid \
    --test ptb/test.pid \
    --encoder=bert \
    --bert=roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```

The command for finetuning [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large) on the merged treebanks of 9 languages in the SPMRL dataset is:
```sh
$ python -u -m supar.cmds.const.crf train -b -d 0 -c con-crf-xlmr -p model \
    --train spmrl/train.pid \
    --dev spmrl/dev.pid \
    --test spmrl/test.pid \
    --encoder=bert \
    --bert=xlm-roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```

Unlike the conventional evaluation manner of executing `EVALB`, we internally integrate Python code for constituency tree evaluation.
As different treebanks do not share the same evaluation parameters, it is recommended to evaluate the results in interactive mode.

To evaluate English and Chinese models:
```py
>>> Parser.load('con-crf-en').evaluate('ptb/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.21318972731630007, UCM: 50.08% LCM: 47.56% UP: 94.89% UR: 94.71% UF: 94.80% LP: 94.16% LR: 93.98% LF: 94.07%)
>>> Parser.load('con-crf-zh').evaluate('ctb7/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.3994724107416053, UCM: 24.96% LCM: 23.39% UP: 90.88% UR: 90.47% UF: 90.68% LP: 88.82% LR: 88.42% LF: 88.62%)
```

To evaluate the multilingual model:
```py
>>> Parser.load('con-crf-xlmr').evaluate('spmrl/eu/test.pid',
                                         delete={'TOP', 'ROOT', 'S1', '-NONE-', 'VROOT'},
                                         equal={},
                                         verbose=False)
(0.45620645582675934, UCM: 53.07% LCM: 48.10% UP: 94.74% UR: 95.53% UF: 95.14% LP: 93.29% LR: 94.07% LF: 93.68%)
```
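
Besides evaluation, a loaded parser can be used for prediction in the same interactive mode. A minimal sketch, assuming the standard `Parser.predict` interface that pairs with `Parser.load`; the example sentence is illustrative only:
```py
>>> parser = Parser.load('con-crf-en')
>>> # a pre-tokenized sentence; a path to a .pid file also works for batch prediction
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], verbose=False)
>>> print(dataset[0])  # the predicted bracketed tree for the first sentence
```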

# Dependency Parsing

Below are examples of training `biaffine` and `crf2o` dependency parsers on PTB.

```sh
# biaffine
$ python -u -m supar.cmds.dep.biaffine train -b -d 0 -c dep-biaffine-en -p model -f char \
    --train ptb/train.conllx \
    --dev ptb/dev.conllx \
    --test ptb/test.conllx \
    --embed glove-6b-100
# crf2o
$ python -u -m supar.cmds.dep.crf2o train -b -d 0 -c dep-crf2o-en -p model -f char \
    --train ptb/train.conllx \
    --dev ptb/dev.conllx \
    --test ptb/test.conllx \
    --embed glove-6b-100 \
    --mbr \
    --proj
```
The option `-c` controls where to load predefined configs from: you can specify either a local file path or the same short name as a pretrained model.
For CRF models, you ***must*** specify `--proj` to remove non-projective trees.
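
Projectivity means that no two dependency arcs cross when drawn above the sentence. Below is a minimal sketch of that check (an illustration only, not SuPar's internal code); `heads` is assumed to give each word's 1-based head position, with 0 for the root:
```py
def is_projective(heads):
    """heads[i] is the head of word i+1 (1-based positions); 0 denotes the root."""
    arcs = [(min(d, h), max(d, h)) for d, h in enumerate(heads, 1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # two arcs cross iff each contains exactly one endpoint of the other
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective([2, 0, 2]))     # True: root -> 2, 2 -> 1, 2 -> 3
print(is_projective([3, 0, 2, 3]))  # False: the arc 1 <- 3 crosses the root arc of word 2
```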

Specifying `--mbr` to perform MBR decoding often leads to consistent improvements.
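
As background (our gloss of the standard recipe for tree CRFs; the exact objective is not spelled out in this doc), MBR decoding first computes the marginal probability of every arc and then decodes the tree with the greatest total marginal mass, rather than the single highest-scoring tree:
```latex
\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}(x)} \sum_{(h, d) \in y} p\big((h, d) \mid x\big)
```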

The model trained by finetuning [`roberta-large`](https://huggingface.co/roberta-large) achieves nearly state-of-the-art performance in English dependency parsing.
Here we provide some recommended hyper-parameters (not the best, but good enough).
You can set values of registered/unregistered parameters on the command line to override the default configs in the file.
```sh
$ python -u -m supar.cmds.dep.biaffine train -b -d 0 -c dep-biaffine-roberta-en -p model \
    --train ptb/train.conllx \
    --dev ptb/dev.conllx \
    --test ptb/test.conllx \
    --encoder=bert \
    --bert=roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```
The pretrained multilingual model `dep-biaffine-xlmr` is finetuned on [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large).
The training command is:
```sh
$ python -u -m supar.cmds.dep.biaffine train -b -d 0 -c dep-biaffine-xlmr -p model \
    --train ud2.3/train.conllx \
    --dev ud2.3/dev.conllx \
    --test ud2.3/test.conllx \
    --encoder=bert \
    --bert=xlm-roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```

To evaluate:
```sh
# biaffine
$ python -u -m supar.cmds.dep.biaffine evaluate -d 0 -p dep-biaffine-en --data ptb/test.conllx --tree --proj
# crf2o
$ python -u -m supar.cmds.dep.crf2o evaluate -d 0 -p dep-crf2o-en --data ptb/test.conllx --mbr --tree --proj
```
`--tree` and `--proj` ensure that the output trees are well-formed and projective, respectively.
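
Evaluation also works in interactive mode, mirroring the constituency examples above. A minimal sketch, assuming `evaluate` returns a `(loss, metric)` pair as in the outputs shown earlier and that `--tree`/`--proj` are likewise accepted as keyword overrides:
```py
>>> loss, metric = Parser.load('dep-biaffine-en').evaluate('ptb/test.conllx',
                                                           tree=True,
                                                           proj=True,
                                                           verbose=False)
>>> print(metric)  # attachment scores on the test set
```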

The commands for training and evaluating Chinese models are similar, except that you need to specify `--punct` to include punctuation in the evaluation.
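
For instance, a Chinese evaluation in interactive mode might look like the sketch below; the model name `dep-biaffine-zh` and the `ctb7` path are illustrative, and `punct=True` is assumed to be accepted as a keyword override like the flag above:
```py
>>> loss, metric = Parser.load('dep-biaffine-zh').evaluate('ctb7/test.conllx',
                                                           punct=True,
                                                           tree=True,
                                                           verbose=False)
```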