Add parsing model #10

ivyleavedtoadflax · 2020-03-09T19:19:30Z

What does this PR contain

Adds the python -m deep_reference_parser parse command
Renames the predict command to python -m deep_reference_parser split
Bumps version to 2020.3.0
Improves clarity of CLI output by suppressing DeprecationWarnings caused by using keras and CRF from keras_contrib. It is currently non-trivial to fix these warnings.

How can you test it

# Create a test file:

cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53. 
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540. 
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv

virtualenv temp && source temp/bin/activate
pip install git+git://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/parsing

# Run the splitter

python -m deep_reference_parser split  "$(cat references.txt)"

# Run the parser

python -m deep_reference_parser parse "$(cat references.txt)"

If --output|-o is set then the output will be written to a json file

nsorros · 2020-03-11T09:54:36Z

virtualenv temp && source temp/bin/activate should be virtualenv -p python3 temp && source temp/bin/activate for those of us that have not updated their default python to 3

tests/prodigy/test_prodigy_entrypoints.py

deep_reference_parser/split.py

deep_reference_parser/parse.py

nsorros

Code looks good. Just waiting to run the examples. Takes some time to download the weights for some reason.

nsorros · 2020-03-11T10:26:59Z

What does this PR contain

Adds the python -m deep_reference_parser parse command
Renames the predict command to python -m deep_reference_parser split
Bumps version to 2020.3.0
Improves clarity of CLI output by suppressing DeprecationWarnings caused by using keras and CRF from keras_contrib. It is currently non-trivial to fix these warnings.

How can you test it

# Create a test file:

cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53. 
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540. 
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv

virtualenv temp && source temp/bin/activate
pip install git+git://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/parsing

# Run the splitter

python -m deep_reference_parser split  "$(cat references.txt)"

# Run the parser

python -m deep_reference_parser parse "$(cat references.txt)"

Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user. Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line. Also for parse maybe a JSON or at least the tokens grouped in their categories?

ivyleavedtoadflax · 2020-03-11T13:16:24Z

Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user.

Weights are cached, it will only download if you don't have them in the right place which is specified in the config you are using. If you didn't specify a config, it uses the default ones which ship with the package. Could certainly be more clear about where it is looking for these files though.

Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line.

This is the default verbose output. For tokens if you specify --output <file> you will get one token per line. Let me check about references though, ideally this would be one per line too, but I can't remember if I implemented it

More generally I'm strongly considering moving away from the tsv format used by Rodrigues and presenting labels and tokens in separate files with one token/label per line. This would remove a lot of duplicated code.

Also for parse maybe a JSON or at least the tokens grouped in their categories?

Yes agree that output is sub-optimal. I'd been putting this off until the multi-task model was ready so that I could combine the splitter output with the parser output, so that we will have an idea of where a references starts and finishes.

ivyleavedtoadflax · 2020-03-11T13:17:55Z

virtualenv temp && source temp/bin/activate should be virtualenv -p python3 temp && source temp/bin/activate for those of us that have not updated their default python to 3

👍

ivyleavedtoadflax · 2020-03-11T13:26:10Z

Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line.

This is the default verbose output. For tokens if you specify --output <file> you will get one token per line. Let me check about references though, ideally this would be one per line too, but I can't remember if I implemented it

Ah ok yes I remember now. So the output when --outfile is specified the file written is a valid json. Because the tokens need to be a json (they have both label and token) I'd like to keep the references in json too until we come to a decision about the tsv question. If we decide to split tokens and labels into two separate files (which is what this project does btw - which was used in Rodrigues' implementation) then lets move to a line delimited text file format for both references, tokens, and labels

ivyleavedtoadflax · 2020-03-11T13:38:11Z

Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user.

Weights are cached, it will only download if you don't have them in the right place which is specified in the config you are using. If you didn't specify a config, it uses the default ones which ship with the package. Could certainly be more clear about where it is looking for these files though.

I've added some additional feedback on where it will look for the artefacts:

ℹ Using config file:                                   
/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/configs/2020.3.2_parsing.ini
ℹ Attempting to download model artefacts if they are not found locally                                                                                
in models/parsing/2020.3.2_parsing/. This may take some time...                                                 
✔ Found models/parsing/2020.3.2_parsing/char2ind.pickle                                                                                               
✔ Found models/parsing/2020.3.2_parsing/ind2label.pickle                           
✔ Found models/parsing/2020.3.2_parsing/ind2word.pickle                                                                                               
✔ Found models/parsing/2020.3.2_parsing/label2ind.pickle                                                         
✔ Found models/parsing/2020.3.2_parsing/maxes.pickle                                                                                                  
✔ Found models/parsing/2020.3.2_parsing/weights.h5                                 
✔ Found models/parsing/2020.3.2_parsing/word2ind.pickle                                                                                               
✔ Found embeddings/2020.1.1-wellcome-embeddings-300.txt

ivyleavedtoadflax added 5 commits March 9, 2020 15:57

chg: Move predict to split command

eb966ce

new: Add pars command

ae2d760

chg: linting 💄

b1e6e5b

chg: Update tests

6aaccd1

chg: Update README.md

d11d633

ivyleavedtoadflax force-pushed the feature/ivyleavedtoadflax/parsing branch from 795d2b5 to d11d633 Compare March 9, 2020 19:23

ivyleavedtoadflax added 7 commits March 9, 2020 20:40

new: Squash tf and keras warnings

063e22a

chg: Reduce verbosity of keras logging

81479bb

chg: Default behaviour of split and parse is verbose

c9e0321

If --output|-o is set then the output will be written to a json file

chg: Demote write_pickle logs to debug

4bcf002

chg: Update Makefile data recipe

e1cd8b6

new: Add CHANGELOG.md

90b3a47

chg: Update README.md

f6e85c2

ivyleavedtoadflax changed the title ~~Feature/ivyleavedtoadflax/parsing~~ Add parsing model Mar 10, 2020

ivyleavedtoadflax self-assigned this Mar 10, 2020

ivyleavedtoadflax marked this pull request as ready for review March 10, 2020 21:36

ivyleavedtoadflax requested review from aCampello, lizgzil and nsorros March 10, 2020 21:36

chg: Update README.md

360e3f8

nsorros reviewed Mar 11, 2020

View reviewed changes

tests/prodigy/test_prodigy_entrypoints.py Outdated Show resolved Hide resolved

nsorros reviewed Mar 11, 2020

View reviewed changes

deep_reference_parser/split.py Show resolved Hide resolved

nsorros reviewed Mar 11, 2020

View reviewed changes

deep_reference_parser/parse.py Show resolved Hide resolved

nsorros reviewed Mar 11, 2020

View reviewed changes

nsorros approved these changes Mar 11, 2020

View reviewed changes

new: Feedback to console on artefact paths

8ea7462

ivyleavedtoadflax added 3 commits March 11, 2020 12:41

fix: Typo in numbered_reference_annotator

c6aee2a

new: Add prodigy entrypoint tests

2e0bf61

chg: linting with black 💄⬛

b16a1a1

ivyleavedtoadflax merged commit d899489 into master Mar 11, 2020

ivyleavedtoadflax deleted the feature/ivyleavedtoadflax/parsing branch March 11, 2020 15:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add parsing model #10

Add parsing model #10

Uh oh!

ivyleavedtoadflax commented Mar 9, 2020 •

edited

Loading

Uh oh!

nsorros commented Mar 11, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nsorros left a comment

Uh oh!

nsorros commented Mar 11, 2020 •

edited

Loading

What does this PR contain

How can you test it

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020 •

edited

Loading

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add parsing model #10

Add parsing model #10

Uh oh!

Conversation

ivyleavedtoadflax commented Mar 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR contain

How can you test it

Uh oh!

nsorros commented Mar 11, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nsorros left a comment

Choose a reason for hiding this comment

Uh oh!

nsorros commented Mar 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR contain

How can you test it

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ivyleavedtoadflax commented Mar 11, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ivyleavedtoadflax commented Mar 9, 2020 •

edited

Loading

nsorros commented Mar 11, 2020 •

edited

Loading

ivyleavedtoadflax commented Mar 11, 2020 •

edited

Loading