Skip to content

Conversation

@ivyleavedtoadflax
Copy link
Contributor

@ivyleavedtoadflax ivyleavedtoadflax commented Mar 9, 2020

What does this PR contain

  • Adds the python -m deep_reference_parser parse command
  • Renames the predict command to python -m deep_reference_parser split
  • Bumps version to 2020.3.0
  • Improves clarity of CLI output by suppressing DeprecationWarnings caused by using keras and CRF from keras_contrib. It is currently non-trivial to fix these warnings.

How can you test it

# Create a test file:

cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53. 
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540. 
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv

virtualenv temp && source temp/bin/activate
pip install git+git://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/parsing

# Run the splitter

python -m deep_reference_parser split  "$(cat references.txt)"

# Run the parser

python -m deep_reference_parser parse "$(cat references.txt)"

@ivyleavedtoadflax ivyleavedtoadflax force-pushed the feature/ivyleavedtoadflax/parsing branch from 795d2b5 to d11d633 Compare March 9, 2020 19:23
@ivyleavedtoadflax ivyleavedtoadflax changed the title Feature/ivyleavedtoadflax/parsing Add parsing model Mar 10, 2020
@ivyleavedtoadflax ivyleavedtoadflax self-assigned this Mar 10, 2020
@ivyleavedtoadflax ivyleavedtoadflax marked this pull request as ready for review March 10, 2020 21:36
@nsorros
Copy link

nsorros commented Mar 11, 2020

virtualenv temp && source temp/bin/activate should be virtualenv -p python3 temp && source temp/bin/activate for those of us that have not updated their default python to 3

Copy link

@nsorros nsorros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. Just waiting to run the examples. Takes some time to download the weights for some reason.

@nsorros
Copy link

nsorros commented Mar 11, 2020

What does this PR contain

  • Adds the python -m deep_reference_parser parse command
  • Renames the predict command to python -m deep_reference_parser split
  • Bumps version to 2020.3.0
  • Improves clarity of CLI output by suppressing DeprecationWarnings caused by using keras and CRF from keras_contrib. It is currently non-trivial to fix these warnings.

How can you test it

# Create a test file:

cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53. 
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540. 
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv

virtualenv temp && source temp/bin/activate
pip install git+git://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/parsing

# Run the splitter

python -m deep_reference_parser split  "$(cat references.txt)"

# Run the parser

python -m deep_reference_parser parse "$(cat references.txt)"

Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user. Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line. Also for parse maybe a JSON or at least the tokens grouped in their categories?

@ivyleavedtoadflax
Copy link
Contributor Author

Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user.

Weights are cached, it will only download if you don't have them in the right place which is specified in the config you are using. If you didn't specify a config, it uses the default ones which ship with the package. Could certainly be more clear about where it is looking for these files though.

Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line.

This is the default verbose output. For tokens if you specify --output <file> you will get one token per line. Let me check about references though, ideally this would be one per line too, but I can't remember if I implemented it

More generally I'm strongly considering moving away from the tsv format used by Rodrigues and presenting labels and tokens in separate files with one token/label per line. This would remove a lot of duplicated code.

Also for parse maybe a JSON or at least the tokens grouped in their categories?

Yes agree that output is sub-optimal. I'd been putting this off until the multi-task model was ready so that I could combine the splitter output with the parser output, so that we will have an idea of where a references starts and finishes.

@ivyleavedtoadflax
Copy link
Contributor Author

virtualenv temp && source temp/bin/activate should be virtualenv -p python3 temp && source temp/bin/activate for those of us that have not updated their default python to 3

👍

@ivyleavedtoadflax
Copy link
Contributor Author

ivyleavedtoadflax commented Mar 11, 2020

Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line.

This is the default verbose output. For tokens if you specify --output <file> you will get one token per line. Let me check about references though, ideally this would be one per line too, but I can't remember if I implemented it

Ah ok yes I remember now. So the output when --outfile is specified the file written is a valid json. Because the tokens need to be a json (they have both label and token) I'd like to keep the references in json too until we come to a decision about the tsv question. If we decide to split tokens and labels into two separate files (which is what this project does btw - which was used in Rodrigues' implementation) then lets move to a line delimited text file format for both references, tokens, and labels

@ivyleavedtoadflax
Copy link
Contributor Author

Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user.

Weights are cached, it will only download if you don't have them in the right place which is specified in the config you are using. If you didn't specify a config, it uses the default ones which ship with the package. Could certainly be more clear about where it is looking for these files though.

I've added some additional feedback on where it will look for the artefacts:

ℹ Using config file:                                   
/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/configs/2020.3.2_parsing.ini
ℹ Attempting to download model artefacts if they are not found locally                                                                                
in models/parsing/2020.3.2_parsing/. This may take some time...                                                 
✔ Found models/parsing/2020.3.2_parsing/char2ind.pickle                                                                                               
✔ Found models/parsing/2020.3.2_parsing/ind2label.pickle                           
✔ Found models/parsing/2020.3.2_parsing/ind2word.pickle                                                                                               
✔ Found models/parsing/2020.3.2_parsing/label2ind.pickle                                                         
✔ Found models/parsing/2020.3.2_parsing/maxes.pickle                                                                                                  
✔ Found models/parsing/2020.3.2_parsing/weights.h5                                 
✔ Found models/parsing/2020.3.2_parsing/word2ind.pickle                                                                                               
✔ Found embeddings/2020.1.1-wellcome-embeddings-300.txt 

@ivyleavedtoadflax ivyleavedtoadflax merged commit d899489 into master Mar 11, 2020
@ivyleavedtoadflax ivyleavedtoadflax deleted the feature/ivyleavedtoadflax/parsing branch March 11, 2020 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants