Add posTagFormat parameter for OpenNLPPOSFilter #14194
Test failure:
The test failure reliably reproduces with that seed on my Mac laptop.
This allows users to use either a Penn or UD part-of-speech tagging model, but output tags in the other format. This allows users to combine a Penn POS tagging model with a lemmatizer model trained on UD tags, for example.
The test was fixed. Everything is good after rebasing to latest 👍
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
The changes make sense, however some more documentation would be great on the UD and PENN formats...
@cpoerschke (and anyone else) I haven't done a commit on Lucene in a long time, so I want to get another set of eyes on this. And I need to remember what all is in the workflow as well ;-)
Description
This allows users to use either a Penn or UD part-of-speech tagging model, but output tags in the other format. This allows users to combine a Penn POS tagging model with a lemmatizer model trained on UD tags, for example.
For a quick reference on the two:
The conversion rules are also defined in https://github.com/apache/opennlp/blob/6daacd319b95c5937abca5ef99e24566825fe89f/opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java#L40
This commit also changes the default POSTagFormat to CUSTOM (whereas I previously set it to PENN), which just passes through the tag format from the POSTaggerModel. I believe this is a reasonable default, since new users are likely to use just the new UD models published at https://opennlp.apache.org/models.html, whereas existing users likely have Penn models. Users only need to specify a POSTagFormat if they have a combination of models and need to convert between UD and Penn tag formats (to convert from a POSTaggerModel in one format to a lemmatizer or chunker model in the other format).
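As a rough illustration of the behavior described above, here is a minimal, self-contained sketch of what converting Penn tags to UD versus passing them through looks like. It is not the OpenNLP or Lucene implementation: the `TagFormat` enum, the `convert` method, and the class name are hypothetical, the mapping table is a small illustrative subset, and leaving unknown tags unchanged is an assumption. The authoritative rules are in the POSTagFormatMapper linked above.

```java
import java.util.Map;

public class PosTagFormatSketch {

  // Hypothetical stand-in for the PENN / UD / CUSTOM options discussed in this PR.
  enum TagFormat { PENN, UD, CUSTOM }

  // Illustrative subset of Penn Treebank -> UD mappings (assumption: not OpenNLP's actual table).
  private static final Map<String, String> PENN_TO_UD = Map.ofEntries(
      Map.entry("NN", "NOUN"),
      Map.entry("NNS", "NOUN"),
      Map.entry("NNP", "PROPN"),
      Map.entry("JJ", "ADJ"),
      Map.entry("RB", "ADV"),
      Map.entry("DT", "DET"),
      Map.entry("CC", "CCONJ"),
      Map.entry("CD", "NUM"),
      Map.entry("UH", "INTJ"),
      Map.entry("VB", "VERB"),
      Map.entry("VBD", "VERB"));

  // Converts a tag emitted by a Penn-format tagger into the requested output format.
  static String convert(String pennTag, TagFormat target) {
    if (target == TagFormat.UD) {
      // Assumption: tags missing from the table are returned unchanged.
      return PENN_TO_UD.getOrDefault(pennTag, pennTag);
    }
    // PENN (already in that format) and CUSTOM (pass-through) leave the tag as-is.
    return pennTag;
  }

  public static void main(String[] args) {
    for (String tag : new String[] {"NNS", "JJ", "VBD"}) {
      System.out.println(tag + " -> UD: " + convert(tag, TagFormat.UD)
          + ", CUSTOM: " + convert(tag, TagFormat.CUSTOM));
    }
  }
}
```

Note that several Penn tags collapse onto one UD tag (NN and NNS both become NOUN in this sketch), so a round trip through UD does not recover the original Penn tag. This is one reason a pass-through default is convenient when all models already share a format.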
Currently, the models used for the unit tests all use the Penn tag format. Retraining the models using the UD format can be addressed as part of #13002 (which I may work on next). To verify the downstream consumption of UD tags by another filter, I manually updated the lemmatizer dictionary (a non-binary model) to add UD tags.
Resolves #14188