Add posTagFormat parameter for OpenNLPPOSFilter #14194

Open. msfroh wants to merge 1 commit into main.

Contributor

@msfroh msfroh commented Feb 4, 2025

Description

This allows users to run either a Penn or a UD part-of-speech tagging model but emit tags in the other format. For example, a Penn POS tagging model can be combined with a lemmatizer model trained on UD tags.

For a quick reference on the two:

The conversion rules are also defined in https://github.com/apache/opennlp/blob/6daacd319b95c5937abca5ef99e24566825fe89f/opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java#L40

This commit also changes the default POSTagFormat to CUSTOM (I had previously set it to PENN), which simply passes through the tag format of the POSTaggerModel. I believe this is a reasonable default: new users are likely to use only the new UD models published at https://opennlp.apache.org/models.html, whereas existing users likely have Penn models.

Users only need to specify a POSTagFormat if they have a combination of models and need to convert between UD and Penn tag formats (to convert from a POSTaggerModel in one format to a lemmatizer or chunker model in the other format).
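
To make the pass-through default and the conversion case concrete, here is a minimal Java sketch. It is not the filter's or OpenNLP's implementation: `PosTagFormatSketch`, `TagFormat`, and `convertPennTags` are hypothetical names, the table is a small hand-picked subset of common Penn-to-UD correspondences, and falling back to UD's `X` tag for unmapped input is an assumption. The authoritative rules are in the POSTagFormatMapper linked above.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only; names and fallback behavior are assumptions.
class PosTagFormatSketch {

  // Hypothetical stand-in for the requested output tag format.
  enum TagFormat { UD, PENN, CUSTOM }

  // A small subset of Penn Treebank -> UD correspondences (not the full rule set).
  private static final Map<String, String> PENN_TO_UD = Map.ofEntries(
      Map.entry("NN", "NOUN"),
      Map.entry("NNS", "NOUN"),
      Map.entry("NNP", "PROPN"),
      Map.entry("VB", "VERB"),
      Map.entry("VBD", "VERB"),
      Map.entry("VBZ", "VERB"),
      Map.entry("JJ", "ADJ"),
      Map.entry("RB", "ADV"),
      Map.entry("DT", "DET"),
      Map.entry("CD", "NUM"),
      Map.entry("PRP", "PRON"),
      Map.entry("CC", "CCONJ"),
      Map.entry("UH", "INTJ"));

  // Converts tags emitted by a Penn model to the requested output format.
  static List<String> convertPennTags(List<String> pennTags, TagFormat requested) {
    if (requested != TagFormat.UD) {
      // CUSTOM passes the model's tags through; PENN is already the model's format.
      return pennTags;
    }
    return pennTags.stream()
        // Assumption: unmapped tags fall back to "X", UD's "other" category.
        .map(tag -> PENN_TO_UD.getOrDefault(tag, "X"))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> tags = List.of("DT", "NN", "VBZ", "JJ");
    System.out.println(convertPennTags(tags, TagFormat.CUSTOM));
    System.out.println(convertPennTags(tags, TagFormat.UD));
  }
}
```

With `TagFormat.CUSTOM` the tags come back unchanged, which mirrors the new default; only an explicit `UD` request triggers a lookup.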

Currently, the models used for the unit tests all use the Penn tag format. Retraining the models using the UD format can be addressed as part of #13002 (which I may work on next). To verify the downstream consumption of UD tags by another filter, I manually updated the lemmatizer dictionary (a non-binary model) to add UD tags.
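
Why a format mismatch matters downstream can be sketched with a dictionary-style lemmatizer, which looks up a lemma by word and tag together. The sketch below assumes a tab-separated `word`, `tag`, `lemma` line format and uses hypothetical names (`LemmaDictionarySketch`, `lemmatize`); it is not the OpenNLP or Lucene implementation, and the sample entries are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: a lemma lookup keyed on (word, tag).
class LemmaDictionarySketch {

  private final Map<String, String> lemmas = new HashMap<>();

  // Assumption: each line is "word<TAB>tag<TAB>lemma"; other lines are skipped.
  LemmaDictionarySketch(String dictionaryText) {
    for (String line : dictionaryText.split("\n")) {
      String[] fields = line.split("\t");
      if (fields.length == 3) {
        lemmas.put(fields[0] + "\t" + fields[1], fields[2]);
      }
    }
  }

  // Returns the lemma only if the dictionary has an entry for this exact word and tag.
  Optional<String> lemmatize(String word, String tag) {
    return Optional.ofNullable(lemmas.get(word + "\t" + tag));
  }

  public static void main(String[] args) {
    // Dictionary containing only a Penn-tagged entry.
    LemmaDictionarySketch pennOnly = new LemmaDictionarySketch("boxes\tNNS\tbox");
    System.out.println(pennOnly.lemmatize("boxes", "NNS").isPresent());  // Penn tag matches
    System.out.println(pennOnly.lemmatize("boxes", "NOUN").isPresent()); // UD tag finds nothing

    // Adding a UD-tagged entry lets the same word resolve under either tag set.
    LemmaDictionarySketch both =
        new LemmaDictionarySketch("boxes\tNNS\tbox\nboxes\tNOUN\tbox");
    System.out.println(both.lemmatize("boxes", "NOUN").orElse("(no lemma)"));
  }
}
```

A tagger that emits one tag set feeding a dictionary keyed on the other finds no entries, which is the situation the tag format conversion is meant to avoid.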

Resolves #14188

@msfroh msfroh force-pushed the opennlp_ud_postags branch 2 times, most recently from 63371f4 to e1be5d5 Compare February 4, 2025 18:47
Contributor Author

msfroh commented Feb 4, 2025

Test failure:

Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.search.TestSeededKnnByteVectorQuery.testSeedWithTimeout" -Ptests.jvms=1 -Ptests.jvmargs= -Ptests.seed=533751C89ADE5499 -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=256 -Ptests.forceintegervectors=true

The test failure reliably reproduces with that seed on my Mac laptop.

@msfroh msfroh force-pushed the opennlp_ud_postags branch from e1be5d5 to fc646dc Compare February 10, 2025 18:07
Contributor Author

msfroh commented Feb 10, 2025

> Test failure:
>
> Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.search.TestSeededKnnByteVectorQuery.testSeedWithTimeout" -Ptests.jvms=1 -Ptests.jvmargs= -Ptests.seed=533751C89ADE5499 -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=256 -Ptests.forceintegervectors=true
>
> The test failure reliably reproduces with that seed on my Mac laptop.

The test was fixed. Everything is good after rebasing to latest 👍

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Feb 25, 2025
Contributor

@epugh epugh left a comment

The changes make sense; however, some more documentation on the UD and PENN formats would be great.

@github-actions github-actions bot removed the Stale label Feb 27, 2025
Contributor

epugh commented Feb 27, 2025

@cpoerschke (and anyone else) I haven't done a commit on Lucene in a long time, so I want to get another set of eyes on this. I also need to remember everything that is in the workflow ;-)

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 14, 2025
Successfully merging this pull request may close these issues.

Benefit from OpenNLP's new UD models