Add posTagFormat parameter for OpenNLPPOSFilter #14194

msfroh · 2025-02-04T17:41:22Z

Description

This allows users to use either a Penn or UD part-of-speech tagging model but output tags in the other format. For example, users can combine a Penn POS tagging model with a lemmatizer model trained on UD tags.

For a quick reference on the two:

The conversion rules are also defined in https://github.com/apache/opennlp/blob/6daacd319b95c5937abca5ef99e24566825fe89f/opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java#L40

This commit also changes the default POSTagFormat to CUSTOM (I previously set it to PENN), which passes the tags from the POSTaggerModel through unchanged. I believe this is a reasonable default, since new users are likely to use only the new UD models published at https://opennlp.apache.org/models.html, whereas existing users likely have Penn models.

Users only need to specify a POSTagFormat if they have a combination of models and need to convert between UD and Penn tag formats (to convert from a POSTaggerModel in one format to a lemmatizer or chunker model in the other format).
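To make the pass-through versus conversion behaviour concrete, here is a minimal, self-contained sketch. It is not the code in this PR and not OpenNLP's POSTagFormatMapper: the class `PosTagFormatSketch`, the enum `TagFormat`, and the method `convert` are hypothetical names, the model's format is passed in explicitly, and the table covers only a handful of well-known Penn-to-UD correspondences. Unknown tags fall back to UD's `X` ("other") category, which is an assumption of this sketch.

```java
import java.util.Map;

final class PosTagFormatSketch {
  /** Hypothetical enum mirroring the three format names mentioned in this PR. */
  enum TagFormat { PENN, UD, CUSTOM }

  /** Partial, illustrative Penn-to-UD table; NOT the full OpenNLP rule set. */
  private static final Map<String, String> PENN_TO_UD = Map.ofEntries(
      Map.entry("NN", "NOUN"),
      Map.entry("NNS", "NOUN"),
      Map.entry("NNP", "PROPN"),
      Map.entry("VB", "VERB"),
      Map.entry("VBD", "VERB"),
      Map.entry("JJ", "ADJ"),
      Map.entry("RB", "ADV"),
      Map.entry("DT", "DET"),
      Map.entry("PRP", "PRON"),
      Map.entry("CD", "NUM"),
      Map.entry("UH", "INTJ"));

  /**
   * Returns the tag to emit for a token.
   *
   * @param tag tag produced by the POS model
   * @param modelFormat format the model emits (assumed known here)
   * @param outputFormat format requested by the user; CUSTOM means pass through
   */
  static String convert(String tag, TagFormat modelFormat, TagFormat outputFormat) {
    // CUSTOM, or matching formats: emit the model's tag unchanged.
    if (outputFormat == TagFormat.CUSTOM || outputFormat == modelFormat) {
      return tag;
    }
    // Penn model, UD output: look the tag up, falling back to UD's "X" (other).
    if (modelFormat == TagFormat.PENN && outputFormat == TagFormat.UD) {
      return PENN_TO_UD.getOrDefault(tag, "X");
    }
    // Other directions are deliberately left out of this sketch.
    throw new UnsupportedOperationException(
        "conversion not covered by this sketch: " + modelFormat + " -> " + outputFormat);
  }

  public static void main(String[] args) {
    // A Penn model feeding a UD-trained lemmatizer needs conversion.
    System.out.println(convert("NNS", TagFormat.PENN, TagFormat.UD));     // prints NOUN
    // The default (CUSTOM) leaves the model's tags alone.
    System.out.println(convert("NNS", TagFormat.PENN, TagFormat.CUSTOM)); // prints NNS
  }
}
```

Note that several Penn tags collapse into one UD tag (`NN` and `NNS` both become `NOUN`), so the mapping loses information and the reverse direction cannot be an exact inverse.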

Currently, the models used for the unit tests all use the Penn tag format. Retraining the models using the UD format can be addressed as part of #13002 (which I may work on next). To verify the downstream consumption of UD tags by another filter, I manually updated the lemmatizer dictionary (a non-binary model) to add UD tags.

Resolves #14188

msfroh · 2025-02-04T21:07:46Z

Test failure:

Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.search.TestSeededKnnByteVectorQuery.testSeedWithTimeout" -Ptests.jvms=1 -Ptests.jvmargs= -Ptests.seed=533751C89ADE5499 -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=256 -Ptests.forceintegervectors=true

The test failure reliably reproduces with that seed on my Mac laptop.

msfroh · 2025-02-10T18:50:06Z

Test failure:

Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.search.TestSeededKnnByteVectorQuery.testSeedWithTimeout" -Ptests.jvms=1 -Ptests.jvmargs= -Ptests.seed=533751C89ADE5499 -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=256 -Ptests.forceintegervectors=true

The test failure reliably reproduces with that seed on my Mac laptop.

The test was fixed. Everything is good after rebasing to latest 👍

github-actions · 2025-02-25T00:23:38Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

epugh

The changes make sense; however, some more documentation on the UD and PENN formats would be great...

epugh · 2025-02-27T15:31:07Z

@cpoerschke (and anyone else) I haven't done a commit on Lucene in a long time, so I want to get another set of eyes on this. And I need to remember what all is in the workflow as well ;-)

github-actions · 2025-03-14T00:23:28Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the module:analysis label Feb 4, 2025

msfroh force-pushed the opennlp_ud_postags branch 2 times, most recently from 63371f4 to e1be5d5 Compare February 4, 2025 18:47

Add posTagFormat parameter for OpenNLPPOSFilter

fc646dc

msfroh force-pushed the opennlp_ud_postags branch from e1be5d5 to fc646dc Compare February 10, 2025 18:07

github-actions bot added the Stale label Feb 25, 2025

epugh reviewed Feb 26, 2025

github-actions bot removed the Stale label Feb 27, 2025

github-actions bot added the Stale label Mar 14, 2025
