Ukrainian language plugin can fill up heap #71998
Pinging @elastic/es-search (Team:Search)
The Lucene Ukrainian analyzer has a bug where a large in-memory dictionary is loaded and stored on a thread local for every token stream generated in a new thread (for more details see https://issues.apache.org/jira/browse/LUCENE-9930). Due to checks added in #50908, we create a token stream for every registered analyzer in every shard, which means that any node with the Ukrainian plugin installed will leak one copy of this dictionary per shard, whether or not the Ukrainian analyzer is actually being used. This commit makes the plugin use a fixed version of the UkrainianMorfologikAnalyzer until we merge a version of Lucene that contains the upstream fix.
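To illustrate the leak pattern described above, here is a minimal, self-contained sketch. It is not the Lucene or plugin code: the class `DictionaryLeakSketch`, the method `countLoads`, and the small byte array standing in for the dictionary are all hypothetical. It only contrasts loading a resource into a `ThreadLocal` (one copy per thread) with loading it once and sharing it across threads.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class DictionaryLeakSketch {

    // Returns how many times the stand-in "dictionary" is loaded when
    // `threads` fresh threads each request it once.
    // shareDictionary == false: per-thread copy held in a ThreadLocal (leaky pattern).
    // shareDictionary == true: a single instance shared by all threads.
    static int countLoads(int threads, boolean shareDictionary) {
        AtomicInteger loads = new AtomicInteger();

        // Hypothetical loader; a real dictionary would be far larger than 1 KiB.
        Supplier<Object> loader = () -> {
            loads.incrementAndGet();
            return new byte[1024];
        };

        // Loaded eagerly once, and only in the shared case.
        Object shared = shareDictionary ? loader.get() : null;

        // Each thread that calls get() triggers its own load.
        ThreadLocal<Object> perThread = ThreadLocal.withInitial(loader);

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                Object dictionary = shareDictionary ? shared : perThread.get();
                if (dictionary == null) {
                    throw new IllegalStateException("dictionary was not loaded");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("per-thread loads: " + countLoads(4, false)); // 4
        System.out.println("shared loads: " + countLoads(4, true));      // 1
    }
}
```

In the per-thread variant the number of copies grows with the number of threads that touch the resource, which is the same shape of growth as the per-shard leak described in this PR. How the actual fix restructures the analyzer is covered in LUCENE-9930, not in this sketch.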
@romseygeek Is the version label correct in this PR? It's not listed in the release notes (https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-7.13.0.html). If this didn't make it to 7.13.0, will it be in 7.13.1? Thx!
Not sure why it's not in the release notes, but it's in the 7.13 release: d6038a3
#71998 was fixed in 7.13.0 but it is missing from the release notes.
Thx for confirming @romseygeek! I have filed a doc PR to add it (#73440).
#71998 was fixed in 7.13.0 but was missed in the release notes. Co-authored-by: Pius <pius@elastic.co>