Ukrainian language plugin can fill up heap #71998

Merged
romseygeek merged 2 commits into elastic:master from bug/ukrainian-analyzer
Apr 21, 2021

Conversation

romseygeek
Contributor

The Lucene Ukrainian analyzer has a bug where a large in-memory
dictionary is loaded and stored on a thread local for every token stream
generated in a new thread (for more details see
https://issues.apache.org/jira/browse/LUCENE-9930). Due to checks
added in #50908, we create a token stream for every registered
analyzer in every shard, which means that any node with the Ukrainian
plugin installed will leak one copy of this dictionary for every shard,
whether or not the Ukrainian analyzer is actually being used.

This commit makes the plugin use a fixed version of the
UkrainianMorfologikAnalyzer until we merge a version of Lucene that
contains the upstream fix.
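
For readers unfamiliar with this kind of leak, here is a minimal sketch of the pattern described above. It is not the Lucene or Elasticsearch code: `DictionaryLeakSketch`, `Dictionary`, `loadsPerThread` and `loadsShared` are hypothetical names, and the "dictionary" is an empty stand-in that only counts how often it is constructed. The first method assumes each new thread builds its own copy through a thread local; the second assumes the dictionary is loaded once and shared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class DictionaryLeakSketch {

    /** Hypothetical stand-in for the large dictionary; it only records that it was built. */
    static final class Dictionary {
        Dictionary(AtomicInteger loadCounter) {
            loadCounter.incrementAndGet();
        }
    }

    /** Leaky pattern: every new thread that touches the thread local builds its own copy. */
    public static int loadsPerThread(int threads) {
        AtomicInteger loads = new AtomicInteger();
        ThreadLocal<Dictionary> perThread = ThreadLocal.withInitial(() -> new Dictionary(loads));
        runOnNewThreads(threads, perThread::get);
        return loads.get();
    }

    /** Shared pattern: the dictionary is built once and handed to every thread. */
    public static int loadsShared(int threads) {
        AtomicInteger loads = new AtomicInteger();
        Dictionary shared = new Dictionary(loads);
        runOnNewThreads(threads, () -> shared);
        return loads.get();
    }

    /** Starts the given number of fresh threads, runs the setup on each, and waits for all. */
    private static void runOnNewThreads(int threads, Supplier<Dictionary> tokenStreamSetup) {
        List<Thread> started = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(tokenStreamSetup::get);
            t.start();
            started.add(t);
        }
        for (Thread t : started) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for worker threads", e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("per-thread loads: " + loadsPerThread(8)); // prints 8
        System.out.println("shared loads: " + loadsShared(8));        // prints 1
    }
}
```

In the sketch the per-thread variant builds one dictionary per thread, which mirrors the one-copy-per-shard growth described above, while the shared variant builds a single copy regardless of thread count.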

@romseygeek romseygeek requested a review from jpountz April 21, 2021 09:00
@romseygeek romseygeek self-assigned this Apr 21, 2021
@elasticmachine elasticmachine added the Team:Search label Apr 21, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@romseygeek romseygeek merged commit 993f0b0 into elastic:master Apr 21, 2021
@romseygeek romseygeek deleted the bug/ukrainian-analyzer branch April 21, 2021 11:13
romseygeek added a commit that referenced this pull request Apr 21, 2021
@ppf2
Contributor

ppf2 commented May 26, 2021

@romseygeek Is the version label correct in this PR? It's not listed in the release notes (https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-7.13.0.html). If this didn't make it to 7.13.0, will it be in 7.13.1? Thx!

@romseygeek
Contributor Author

Not sure why it's not in the release notes, but it's in the 7.13 release: d6038a3

ppf2 added a commit that referenced this pull request May 26, 2021
#71998 was fixed in 7.13.0 but it is missing from the release notes.
@ppf2
Contributor

ppf2 commented May 26, 2021

Thx for confirming @romseygeek ! I have filed a doc PR to add it (#73440).

jrodewig pushed a commit that referenced this pull request May 26, 2021
#71998 was fixed in 7.13.0 but was missed in the release notes.
jrodewig added a commit that referenced this pull request May 26, 2021
#71998 was fixed in 7.13.0 but was missed in the release notes.

Co-authored-by: Pius <pius@elastic.co>