Add hunspell token filter #8061 #8070

AntonEliatra · 2024-08-22T14:23:29Z

Description

Add hunspell token filter

Issues Resolved

Closes #8061

Version

all

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

github-actions · 2024-08-22T14:23:41Z

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

AntonEliatra · 2024-08-22T14:27:14Z

I did not include the `dedup` parameter because it does not seem to work: duplicates are always returned.

The `indices.analysis.hunspell.dictionary.ignore_case` setting also does not seem to have any effect.

I was also unable to see any difference in behavior when adding `indices.analysis.hunspell.dictionary.lazy: true`.
If there is a difference, I can add it back in.

Finally, according to these docs, you should be able to change the default directory for Hunspell dictionaries, but I was not able to get this to work. If anyone can confirm whether this works and what format is expected, I can update the PR accordingly.
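
For anyone who wants to reproduce the `dedup` observation above, here is a minimal sketch of the request bodies involved. It only builds the JSON; it does not contact a cluster. The index, filter, and analyzer names (`my_hunspell`, `my_analyzer`) are hypothetical, the `en_US` locale is an assumption, and the filter will only load if matching Hunspell dictionary files are already installed on the node.

```python
import json


def hunspell_filter(locale: str, dedup: bool) -> dict:
    """Build a token filter definition of type `hunspell`."""
    return {"type": "hunspell", "locale": locale, "dedup": dedup}


def index_body(filter_name: str, analyzer_name: str, locale: str, dedup: bool) -> dict:
    """Build a create-index body with a custom analyzer that applies the filter."""
    return {
        "settings": {
            "analysis": {
                "filter": {filter_name: hunspell_filter(locale, dedup)},
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first, then stem with Hunspell.
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


def analyze_body(analyzer_name: str, text: str) -> dict:
    """Build the body for an `_analyze` request against the index."""
    return {"analyzer": analyzer_name, "text": text}


if __name__ == "__main__":
    # Hypothetical names; create one index per `dedup` value and compare
    # the tokens returned by `_analyze` for the same text.
    for dedup in (True, False):
        print(json.dumps(index_body("my_hunspell", "my_analyzer", "en_US", dedup), indent=2))
    print(json.dumps(analyze_body("my_analyzer", "the foxes were running"), indent=2))
```

Sending the first body when creating an index and the second to that index's `_analyze` endpoint, once with `dedup` set to `true` and once with `false`, should show whether the two token lists actually differ.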

vagimeli · 2024-08-22T22:25:50Z

@udabhas Could you review the preceding comments from the technical writer and provide your feedback? Thank you.

kolchfa-aws · 2024-08-23T17:42:29Z

@AntonEliatra I would enter this as a bug in the main OpenSearch repo.

AntonEliatra · 2024-08-26T12:49:44Z

Bug issue added: opensearch-project/OpenSearch#15417

The `dedup` parameter has also been added to the PR.

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

vagimeli · 2024-08-28T13:47:22Z

@varun-lodaya The documentation is awaiting tech review and approval, which is delaying progress. Could you please suggest alternative reviewers who can assist with this task in a timely manner? We're eager to move this forward. Thank you.

Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

vagimeli · 2024-10-03T16:08:34Z

@varun-lodaya This is over a month old. We need tech review approval to move it forward in the documentation process. Please review this week or provide a peer who can review it. Thank you.

Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

kolchfa-aws

Thank you, @AntonEliatra! A couple of suggestions.

_analyzers/token-filters/hunspell.md

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

natebower

@kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!

_analyzers/token-filters/hunspell.md

_analyzers/token-filters/index.md

natebower · 2024-11-14T20:31:09Z

_analyzers/token-filters/index.md

 `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
-`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
+[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
 `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.


Line 33, second sentence: "Because Hunspell allows a word to have multiple stems"?
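
To make the "multiple stems" sentence concrete, here is a toy sketch of the behavior it describes. It does not use Hunspell or Lucene: the stem table is made up for illustration, and stacking every stem of a token at the same position is my understanding of how the filter emits them, not something confirmed in this thread.

```python
from typing import Dict, List, Tuple

# Hypothetical stem table for illustration only; this is NOT real
# Hunspell dictionary output.
TOY_STEMS: Dict[str, List[str]] = {
    "saw": ["saw", "see"],
    "leaves": ["leaf", "leave", "leaf"],  # deliberately contains a duplicate
}


def stem_stream(tokens: List[str], dedup: bool) -> List[Tuple[str, int]]:
    """Return (term, position) pairs; all stems of a token share its position."""
    out: List[Tuple[str, int]] = []
    for position, token in enumerate(tokens):
        # Tokens without an entry pass through unchanged in this toy model.
        stems = TOY_STEMS.get(token, [token])
        if dedup:
            # Drop repeated stems while keeping first-seen order.
            stems = list(dict.fromkeys(stems))
        out.extend((stem, position) for stem in stems)
    return out


if __name__ == "__main__":
    print(stem_stream(["she", "saw", "leaves"], dedup=True))
    print(stem_stream(["she", "saw", "leaves"], dedup=False))
```

With `dedup=True`, the token `leaves` yields two entries in this toy model; with `dedup=False`, it yields three because the repeated stem is kept.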

Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

_analyzers/token-filters/hunspell.md

Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* adding hunspell token filter #8061
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding dedup and example where to download files
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating parameter table
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Update _analyzers/token-filters/hunspell.md
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit 01c0d49)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…#8070)

* adding hunspell token filter opensearch-project#8061
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding dedup and example where to download files
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating parameter table
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Update _analyzers/token-filters/hunspell.md
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Eric Pugh <epugh@opensourceconnections.com>

adding hunspell token filter opensearch-project#8061

b7e09d5

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

AntonEliatra requested review from AMoo-Miki, Naarcha-AWS, dlvenable, epugh, kolchfa-aws, natebower, stephen-crawford and vagimeli as code owners August 22, 2024 14:23

github-actions bot assigned kolchfa-aws Aug 22, 2024

kolchfa-aws assigned vagimeli and unassigned kolchfa-aws Aug 23, 2024

AntonEliatra force-pushed the adding-hunspell-token-filter-docs branch from 939fcb9 to b7e09d5 Compare August 26, 2024 15:20

adding dedup and example where to download files

827e908

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

vagimeli added the Tech review PR: Tech review in progress label Aug 27, 2024

vagimeli added the Needs SME label Aug 29, 2024

Update hunspell.md

47d7d1a

Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

vagimeli added Content gap analyzers labels Sep 30, 2024

AntonEliatra added 2 commits October 9, 2024 10:23

Update hunspell.md

374109e

Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

updating parameter table

e12f307

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

kolchfa-aws approved these changes Nov 11, 2024

kolchfa-aws self-assigned this Nov 11, 2024

kolchfa-aws unassigned vagimeli Nov 11, 2024

kolchfa-aws added 2 commits November 14, 2024 15:11

Apply suggestions from code review

fdb1002

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Merge branch 'main' into adding-hunspell-token-filter-docs

72b32d5

kolchfa-aws added Editorial review PR: Editorial review in progress and removed Tech review PR: Tech review in progress labels Nov 14, 2024

natebower reviewed Nov 14, 2024

kolchfa-aws and others added 2 commits November 14, 2024 15:34

Apply suggestions from code review

9e63389

Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Merge branch 'main' into adding-hunspell-token-filter-docs

99a6b39

natebower reviewed Nov 14, 2024

Update _analyzers/token-filters/hunspell.md

abbf175

Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

kolchfa-aws merged commit 01c0d49 into opensearch-project:main Nov 14, 2024
5 checks passed

kolchfa-aws added the backport 2.18 PR: Backport label for 2.18 label Nov 14, 2024

opensearch-trigger-bot bot mentioned this pull request Nov 14, 2024

[Backport 2.18] Add hunspell token filter #8061 #8753

Merged

github-actions bot pushed a commit that referenced this pull request Nov 14, 2024

Add hunspell token filter #8061 (#8070) (#8753)

7ccdb50

AntonEliatra deleted the adding-hunspell-token-filter-docs branch April 23, 2025 08:59
