Add hunspell token filter #8061 #8070
Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
| Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review. | 
| I did not include  Also the configuration for  Also was unable to see any difference in behaviour when adding  Also according to these docs you should be able to change the default directory for hunspell dictionaries, but I was not able to get this to work. If anyone is able to confirm if this works and what format is expected, I can update the PR accordingly | 
| @udabhas Will you see the preceding comments from the technical writer and provide your feedback? Thank you. | 
| @AntonEliatra I would enter this as a bug in the main OpenSearch repo. | 
| Bug issue added opensearch-project/OpenSearch#15417 and  | 
Force-pushed from 939fcb9 to b7e09d5
    Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
| @varun-lodaya The documentation is awaiting tech review and approval, which is delaying progress. Could you please suggest alternative reviewers who can assist with this task in a timely manner? We're eager to move this forward. Thank you. | 
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
| @varun-lodaya This is over a month old. We need tech review approval to move it forward in the documentation process. Please review this week or provide a peer who can review it. Thank you. |
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Thank you, @AntonEliatra! A couple of suggestions.
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!
| `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. |
| `hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. |
| [`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries. |
| `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. | 
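For readers following the `hunspell` row above, here is a minimal sketch of what an index settings body using the filter might look like. It is illustrative only: the names `my_hunspell_filter` and `my_analyzer` are hypothetical, the `en_GB` locale assumes matching Hunspell dictionary files are already installed on every node, and the `dedup` and `longest_only` values are set explicitly rather than relying on any default.

```python
import json


def hunspell_index_settings(locale: str, dedup: bool = True,
                            longest_only: bool = False) -> dict:
    """Build an index-settings body that wires a hunspell filter into a custom analyzer.

    The filter and analyzer names are hypothetical placeholders.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_hunspell_filter": {
                        "type": "hunspell",
                        # Assumption: dictionaries for this locale exist on the node.
                        "locale": locale,
                        "dedup": dedup,
                        "longest_only": longest_only,
                    }
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so tokens match dictionary entries.
                        "filter": ["lowercase", "my_hunspell_filter"],
                    }
                },
            }
        }
    }


body = hunspell_index_settings("en_GB")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of an index creation request. Because Hunspell can return several stems for one word, the resulting analyzer may emit more than one token per input token.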
Line 33, second sentence: "Because Hunspell allows a word to have multiple stems"?
Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* adding hunspell token filter #8061
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding dedup and example where to download files
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating parameter table
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Update _analyzers/token-filters/hunspell.md
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit 01c0d49)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…#8070)

* adding hunspell token filter opensearch-project#8061
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding dedup and example where to download files
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Update hunspell.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating parameter table
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Update _analyzers/token-filters/hunspell.md
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Eric Pugh <epugh@opensourceconnections.com>
Description
Add hunspell token filter
Issues Resolved
Closes #8061
Version
all
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.