Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter #14389
Regexp has the ability to erase case differences at query time (the slow way), but there's no corresponding ability to do it the fast way: at index time.
There's LowerCaseFilter, but LowerCaseFilter normalizes text for display purposes, which is different from case folding, which eliminates case differences and is appropriate for search.
Generate fold() data in a similar way to expand() data. Expose it via UnicodeUtil and tableize Basic Latin for performance. Add CaseFoldingFilter.
No Analyzer chains have been modified yet, but we should be able to improve Unicode support by swapping out LowerCaseFilter as a follow-up. Some filters, such as GreekLowerCaseFilter, can probably be eliminated.
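To illustrate the lowercasing vs. folding distinction and the "tableize Basic Latin" idea, here is a minimal sketch in plain Java. It is not the code from this PR: the class and method names are hypothetical, and `toLowerCase(toUpperCase(cp))` is only a rough stand-in for the generated fold() data (it differs from Unicode simple case folding for some code points, for example U+0130).

```java
public class SimpleFoldSketch {
    // Precomputed table for Basic Latin (U+0000..U+007F):
    // 'A'..'Z' map to 'a'..'z', every other entry maps to itself.
    private static final char[] ASCII_FOLD = new char[128];

    static {
        for (int i = 0; i < 128; i++) {
            ASCII_FOLD[i] = (i >= 'A' && i <= 'Z') ? (char) (i + 32) : (char) i;
        }
    }

    // Hypothetical helper: table lookup for ASCII, approximation elsewhere.
    static int foldCodePoint(int cp) {
        if (cp < 128) {
            return ASCII_FOLD[cp]; // fast path, no Unicode property lookup
        }
        // Assumption: upper-then-lower approximates simple case folding.
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    static String foldString(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(foldCodePoint(cp)));
        return sb.toString();
    }

    // Per-code-point lowercasing, for comparison.
    static String lowerString(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(Character.toLowerCase(cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "\u03A3\u0391\u03A3"; // Greek capital sigma, alpha, sigma
        String lower = "\u03C3\u03B1\u03C2"; // same word, ending in final sigma U+03C2

        // Lowercasing keeps final sigma (U+03C2) distinct from sigma (U+03C3).
        System.out.println(lowerString(upper).equals(lowerString(lower)));
        // Folding maps both sigma forms to U+03C3, so the terms match.
        System.out.println(foldString(upper).equals(foldString(lower)));
    }
}
```

With per-code-point lowercasing the two spellings stay different terms, while folding makes them equal, which is the behavior search wants at index time.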
This is great; it helps me progress some of the regex work in ES, which is why I started the CaseFolding work. Thanks for iterating on this, @rmuir.
Not a contributor, but for what it's worth I took a look at the PR (I was tracking your PRs here as well, @rmuir) and ran a couple of additional tests on characters I would expect to fold. Love the shift to generating the switch/case statements. LGTM.
Never mind, you already mentioned it over there. Thanks a lot!
I will straighten out the build. This one is kind of a draft, as it needs more tests etc.; I just wanted to toss out the idea. If it is autogenerated, we can easily maintain a cohesive story rather than crazy Unicode puzzles. It is tempting to want full case folding, as that's a benefit to e.g. German, but we need to take this in steps. Perf gets more complex, etc. Simple case folding is already an improvement over lowercasing. The goal here is to not regress indexing performance if users switch from lowercasing to simple case folding.
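For readers unfamiliar with the simple vs. full distinction mentioned above, a small sketch in plain Java follows. It does not use any Lucene API: the class and method names are hypothetical, and both methods only approximate Unicode case folding with JDK case mappings. The point it shows is that a full fold can change the length of the text (German ß becomes "ss"), while a simple fold maps one code point to one code point, which is one plausible reason the full variant makes performance more complex.

```java
import java.util.Locale;

public class FoldLengthSketch {
    // Assumption: per-code-point upper-then-lower approximates simple folding.
    // Character.toUpperCase(int) never expands one code point into several.
    static String approxSimpleFold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
            sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp))));
        return sb.toString();
    }

    // Assumption: String-level upper-then-lower approximates full folding.
    // String.toUpperCase applies one-to-many mappings such as U+00DF -> "SS".
    static String approxFullFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String word = "Stra\u00DFe"; // "Strasse" spelled with sharp s (U+00DF)

        String simple = approxSimpleFold(word);
        String full = approxFullFold(word);

        // Simple fold keeps the length; full fold grows by one char.
        System.out.println(word.length() + " -> " + simple.length());
        System.out.println(word.length() + " -> " + full.length());

        // Only the full fold makes the sharp-s spelling match "STRASSE".
        System.out.println(simple.equals(approxSimpleFold("STRASSE")));
        System.out.println(full.equals(approxFullFold("STRASSE")));
    }
}
```

A filter built on the simple variant can rewrite a term buffer in place, whereas the full variant would have to handle output that is longer than its input.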
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!