Add CaseFolding.fold(), inverse of expand(), move to UnicodeUtil, add filter #14389

Open: wants to merge 2 commits into base: main


rmuir (Member) commented Mar 21, 2025

Regexp has the ability to erase case differences at query time (the slow way), but there's no corresponding ability to do it the fast way: at index time.

There's LowerCaseFilter, but LowerCaseFilter normalizes text for display purposes. That is different from case folding, which eliminates case differences and is appropriate for search.
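
To make the distinction concrete, here is a minimal sketch using only JDK calls. `LowerVsFoldSketch` and `approxFold` are hypothetical names, and uppercasing then lowercasing is only a rough approximation of simple case folding that does not agree with the Unicode data for every code point; it is not the `fold()` from this PR. It does show that lowercasing leaves Greek final sigma and long s alone, while a folding-style mapping merges them with sigma and s.

```java
// Sketch only: approximates simple case folding with JDK calls.
// This is NOT the CaseFolding.fold() added by this PR.
final class LowerVsFoldSketch {

  /** Display-style lowercasing of a single code point. */
  static int lower(int cp) {
    return Character.toLowerCase(cp);
  }

  /** Rough stand-in for simple case folding: uppercase first, then lowercase. */
  static int approxFold(int cp) {
    return Character.toLowerCase(Character.toUpperCase(cp));
  }

  public static void main(String[] args) {
    // 'A', Greek final sigma, Latin long s
    int[] samples = {'A', 0x03C2, 0x017F};
    for (int cp : samples) {
      System.out.printf(
          "U+%04X lower=U+%04X approxFold=U+%04X%n", cp, lower(cp), approxFold(cp));
    }
  }
}
```

Final sigma (U+03C2) and long s (U+017F) are already lowercase, so `lower` returns them unchanged, whereas `approxFold` maps them to U+03C3 and `s`.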

Generate fold() data in a similar way to expand() data. Expose it via UnicodeUtil and tableize Basic Latin for performance. Add CaseFoldingFilter.
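
A rough sketch of what "tableize Basic Latin, generate the rest" could look like. The class name, method name, and signature here are assumptions for illustration, and the switch lists only a handful of hand-picked simple case folding mappings rather than the generated data in this PR.

```java
// Hypothetical sketch: table lookup for Basic Latin, switch for everything else.
// The switch cases are a small hand-picked subset of Unicode simple case
// folding mappings, not the generated data from this PR.
final class TableFoldSketch {

  // Precomputed fold results for U+0000..U+007F.
  private static final char[] BASIC_LATIN = new char[128];

  static {
    for (int i = 0; i < BASIC_LATIN.length; i++) {
      BASIC_LATIN[i] = (i >= 'A' && i <= 'Z') ? (char) (i + 32) : (char) i;
    }
  }

  /** Folds one code point; code points without a listed mapping are returned as-is. */
  static int fold(int cp) {
    if (cp >= 0 && cp < BASIC_LATIN.length) {
      return BASIC_LATIN[cp]; // fast path: a single array read
    }
    switch (cp) {
      case 0x00B5: return 0x03BC; // micro sign -> Greek small mu
      case 0x017F: return 0x0073; // Latin long s -> s
      case 0x03C2: return 0x03C3; // Greek final sigma -> sigma
      case 0x1E9E: return 0x00DF; // capital sharp s -> sharp s
      case 0x212A: return 0x006B; // Kelvin sign -> k
      default: return cp; // sketch only: generated code would cover every mapping
    }
  }

  public static void main(String[] args) {
    System.out.printf("U+%04X U+%04X%n", fold('Q'), fold(0x212A));
  }
}
```

The table keeps the common ASCII case to one bounds check and one array read, which is in the spirit of not making folding slower than lowercasing for typical text.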

No Analyzer chains have been modified yet, but we should be able to improve Unicode support by swapping out LowerCaseFilter as a follow-up. Some filters such as GreekLowerCaseFilter can probably be eliminated.

john-wagster (Contributor) commented

This is great; it helps me progress some of the regex work in ES, which is why I started that CaseFolding work. Thanks for iterating on this, @rmuir.

john-wagster (Contributor) left a comment

Not a contributor, but for what it's worth I took a look at the PR (I was tracking your PRs here as well, @rmuir) and ran a couple of additional tests on characters I would expect to fold. Love the shift to generating the switch/case statements. LGTM.

msfroh (Contributor) commented Mar 21, 2025

Awesome! Can I go ahead and use this for #14350 once it's merged?

Never mind, you already mentioned it over there. Thanks a lot!

rmuir (Member, Author) commented Mar 22, 2025

I will straighten out the build. This one is kinda draftish, as it needs more tests etc.; I just wanted to toss out the idea.

If it is autogenerated we can easily maintain some cohesive story rather than crazy Unicode puzzles.

It is tempting to want full case folding, as that's a benefit to e.g. German, but we need to take this in steps: perf gets more complex, etc. Simple case folding is already an improvement over lowercasing.

The goal here is to not regress indexing performance if users switch from lowercase to simple case folding.
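
One way to see why simple folding is the cheaper first step: it maps one code point to one code point, so a term buffer can be folded in place, whereas full folding can change length (for example, ß folds to "ss"). The sketch below is an illustration under stated assumptions, not code from this PR: `InPlaceFoldSketch`, `simpleFold`, and `foldInPlace` are hypothetical names, and `simpleFold` covers only ASCII plus one extra mapping.

```java
// Hypothetical sketch: folding a char buffer in place with a 1:1 mapping.
final class InPlaceFoldSketch {

  /** Tiny stand-in for a simple fold: ASCII letters plus capital sharp s. */
  static int simpleFold(int cp) {
    if (cp >= 'A' && cp <= 'Z') {
      return cp + 32;
    }
    return cp == 0x1E9E ? 0x00DF : cp; // capital sharp s -> sharp s
  }

  /** Folds buffer[offset, offset + length) without allocating or resizing. */
  static void foldInPlace(char[] buffer, int offset, int length) {
    int end = offset + length;
    for (int i = offset; i < end; ) {
      int cp = Character.codePointAt(buffer, i, end);
      // Every mapping in this sketch keeps the same number of chars,
      // so writing the result back at position i never overruns the input.
      i += Character.toChars(simpleFold(cp), buffer, i);
    }
  }

  public static void main(String[] args) {
    char[] term = "STRA\u1E9EE".toCharArray();
    foldInPlace(term, 0, term.length);
    System.out.println(new String(term));
  }
}
```

Under simple folding the buffer length never changes, so the loop has the same shape as an in-place lowercasing loop. Full folding would need room to grow the term, which is part of the added complexity mentioned above.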

github-actions bot commented Apr 5, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 5, 2025