-
Notifications
You must be signed in to change notification settings - Fork 1.2k
LUCENE-10520 / #11556 HTMLStripCharFilter bugfix #11724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10520 / #11556 HTMLStripCharFilter bugfix #11724
Conversation
This change looks good to me! I just want to clone the branch and try to figure how you removed 25,000 lines of code from the generated jflex :) |
What?! How?! :) Seems like the automaton got a whole lot smaller:
|
I hadn't even noticed that! I hope it's not some JFlex side-effect that I unwittingly triggered with my change. |
I don't think you did anything wrong. The changes look fine, it is just a real puzzler. I even cloned your repo and inspected the enormous diff to the generated code and I only see smaller DFAs, not any "logical changes". The test suite does pass. |
Sorry this slipped, @elliotzlin . I think it's fine. Please add a lucene/CHANGES.txt entry (it'll go towards 9.5.0). I'll handle the commit and backport. |
.../analysis/common/src/test/org/apache/lucene/analysis/charfilter/TestHTMLStripCharFilter.java
Show resolved
Hide resolved
.../analysis/common/src/test/org/apache/lucene/analysis/charfilter/TestHTMLStripCharFilter.java
Outdated
Show resolved
Hide resolved
@dweiss can you help run the workflows again? |
@rmuir would you be able to help with running the workflows? |
I will take a look tomorrow - was out of office last week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. These broken HTML examples are so hairy that I found them hard to understand... if somebody wants to sanitize such HTML then a preflight with tagsoup-style html tidy-up tool would probably be a wise step.
Add generated HTMLStripCharFilter and update tests to reflect correct char filter behavior
I believe I've found a regression with this bugfix. A test case that exposes:
The problem is with the empty |
Description
"<" and ">" characters are acceptable in attribute values per the HTML5 spec. See #11556 for more details.