Skip to content

LUCENE-10520 / #11556 HTMLStripCharFilter bugfix #11724

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Oct 2, 2023

Conversation

elliotzlin
Copy link
Contributor

Description

"<" and ">" characters are acceptable in attribute values per the HTML5 spec. See #11556 for more details.

@rmuir
Copy link
Member

rmuir commented Aug 29, 2022

This change looks good to me! I just want to clone the branch and try to figure how you removed 25,000 lines of code from the generated jflex :)

@dweiss
Copy link
Contributor

dweiss commented Aug 30, 2022

What?! How?! :) Seems like the automaton got a whole lot smaller:

   private static int [] zzUnpackAttribute() {
-    int [] result = new int[14794];
+    int [] result = new int[3082]; 

@elliotzlin
Copy link
Contributor Author

I hadn't even noticed that! I hope it's not some JFlex side-effect that I unwittingly triggered with my change.

@rmuir
Copy link
Member

rmuir commented Aug 30, 2022

I don't think you did anything wrong. The changes look fine, it is just a real puzzler.

I even cloned your repo and inspected the enormous diff to the generated code and I only see smaller DFAs, not any "logical changes". The test suite does pass.

@elliotzlin
Copy link
Contributor Author

@rmuir @dweiss I wanted to follow up on this. Is there anything I should work on here or is this good to go?

@dweiss
Copy link
Contributor

dweiss commented Sep 13, 2022

Sorry this slipped, @elliotzlin . I think it's fine. Please add a lucene/CHANGES.txt entry (it'll go towards 9.5.0). I'll handle the commit and backport.

@elliotzlin
Copy link
Contributor Author

@dweiss can you help run the workflows again?

@elliotzlin
Copy link
Contributor Author

@rmuir would you be able to help with running the workflows?

@elliotzlin elliotzlin requested a review from dweiss September 30, 2022 00:56
@elliotzlin
Copy link
Contributor Author

@dweiss @rmuir bumping this old thread again.

@elliotzlin
Copy link
Contributor Author

@dweiss @rmuir apologies for reviving this ancient thread, but I was wondering if one of you can help run the PR checks again?

@dweiss
Copy link
Contributor

dweiss commented Oct 1, 2023

I will take a look tomorrow - was out of office last week.

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. These broken HTML examples are so hairy that I found them hard to understand... if somebody wants to sanitize such HTML then a preflight with tagsoup-style html tidy-up tool would probably be a wise step.

@dweiss dweiss merged commit 4cff584 into apache:main Oct 2, 2023
@dweiss dweiss added this to the 9.9.0 milestone Oct 2, 2023
@dweiss dweiss added the type:bug label Oct 2, 2023
asfgit pushed a commit that referenced this pull request Oct 2, 2023
Add generated HTMLStripCharFilter and update tests to reflect correct char filter behavior
@mjustice3
Copy link
Contributor

I believe I've found a regression with this bugfix. A test case that exposes:

  public void testForIssue10520Regression() throws IOException {
    String test =
        "<!DOCTYPE html><html lang=\"en\"><head><title>Test</title></head><a href=\"https://www.somewhere.com?data=\">a link</a> some text <a href=\"https://www.elsewhere.com\">another link</a></html>";
    Reader reader = new StringReader(test);
    HTMLStripCharFilter filter = new HTMLStripCharFilter(reader);
    StringWriter result = new StringWriter();
    filter.transferTo(result);
    assertEquals("Test\n\na link some text another link", result.toString().trim());
  }

The problem is with the empty data= parameter at the end of the first url. We see a few of those in our document set and that is how I noticed.

mjustice3 added a commit to mjustice3/lucene that referenced this pull request Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants