Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Remove JavaScript from HTML reader output #313

Merged

Conversation

tomaarsen
Copy link
Contributor

Closes #149

Hello!

Pull Request overview

  • No longer turn JavaScript from script tags in an HTML document into an Element.

Bug

See #149 for details - it contains the following snippet:

import requests
from unstructured.partition.html import partition_html

url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-december-13"
r = requests.get(url)
elements = partition_html(text=r.text)
print("\n\n".join([str(el) for el in elements[:5]]))

Which outputs:

Skip to main content

(function(d){
  var js, id = 'facebook-jssdk'; if (d.getElementById(id)) {return;}
  js = d.createElement('script'); js.id = id; js.async = true;
  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
  d.getElementsByTagName('head')[0].appendChild(js);
}(document));

Search form

Home

Who We Are

Details

The bug was caused by the _construct_text function, which was called using a <td> tag. Within this tag, there is a <script> tag, containing the JS. The _construct_text function will happily loop through this <script> tag and consider all of its contents as "text".
This fix verifies whether the tag that we're looping over is not a <script> tag, but recognize that this is a bit of a bandaid solution on a bigger problem: there may be other tags for which we don't want the text to be included in the HTML reader output.

However, for the time being, this is a major improvement.

After this PR

Skip to main content

Search form

Home

Who We Are

Research
  • Tom Aarsen

@tomaarsen tomaarsen marked this pull request as draft February 28, 2023 17:14
Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending one tweak to the CHANGELOG. Thanks!

CHANGELOG.md Outdated
@@ -1,4 +1,4 @@
## 0.5.0
## 0.5.0-dev1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add this under 0.5.1-dev1 and add it as a new entry above 0.5.0?

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually looks like there's a test failing. Please also add a new test that includes a script tags and tests that the script content isn't included in the output.

@tomaarsen
Copy link
Contributor Author

Actually looks like there's a test failing. Please also add a new test that includes a script tags and tests that the script content isn't included in the output.

There is indeed! I converted it to draft until I resolve it.
I'll also tackle the CHANGELOG nitpick.

As it turns out, itertext has "with_tail=True" by default, and performing the same manually with "iter" requires manually adding the tail
@tomaarsen tomaarsen marked this pull request as ready for review February 28, 2023 20:35
Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a test to test_html.py to test that the script tag content gets removed?

@tomaarsen
Copy link
Contributor Author

Will do

@tomaarsen
Copy link
Contributor Author

Should be ready now. Apologies for forgetting to include the new test earlier. The test case is simple: it just checks whether the output elements contain function (. That should be sufficient to ensure that no regression is introduced in the future.
I want to point out that this test case fails when moved to main.

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the test added. Thanks!

@cragwolfe cragwolfe merged commit 350c423 into Unstructured-IO:main Feb 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

partition_html is returning javascript code from some HTML documents
3 participants