partition_html is returning javascript code from some HTML documents #149

Closed
MthwRobinson opened this issue Jan 12, 2023 · 2 comments · Fixed by #313
Labels
bug Something isn't working python Pull requests that update Python code

Comments

MthwRobinson commented Jan 12, 2023

Currently, the partition_html function returns JavaScript code for some HTML documents. The goal of this issue is to update our partitioning logic so that this JavaScript code no longer comes through for the example document below.

Steps to reproduce

import requests
from unstructured.partition.html import partition_html

url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-december-13"
r = requests.get(url)
elements = partition_html(text=r.text)
print("\n\n".join([str(el) for el in elements[:5]]))

You should see the following JavaScript code in elements[1].text:

'(function(d){\n  var js, id = \'facebook-jssdk\'; if (d.getElementById(id)) {return;}\n  js = d.createElement(\'script\'); js.id = id; js.async = true;\n  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";\n  d.getElementsByTagName(\'head\')[0].appendChild(js);\n}(document));'
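To check the output without reading every element by eye, a small pattern-based scan can flag texts that look like JavaScript. This is only a rough sketch: `looks_like_javascript`, `find_js_like_texts`, and the pattern list are hypothetical helpers made up for illustration, not part of unstructured, and the threshold of two matching patterns is an arbitrary assumption.

```python
import re

# Hypothetical patterns: constructs that are common in JavaScript but rare in prose.
JS_PATTERNS = [
    re.compile(r"\bfunction\s*\("),
    re.compile(r"\bvar\s+\w+"),
    re.compile(r"\.getElementById\("),
    re.compile(r"\.createElement\("),
]


def looks_like_javascript(text, min_hits=2):
    """Return True if at least `min_hits` distinct JS patterns occur in `text`."""
    hits = sum(1 for pattern in JS_PATTERNS if pattern.search(text))
    return hits >= min_hits


def find_js_like_texts(texts):
    """Return the indices of the texts that look like JavaScript."""
    return [i for i, text in enumerate(texts) if looks_like_javascript(text)]


# Shortened version of the snippet reported in this issue.
SAMPLE_JS = (
    "(function(d){\n"
    "  var js, id = 'facebook-jssdk'; if (d.getElementById(id)) {return;}\n"
    "  js = d.createElement('script'); js.id = id; js.async = true;\n"
    "}(document));"
)
SAMPLE_PROSE = "Russian forces continued offensive operations in the east."

flagged = find_js_like_texts([SAMPLE_PROSE, SAMPLE_JS])
print(flagged)
```

With the reproduction script above, the same scan could be run as `find_js_like_texts([el.text for el in elements])`, assuming each element exposes its content through `.text` as the issue describes.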
@MthwRobinson MthwRobinson added bug Something isn't working help wanted python Pull requests that update Python code labels Jan 12, 2023
@MthwRobinson
Contributor Author

For the document in question, it looks like the offending JavaScript is actually coming back in a <td> tag. is_possible_narrative_text is also flagging this block as narrative text, which isn't right. I think we are actually already removing the script tags.

'(function(d){\n  var js, id = \'facebook-jssdk\'; if (d.getElementById(id)) {return;}\n  js = d.createElement(\'script\'); js.id = id; js.async = true;\n  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";\n  d.getElementsByTagName(\'head\')[0].appendChild(js);\n}(document));'
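One way text like this can end up in a table cell is if a script sits inside the cell and the cell's text is collected from all of its descendants. The sketch below illustrates that case under stated assumptions: the HTML fragment is invented (the issue does not show the page's markup), `VisibleTextExtractor` and `visible_text` are hypothetical names, and it uses Python's standard `html.parser` rather than unstructured's own parsing. It is not necessarily how #313 addressed the problem.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of `html` with script and style bodies left out."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented fragment: a script nested inside a table cell.
HTML = """<table><tr><td>
<script>(function(d){ var js = d.createElement('script'); }(document));</script>
Russian Offensive Campaign Assessment
</td></tr></table>"""

print(visible_text(HTML))
```

The design choice here is to track nesting depth rather than a boolean flag, so that text stays suppressed until every skipped tag that was opened has been closed.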

@MthwRobinson
Contributor Author

Updating the issue description to reflect the last comment

@MthwRobinson MthwRobinson changed the title from "partition_html is returning javascript code from <script> tags" to "partition_html is returning javascript code from some HTML documents" Feb 15, 2023