
MthwRobinson
Contributor

Summary

Fixes a bug introduced in #313 and closes #332. Script elements are now stripped from the XML document up front rather than skipped during the iteration step. The switch from tag_elem.itertext() to tag_elem.iter() had caused some text to be missed.
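
To illustrate why skipping script elements during iteration can lose text while stripping them first does not, here is a minimal sketch. It is not the code from this PR: it uses the standard library's xml.etree.ElementTree instead of the project's parser so that it runs without dependencies, and the helper names text_via_iter and strip_scripts, along with the sample markup, are hypothetical. The key point is that any text following an inline child element is stored in that child's tail, so reading only each element's text drops it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: text after </b> and after </script> lives in .tail.
SNIPPET = (
    "<p>Ideas come from <b>noticing</b> problems"
    "<script>track();</script> in your own work.</p>"
)


def text_via_iter(elem):
    """Join only the .text of each non-script element; tail text is lost."""
    return "".join(
        child.text or "" for child in elem.iter() if child.tag != "script"
    )


def strip_scripts(elem):
    """Remove <script> elements in place while keeping their tail text."""
    for parent in list(elem.iter()):
        for child in list(parent):
            if child.tag != "script":
                continue
            index = list(parent).index(child)
            tail = child.tail or ""
            if index == 0:
                # No previous sibling: the tail joins the parent's text.
                parent.text = (parent.text or "") + tail
            else:
                # Otherwise it joins the previous sibling's tail.
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)


paragraph = ET.fromstring(SNIPPET)
lossy = text_via_iter(paragraph)      # skip-while-iterating approach
strip_scripts(paragraph)              # strip-first approach
full = "".join(paragraph.itertext())  # itertext() includes tail text

print(repr(lossy))
print(repr(full))
```

With lxml, etree.strip_elements(tree, "script", with_tail=False) does a comparable removal in one call. Whether the PR uses that exact call is not stated here.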

Testing

The following should now output a full paragraph of text:

from unstructured.partition.html import partition_html

url = "http://paulgraham.com/getideas.html"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))

MthwRobinson requested a review from LaverdeS on March 2, 2023 16:49
@@ -0,0 +1,44 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Contributor

Nice addition. Might it be better to also have one of these in the main repo?

MthwRobinson merged commit a5da3de into main on March 2, 2023
MthwRobinson deleted the fix/missing-text-in-html branch on March 2, 2023 19:03
Development

Successfully merging this pull request may close these issues.

bug/issue parsing html file