fix: ensure all text is maintained in html output #335

MthwRobinson · 2023-03-02T16:49:53Z

Summary

Fixes a bug introduced in #313 and closes #332. Script elements are now stripped from the XML document up front rather than skipped during the iteration step. The switch from tag_elem.itertext() to tag_elem.iter() had caused some text to be missed.
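
To illustrate the failure mode described above, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree rather than the parser used in unstructured/documents/html.py. SNIPPET and the helpers text_via_iter, strip_scripts, and text_via_itertext are hypothetical names made up for this example and are not taken from the PR. Reading only each element's .text while walking iter() loses "tail" text (text that follows a child's closing tag). Stripping the script elements first and then calling itertext() keeps it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a paragraph with an inline <b> and an inline <script>.
SNIPPET = (
    "<div><p>Ideas come from <b>noticing</b> problems"
    "<script>track();</script> in your own work.</p></div>"
)


def text_via_iter(root):
    # Reads only each element's .text and skips <script> during iteration.
    # Tail text (after a child's closing tag) is silently lost.
    return "".join(el.text or "" for el in root.iter() if el.tag != "script")


def strip_scripts(root):
    # Remove every <script> element in place, but keep the text that
    # follows it by moving the tail onto the previous sibling or the parent.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag != "script":
                previous = child
                continue
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)


def text_via_itertext(root):
    # Strip scripts first, then collect all remaining text, tails included.
    strip_scripts(root)
    return "".join(root.itertext())


if __name__ == "__main__":
    print(repr(text_via_iter(ET.fromstring(SNIPPET))))
    print(repr(text_via_itertext(ET.fromstring(SNIPPET))))
```

The first call drops " problems" and " in your own work." because both are tail text. The second call returns the whole sentence without the script body.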

Testing

The following should now output a full paragraph of text:

from unstructured.partition.html import partition_html

url = "http://paulgraham.com/getideas.html"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))

LaverdeS · 2023-03-02T18:27:03Z

example-docs/ideas-page.html

@@ -0,0 +1,44 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">


Nice addition. Maybe it would be better to also have one of these in the main repo?

MthwRobinson added 3 commits March 2, 2023 11:44

fix: ensure all text is maintained in html pages

b1a4163

add back in replace unicode quotes

38e3080

changelog and version bump

3062507

MthwRobinson requested a review from LaverdeS March 2, 2023 16:49

tomaarsen reviewed Mar 2, 2023

unstructured/documents/html.py

apt-get update in ci

6a80366

LaverdeS approved these changes Mar 2, 2023

white space differences in output

bb9c421

MthwRobinson merged commit a5da3de into main Mar 2, 2023

MthwRobinson deleted the fix/missing-text-in-html branch March 2, 2023 19:03
