Commit 909716f

feat: keep input tag's class attr in table (#4064)
This change affects HTML partitioning. Previously, when the HTML contained a table, every tag inside the table had its `class` and `id` attributes removed, with one exception: `img` tags kept their `class` attribute. This change also preserves the `class` attribute of `input` tags inside a table. The difference is visible in a table element's `metadata.text_as_html` attribute.
1 parent 4468268 commit 909716f

4 files changed: +19 -4 lines changed


CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@
 
 ### Enhancements
 
+- **`text_as_html` for Table element now keeps both `input` and `img` tag's `class` attribute** Previously in partition HTML any tag inside a table is stripped of its `class` attribute. Now this attribute is preserved for both `input` and `img` tag in the table element's `metadata.text_as_html`.
+
 ### Features
 
 - **Add language detection for PDFs** Add document and element level language detection to PDFs.

test_unstructured/documents/test_ontology_to_unstructured_parsing.py

Lines changed: 8 additions & 0 deletions
@@ -37,6 +37,10 @@ def test_remove_ids_and_class_from_table():
             <td><IMG class="Signature" alt="cell 3"/></td>
             <td>cell 4</td>
         </tr>
+        <tr>
+            <td><input class="Checkbox" type="checkbox"/></td>
+            <td>Option 1</td>
+        </tr>
     </table>
     """
     soup = BeautifulSoup(html_text, "html.parser")
@@ -52,6 +56,10 @@ def test_remove_ids_and_class_from_table():
             <td><img alt="cell 3" class="Signature"/></td>
             <td>cell 4</td>
         </tr>
+        <tr>
+            <td><input class="Checkbox" type="checkbox"/></td>
+            <td>Option 1</td>
+        </tr>
     </table>
     """
     )

test_unstructured/partition/html/test_partition_v2.py

Lines changed: 3 additions & 1 deletion
@@ -71,5 +71,7 @@ def test_attr_and_html_inside_table_cell_is_kept():
         html_parser_version="v2",
     )
 
-    assert '<input checked="" type="checkbox"/>' in table.metadata.text_as_html  # class is removed
+    assert (
+        '<input checked="" class="Checkbox" type="checkbox"/>' in table.metadata.text_as_html
+    )  # class is kept
     assert 'colspan="2"' in table.metadata.text_as_html

unstructured/documents/ontology.py

Lines changed: 6 additions & 3 deletions
@@ -147,13 +147,16 @@ def page_number(self) -> int | None:
         return None
 
 
-def remove_ids_and_class_from_table(soup: BeautifulSoup) -> BeautifulSoup:
+def remove_ids_and_class_from_table(
+    soup: BeautifulSoup, class_attr_to_keep: list[str] = ["img", "input"]
+) -> BeautifulSoup:
     """
     Remove id and class attributes from tags inside tables,
-    except preserve class attributes for img tags.
+    except preserve class attributes for selected tags.
 
     Args:
         soup: BeautifulSoup object containing the HTML
+        class_attr_to_keep: a list of tag names whose class attr will be kept
 
     Returns:
         BeautifulSoup: Modified soup with attributes removed
@@ -162,7 +165,7 @@ def remove_ids_and_class_from_table(soup: BeautifulSoup) -> BeautifulSoup:
         if tag.name.lower() == "table":  # type: ignore
             continue  # We keep table tag
         tag.attrs.pop("id", None)  # type: ignore
-        if tag.name.lower() != "img":  # type: ignore
+        if tag.name.lower() not in class_attr_to_keep:  # type: ignore
             tag.attrs.pop("class", None)  # type: ignore
     return soup
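
The rule this commit implements can be illustrated with a small standalone sketch. It is not the library's code: it does not call `remove_ids_and_class_from_table` or use BeautifulSoup, and instead applies the same rule with the standard library's `xml.etree.ElementTree`, so it assumes well-formed markup. The names `strip_table_attrs` and `DEFAULT_KEEP` are hypothetical, and the tuple default and lowercasing of the keep list are choices made for this sketch rather than behavior taken from the commit.

```python
import xml.etree.ElementTree as ET

# Hypothetical name; mirrors the commit's default of ["img", "input"].
DEFAULT_KEEP = ("img", "input")


def strip_table_attrs(table_xml: str, class_attr_to_keep=DEFAULT_KEEP) -> str:
    """Drop id and class inside a table, keeping class on selected tags."""
    root = ET.fromstring(table_xml)
    # Normalize the keep list so "IMG" and "img" are treated the same (sketch-only choice).
    keep = {name.lower() for name in class_attr_to_keep}
    for el in root.iter():
        if el.tag.lower() == "table":
            continue  # the table tag itself is left untouched
        el.attrib.pop("id", None)
        if el.tag.lower() not in keep:
            el.attrib.pop("class", None)
    return ET.tostring(root, encoding="unicode")


html = (
    '<table id="t1" class="Table">'
    '<tr id="r1" class="Row">'
    '<td class="Cell"><input class="Checkbox" type="checkbox"/></td>'
    '<td id="c2">Option 1</td>'
    "</tr></table>"
)

cleaned = strip_table_attrs(html)
print(cleaned)

# Passing a different keep list changes which tags retain their class.
only_td = strip_table_attrs(html, class_attr_to_keep=("td",))
print(only_td)
```

With the default keep list, the `input` tag retains `class="Checkbox"` while the `tr` and `td` tags lose their `id` and `class` attributes; the `table` tag keeps both.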
