Skip to content

Conversation

@badGarnet
Copy link
Collaborator

This change affects partition html.

Previously when there is a table in the html, we clean any tags inside the table of their class and id attributes except for the class attribute for img tags. This change also preserves the class attribute for input tags inside a table. This change is reflected in a table element's metadata.text_as_html attribute.

@badGarnet
Copy link
Collaborator Author

@claude review

@claude
Copy link

claude bot commented Jul 16, 2025

Claude finished @badGarnet's task —— View job


PR Review for feat: keep input tag's class attr in table

Todo List:

  • Read and analyze the main implementation changes in ontology.py
  • Review test changes and coverage
  • Check changelog and version updates
  • Provide comprehensive code review feedback

Code Review

Summary

This PR extends the existing functionality that preserves img tag class attributes within HTML tables to also preserve input tag class attributes. The implementation is clean and follows the existing pattern well.

✅ Strengths

  1. Consistent Implementation: The change follows the exact same pattern as the existing img tag preservation, maintaining consistency in the codebase.

  2. Proper Default Parameters: The remove_ids_and_class_from_table function now defaults to ["img", "input"] for class_attr_to_keep parameter, which maintains backward compatibility while extending functionality.

  3. Comprehensive Test Coverage: The test test_remove_ids_and_class_from_table correctly validates that:

    • input tags preserve their class attribute (class="Checkbox")
    • img tags still preserve their class attribute (existing functionality)
    • Other tags like tr have their class attributes removed as expected
  4. Proper Documentation: The changelog entry accurately describes the feature and its impact on metadata.text_as_html.

  5. Function Documentation: The function docstring at ontology.py:153-163 correctly describes the new behavior.

🔍 Technical Analysis

File: unstructured/documents/ontology.py

  • Lines 151-152: The function signature change is minimal and maintains backward compatibility
  • Lines 168-169: The logic correctly checks if the tag name is in the class_attr_to_keep list before removing the class attribute
  • Lines 313, 482: Both Table.to_html() and TableOfContents.to_html() methods properly call the updated function

File: test_unstructured/documents/test_ontology_to_unstructured_parsing.py

  • Lines 41-43: The test case adds an input tag with realistic attributes (class="Checkbox" and type="checkbox")
  • Line 60: The expected output correctly shows the preserved class attribute

🎯 Areas for Consideration

  1. Edge Cases: Consider testing with:

    • Multiple classes on input tags (class="class1 class2")
    • Input tags with different types (text, radio, submit, etc.)
    • Mixed case HTML (<INPUT> vs <input>)
  2. Performance: The current implementation calls tag.name.lower() for each tag. While not a major concern for typical use cases, it's worth noting.

  3. Documentation: The PR description mentions this affects metadata.text_as_html but doesn't show a concrete example of the before/after behavior.

📋 Minor Suggestions

  1. Test Enhancement: Consider adding a test case that verifies the text_as_html metadata contains the preserved class attributes:

    # Example test to verify metadata.text_as_html preservation
    table_element = Table(children=[...])
    assert 'class="Checkbox"' in table_element.metadata.text_as_html
  2. Code Style: The implementation is clean and follows existing patterns well.

✅ Security & Performance

  • No security concerns identified
  • Performance impact is minimal and consistent with existing functionality
  • No sensitive data exposure risks

🏆 Overall Assessment

This is a well-implemented, focused change that extends existing functionality in a consistent manner. The code is clean, properly tested, and documented. The PR successfully achieves its stated goal of preserving input tag class attributes in HTML tables alongside the existing img tag preservation.

Recommendation: ✅ APPROVE

The changes are ready for merge with the minor suggestions being optional enhancements for future consideration.

@badGarnet badGarnet marked this pull request as ready for review July 16, 2025 21:08
@badGarnet badGarnet marked this pull request as draft July 16, 2025 21:09
Copy link
Contributor

@MaksOpp MaksOpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@badGarnet badGarnet marked this pull request as ready for review July 16, 2025 21:13
@badGarnet badGarnet enabled auto-merge July 16, 2025 21:14
@badGarnet badGarnet added this pull request to the merge queue Jul 16, 2025
Merged via the queue into main with commit 909716f Jul 16, 2025
36 checks passed
@badGarnet badGarnet deleted the feat/preserve-class-attr-for-input-in-table branch July 16, 2025 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants