Open
Conversation
This commit fixes issue #25 by replacing the XML parser with lxml's HTML parser for HTML files. The XML parser incorrectly required quoted attributes (e.g., name="foo"), but HTML5 allows unquoted attributes (e.g., name=foo). Changes: - Added lxml>=4.9.0 as a dependency in pyproject.toml - Overrode build_tree() method in HTML class to use lxml.html parser - Added _lxml_to_et() helper to convert lxml trees to ElementTree format - Implemented graceful fallback to XML parser if lxml is not available - Added comprehensive tests for both quoted and unquoted HTML attributes - XML parsing remains unchanged and continues to use strict ET parser The lxml.html parser provides lenient HTML5-compliant parsing while maintaining compatibility with the existing XMLElement tree structure. All 66 existing tests pass with no regressions. Fixes #25 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds a dedicated test file (test_html.py) with comprehensive tests for the HTML unquoted attributes fix (issue #25). Tests added: - test_unquoted_attributes: Tests the exact issue from #25 - parsing HTML with unquoted attributes like <meta name=foo> - test_mixed_quoted_unquoted_attributes: Ensures mixed quoted/unquoted attributes work correctly - test_quoted_attributes_backward_compatibility: Verifies backward compatibility with fully quoted HTML - test_complex_unquoted_attributes: Tests more complex real-world cases with multiple unquoted attributes All tests verify that: 1. HTML files parse without errors 2. Trees are built successfully 3. Diffs can be computed between files 4. Changes are correctly captured in the diff output The test file is designed to be extensible for future HTML-related improvements. Related to #25 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds tests for two additional HTML parsing issues that are resolved by the lxml HTML parser: Issue #26 - Fails to match closing HTML tags: - Added test_closing_tags_issue_26() to verify that HTML files with closing tags like </head> parse and diff correctly - The XML parser previously threw errors on closing tags - lxml's HTML parser handles these correctly Issue #80 - Text missing from HTML diff: - Added test_text_between_elements_issue_80() to verify that text nodes between elements are correctly parsed and included in diffs - Tests the exact example from the issue: text outside of tags like "some <div>text</div> more text" - lxml correctly parses text nodes as element text and tail attributes Both tests pass, confirming that the lxml HTML parser (introduced to fix issue #25) also resolves these related HTML parsing issues. Test results: 72 tests pass (66 original + 6 HTML tests) Related to #26, #80 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Author
|
cc: @smoelius |
ESultanik
requested changes
Dec 1, 2025
Collaborator
ESultanik
left a comment
There was a problem hiding this comment.
This looks good, but there's one recursive function that will almost certainly cause a stack overflow on large HTML files. It needs to be converted to an iterative function.
|
|
||
| # Recursively convert children | ||
| for child in lxml_elem: | ||
| et_elem.append(HTML._lxml_to_et(child)) |
Collaborator
There was a problem hiding this comment.
This recursion will exhaust Python's tiny stack if the HTML DOM is very deep. _lxml_to_et needs to be converted to an iterative function (e.g., using a stack).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR replaces the XML parser with lxml's HTML parser for HTML files, fixing three long-standing issues with HTML parsing in graphtage.
Issues Resolved
Fixes #25 - Unquoted HTML attributes
Fixes #26 - Closing tag matching errors
Fixes #80 - Text nodes between elements missing from diffs
This PR implements a proper HTML parser using lxml.html:
Test Coverage
Added 6 new HTML-specific tests covering:
✅ Unquoted attributes ()
✅ Mixed quoted/unquoted attributes
✅ Backward compatibility with quoted attributes
✅ Complex real-world attribute patterns
✅ Closing tag handling
✅ Text nodes between elements