fix: respect <base> tag in html2text conversion #1703
Closed
biplavbarua wants to merge 4 commits into unclecode:develop from
biplavbarua commented Jan 14, 2026
biplavbarua (Author) left a comment
LGTM! This fixes the critical issue of broken relative links when a <base> tag is present.
Minor implementation note: since html2text is a stream parser, the updated base URL applies only to tags that come after the <base> tag. The HTML spec mandates that the first <base> tag in <head> controls the whole document (including elements before it), but strictly honoring that requires a two-pass approach or a full DOM tree.
Given the constraints of HTMLParser, this is the correct pragmatic solution.
Verified that the urljoin logic correctly handles accumulation and replacement of the base path.
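The stream-parser behavior described above can be illustrated with a minimal sketch. This is not the code from this PR or from html2text itself: `LinkCollector` and `collect_links` are hypothetical names, and the sketch only assumes the standard library's `html.parser.HTMLParser` and `urllib.parse.urljoin`. It shows that a link seen before the <base> tag is resolved against the original base URL, while later links use the updated one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for a stream parser that resolves links."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # current base for resolving relative links
        self.links = []           # resolved href values, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base":
            # Join against the current base so a relative <base href> works;
            # an absolute href simply replaces the base.
            # Assumption: every <base> updates the base, unlike the HTML
            # spec, where only the first <base> with an href counts.
            self.base_url = urljoin(self.base_url, href)
        elif tag == "a":
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url=""):
    """Feed the HTML through the parser in one pass and return resolved links."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<a href="early.html">early</a>'
    '<base href="https://example.com/docs/">'
    '<a href="page.html">page</a>'
)
resolved = collect_links(sample, "https://origin.test/")
```

With this input, the first link is resolved against `https://origin.test/` and the second against `https://example.com/docs/`, which is the single-pass limitation noted in the review. A spec-compliant variant would need to locate the first <base> before resolving any link.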
Author
Re-verified local build against latest master. Fix is stable.
Owner
Closing: this is addressed by #1721, which has been merged. Thanks for the contribution, though.
unclecode added a commit that referenced this pull request Feb 1, 2026
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
Description
The html2text parser currently ignores the HTML <base> tag. This PR adds logic to detect the <base> tag and update the parser's base URL accordingly, ensuring that relative links are resolved correctly.
Related Issue
Fixes #1680
Verification
- ... tests/test_base_tag_local.py which passes.
- ... href specified in the <base> tag.