Contributor

@jtojnar jtojnar commented Mar 4, 2025

`DOMDocument::loadHTML` parses HTML documents as ISO-8859-1 when there is no `meta[charset]` tag. As a result, UTF-8-encoded HTML fragments, such as those coming from the JSON-LD `articleBody` field, are parsed with the wrong encoding.
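To make the failure mode concrete, here is a minimal sketch that simulates what happens when UTF-8 bytes are read as ISO-8859-1. It does not call `DOMDocument` at all; `decodeAsLatin1` is a hypothetical helper written only for this illustration, and it assumes the parser maps each input byte to one Latin-1 character.

```php
<?php

/**
 * Hypothetical helper (not part of the library): re-encode a byte string
 * as UTF-8 while treating every input byte as one ISO-8859-1 character.
 * This mimics a parser that wrongly assumes ISO-8859-1 for UTF-8 input.
 */
function decodeAsLatin1(string $bytes): string
{
    $out = '';
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($bytes[$i]);
        if ($byte < 0x80) {
            // ASCII bytes are identical in both encodings.
            $out .= chr($byte);
        } else {
            // Bytes 0x80-0xFF map to U+0080-U+00FF, which need two bytes in UTF-8.
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }

    return $out;
}

// "café" encoded as UTF-8: the "é" is the two bytes 0xC3 0xA9.
echo decodeAsLatin1("caf\xC3\xA9"), PHP_EOL; // prints "cafÃ©"
```

Each of the two bytes of `é` becomes its own character, which is the kind of mojibake that shows up in the extracted content.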

In f14428e, we tried to resolve this by putting a `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that this causes the parser to auto-insert an `html` element, losing the attributes of the original `html` tag.

Let’s instead try to insert the `meta[charset]` tag into the proper place in the HTML document.
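As a rough illustration of what "the proper place" could mean, here is a sketch using plain string matching. It is not the code from this PR: `insertCharsetMeta` is a hypothetical name, and the regular expressions assume reasonably well-formed `html` and `head` opening tags.

```php
<?php

/**
 * Hypothetical helper (an assumption for illustration, not the PR's code):
 * add <meta charset="utf-8"> without displacing an existing <html> tag.
 */
function insertCharsetMeta(string $html): string
{
    $meta = '<meta charset="utf-8">';

    // Prefer the position right after an existing <head ...> opening tag.
    // The \b keeps <header> from matching.
    $result = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    if ($count > 0) {
        return $result;
    }

    // Otherwise add a <head> right after the <html ...> opening tag,
    // so the attributes on the original tag stay where they are.
    $result = preg_replace('/<html\b[^>]*>/i', '$0<head>' . $meta . '</head>', $html, 1, $count);
    if ($count > 0) {
        return $result;
    }

    // Plain fragment: there is no <html> tag to shadow, so prepending is fine.
    return $meta . $html;
}

echo insertCharsetMeta('<html lang="fr"><body><p>x</p></body></html>'), PHP_EOL;
// prints: <html lang="fr"><head><meta charset="utf-8"></head><body><p>x</p></body></html>
```

The returned string could then be handed to `DOMDocument::loadHTML`. The point of the ordering is that the `meta` tag never appears before the document's own `html` tag, so the parser has no reason to synthesize a second one.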

We do not need to use the same trick with `JSLikeHTMLElement::__set`, since that expects smaller HTML fragments, not full `html` documents, so creating `html` and `head` elements will not be a problem.

Also includes some unrelated test cleanups I noticed along the way.

jtojnar added 3 commits March 3, 2025 23:53
Owner

j0k3r commented Mar 4, 2025

Thanks for fixing it!

Contributor Author

jtojnar commented Mar 4, 2025

Thanks. What are the next steps here?

Owner

j0k3r commented Mar 4, 2025

I can merge everything and cut a release for 1.x & 2.x.

@j0k3r j0k3r merged commit 7413a38 into j0k3r:master Mar 4, 2025
10 checks passed
@jtojnar jtojnar deleted the html-shadowing branch March 4, 2025 09:26
2 participants