Commit f14428e

Do not use mb_convert_encoding with HTML-ENTITIES as target encoding
This is deprecated since PHP 8.2:

    Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead

It was used because `DOMDocument`, which uses libxml2 internally, parses HTML as ISO-8859-1 unless the document contains an XML encoding declaration or an HTML meta tag that sets the character set.

Since the first such declaration wins, putting `meta[charset]` up front ensures the parser uses the correct encoding even if the document contains an incorrect meta tag (e.g. when the software passing the document to Readability converted it to UTF-8 without also updating the metadata).

https://stackoverflow.com/a/39148511/160386
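
The effect described in the commit message can be checked with a small standalone script. This is only a sketch: `textOfFirstParagraph()` is a hypothetical helper written for this illustration, and the result for the input without a declaration is deliberately not stated, since it depends on the default of the installed libxml2.

```php
<?php
// Illustration only: textOfFirstParagraph() is a hypothetical helper,
// not part of the library touched by this commit.
function textOfFirstParagraph(string $html): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $p = $dom->getElementsByTagName('p')->item(0);

    return null === $p ? '' : $p->textContent;
}

$body = '<p>café</p>'; // UTF-8 encoded input without any charset declaration

// No declaration: the decoding is left to libxml2's default.
$withoutMeta = textOfFirstParagraph($body);

// Declaration first: the parser is told to decode the input as UTF-8.
$withMeta = textOfFirstParagraph('<meta charset="utf-8">' . $body);

var_dump($withoutMeta === 'café', $withMeta === 'café');
```

The second value is the one the commit relies on: with the declaration prepended, the text should come back unchanged.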
1 parent 23f824a commit f14428e

2 files changed: +2 -3 lines changed

src/JSLikeHTMLElement.php

Lines changed: 1 addition & 2 deletions
@@ -79,14 +79,13 @@ public function __set($name, $value)
             } else {
                 // $value is probably ill-formed
                 $f = new \DOMDocument();
-                $value = mb_convert_encoding($value, 'HTML-ENTITIES', 'UTF-8');

                 // Using <htmlfragment> will generate a warning, but so will bad HTML
                 // (and by this point, bad HTML is what we've got).
                 // We use it (and suppress the warning) because an HTML fragment will
                 // be wrapped around <html><body> tags which we don't really want to keep.
                 // Note: despite the warning, if loadHTML succeeds it will return true.
-                $result = $f->loadHTML('<htmlfragment>' . $value . '</htmlfragment>');
+                $result = $f->loadHTML('<meta charset="utf-8"><htmlfragment>' . $value . '</htmlfragment>');

                 if ($result) {
                     $import = $f->getElementsByTagName('htmlfragment')->item(0);
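
The wrapper technique in this hunk can be tried in isolation. The following is a minimal sketch under stated assumptions, not the library's implementation: `parseUtf8Fragment()` is a hypothetical name, and `libxml_use_internal_errors()` stands in for whatever warning suppression the surrounding code uses.

```php
<?php
// Hypothetical helper for illustration; it mirrors the idea of the hunk above:
// declare UTF-8 first, wrap the fragment in a throwaway element, parse, and
// read the wrapper's children back.
function parseUtf8Fragment(string $fragment): array
{
    $dom = new DOMDocument();
    // The unknown <htmlfragment> tag triggers a libxml warning; keep it internal.
    $previous = libxml_use_internal_errors(true);
    $ok = $dom->loadHTML('<meta charset="utf-8"><htmlfragment>' . $fragment . '</htmlfragment>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        return [];
    }

    $wrapper = $dom->getElementsByTagName('htmlfragment')->item(0);
    if (null === $wrapper) {
        return [];
    }

    // Collect [node name, text content] for each direct child of the wrapper.
    $children = [];
    foreach ($wrapper->childNodes as $child) {
        $children[] = [$child->nodeName, $child->textContent];
    }

    return $children;
}

$children = parseUtf8Fragment('<b>žluť</b><i>kůň</i>');
print_r($children);
```

Reading the children of the wrapper rather than the whole document is what lets the caller ignore the `<html>`, `<head>` and `<body>` elements that the parser adds around the fragment.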

src/Readability.php

Lines changed: 1 addition & 1 deletion
@@ -1426,7 +1426,7 @@ private function loadHtml(): void
             unset($tidy);
         }

-        $this->html = mb_convert_encoding((string) $this->html, 'HTML-ENTITIES', 'UTF-8');
+        $this->html = '<meta charset="utf-8">' . (string) $this->html;

         if ('html5lib' === $this->parser || 'html5' === $this->parser) {
             $this->dom = (new HTML5())->loadHTML($this->html);
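
For comparison, the deprecation notice names `mb_encode_numericentity` as one possible replacement for the removed call. The sketch below shows what that alternative might look like; it is not what this commit does, and its output is not guaranteed to be byte-for-byte identical to the old `HTML-ENTITIES` conversion. The conversion map is an assumption chosen to cover every code point outside ASCII.

```php
<?php
// Map every code point from U+0080 to U+10FFFF to a decimal numeric entity.
// Format of the map: [start, end, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$html = '<p>café</p>'; // UTF-8 encoded input

// ASCII markup is left alone; only the non-ASCII character is rewritten.
$encoded = mb_encode_numericentity($html, $map, 'UTF-8');

echo $encoded, PHP_EOL; // <p>caf&#233;</p>
```

Either approach leaves the parser with input it cannot misread. The prepended declaration keeps the text itself untouched, while the entity route rewrites it to pure ASCII.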
