Description
There is a function in the code for unescaping HTML entities:
Lines 1353 to 1365 in 2524fe3
However, it does not handle all HTML entities, for example &rsquo; (the right single quotation mark ’).
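To illustrate the gap, here is a simplified sketch. It is not the Readability.js source (the referenced lines are not reproduced in this issue). It assumes the function recognizes only a small fixed table of named entities plus numeric references. The names ESCAPE_MAP and naiveUnescape are hypothetical.

```javascript
// Hypothetical table: an assumption about the approach, not Readability.js code.
const ESCAPE_MAP = { quot: '"', amp: "&", apos: "'", lt: "<", gt: ">" };

function naiveUnescape(str) {
  if (!str) return str;
  return str
    // Named entities: only the names listed in the table are recognized.
    .replace(/&(quot|amp|apos|lt|gt);/g, (_, name) => ESCAPE_MAP[name])
    // Numeric references, hex (&#x2019;) or decimal (&#8217;).
    .replace(/&#(?:x([0-9a-f]{1,4})|([0-9]{1,4}));/gi, (_, hex, dec) =>
      String.fromCharCode(parseInt(hex || dec, hex ? 16 : 10))
    );
}

console.log(naiveUnescape("Tom &amp; Jerry")); // Tom & Jerry
console.log(naiveUnescape("Here&#8217;s"));    // Here’s
console.log(naiveUnescape("Here&rsquo;s"));    // Here&rsquo;s (left untouched)
```

Any named entity outside the table, such as &rsquo;, passes through unchanged under this kind of approach.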
Example
The page https://www.scientificamerican.com/podcast/episode/heres-why-actors-are-so-worried-about-ai/ contains this meta tag:
<meta property="og:title" content="Here&rsquo;s Why Actors Are So Worried about AI">
The page title is extracted from it.
Special HTML entities are supposed to be unescaped by this function call, but they are not:
Line 1553 in 2524fe3
The value of metadata.title is the same before and after calling this._unescapeHtmlEntities:
Here&rsquo;s Why Actors Are So Worried about AI
Solution & Workaround
According to https://stackoverflow.com/a/34064434/8584605, a more effective (and still safe) way to unescape HTML entities would look like this:
function htmlDecode(input) {
  const doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}
Until this bug is fixed, I will use the function above to post-process the title output by Readability.js.
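A minimal sketch of that post-processing step, under some assumptions: the helper name decodeArticleTitle is hypothetical, and the typeof DOMParser guard is an addition so the snippet also loads in runtimes without a DOM (where it leaves the title unchanged).

```javascript
// Decoder from this issue, plus an assumed guard for runtimes without DOMParser.
function htmlDecode(input) {
  if (typeof DOMParser === "undefined") return input; // e.g. plain Node.js
  const doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

// Hypothetical helper: returns a copy of the parse result with a decoded title.
// Results without a string title (or a null result) are returned as-is.
function decodeArticleTitle(article) {
  if (!article || typeof article.title !== "string") return article;
  return { ...article, title: htmlDecode(article.title) };
}

// Usage in a browser, where parse() returns an object with a title property:
// const article = decodeArticleTitle(
//   new Readability(document.cloneNode(true)).parse()
// );
```

Only the title is touched here, since that is the field affected in the example above. Other metadata fields could be decoded the same way if they show the same problem.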