Not all HTML Entities are unescaped from title and other metadata (when double-escaped by websites) #820

Open
@maxpatiiuk

There is a function in the code for unescaping HTML entities:

readability/Readability.js

Lines 1353 to 1365 in 2524fe3

  _unescapeHtmlEntities: function(str) {
    if (!str) {
      return str;
    }
    var htmlEscapeMap = this.HTML_ESCAPE_MAP;
    return str.replace(/&(quot|amp|apos|lt|gt);/g, function(_, tag) {
      return htmlEscapeMap[tag];
    }).replace(/&#(?:x([0-9a-z]{1,4})|([0-9]{1,4}));/gi, function(_, hex, numStr) {
      var num = parseInt(hex || numStr, hex ? 16 : 10);
      return String.fromCharCode(num);
    });
  },

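However, it only handles five named entities (quot, amp, apos, lt, gt) plus numeric character references, so other named entities such as &rsquo; (’) are left untouched.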

Example

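The page https://www.scientificamerican.com/podcast/episode/heres-why-actors-are-so-worried-about-ai/ contains this meta tag:

<meta property="og:title" content="Here&amp;rsquo;s Why Actors Are So Worried about AI">

The page title is extracted from it. Because the HTML parser has already decoded &amp; to & by the time the attribute is read, the extracted value contains a literal &rsquo;.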

This call is supposed to unescape HTML entities in the title, but it leaves &rsquo; as it is:

metadata.title = this._unescapeHtmlEntities(metadata.title);

The value of metadata.title is the same before and after the call to this._unescapeHtmlEntities:

Here&rsquo;s Why Actors Are So Worried about AI
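
To make the behavior easy to reproduce outside Readability.js, here is a minimal standalone sketch of the same replacement logic. The ESCAPE_MAP object is an assumed stand-in for this.HTML_ESCAPE_MAP, and the free function name unescapeHtmlEntities is only for illustration; neither is taken from the library as-is.

```javascript
// Standalone copy of the replacement logic quoted above, for experimentation only.
// ESCAPE_MAP is an assumed stand-in for this.HTML_ESCAPE_MAP.
const ESCAPE_MAP = { lt: "<", gt: ">", amp: "&", quot: '"', apos: "'" };

function unescapeHtmlEntities(str) {
  if (!str) {
    return str;
  }
  return str
    // Only these five named entities are recognized.
    .replace(/&(quot|amp|apos|lt|gt);/g, (_, tag) => ESCAPE_MAP[tag])
    // Numeric references with up to four digits (decimal or hex) are decoded.
    .replace(/&#(?:x([0-9a-z]{1,4})|([0-9]{1,4}));/gi, (_, hex, numStr) => {
      const num = parseInt(hex || numStr, hex ? 16 : 10);
      return String.fromCharCode(num);
    });
}

const title = "Here&rsquo;s Why Actors Are So Worried about AI";

// The named entity &rsquo; is not in the map, so the string comes back unchanged.
console.log(unescapeHtmlEntities(title) === title);

// The numeric form of the same character is decoded.
console.log(unescapeHtmlEntities("Here&#8217;s"));
```

The numeric reference &#8217; is decoded while &rsquo; is not, which suggests the gap is limited to named entities outside the five listed in the first regular expression.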

Solution & Workaround

According to https://stackoverflow.com/a/34064434/8584605, a more effective (and still safe) way to unescape HTML entities would look like this:

function htmlDecode(input) {
  const doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

Until this bug is fixed, I will use the function above to post-process the title output by Readability.js.
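
For environments where DOMParser is not available (plain Node.js, for example), one possible alternative is to extend the lookup-table approach with a few more named entities. The sketch below is hypothetical: the names NAMED_ENTITIES and decodeEntities are made up for illustration, the table covers only a handful of common typographic entities rather than the full HTML list, and none of it is part of Readability.js.

```javascript
// Hypothetical extension of the lookup-table approach; not part of Readability.js.
// The table is deliberately small and does not cover every HTML named entity.
const NAMED_ENTITIES = {
  quot: '"', amp: "&", apos: "'", lt: "<", gt: ">",
  nbsp: "\u00A0",
  lsquo: "\u2018", rsquo: "\u2019",
  ldquo: "\u201C", rdquo: "\u201D",
  ndash: "\u2013", mdash: "\u2014",
  hellip: "\u2026",
};

function decodeEntities(str) {
  if (!str) {
    return str;
  }
  // A single pass handles hex references, decimal references, and named entities.
  return str.replace(
    /&(?:#x([0-9a-f]+)|#([0-9]+)|([a-z][a-z0-9]*));/gi,
    (match, hex, dec, name) => {
      if (name) {
        // Unknown names are left exactly as they were found.
        return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)
          ? NAMED_ENTITIES[name]
          : match;
      }
      const code = parseInt(hex || dec, hex ? 16 : 10);
      // String.fromCodePoint throws above 0x10FFFF, so leave such input untouched.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
  );
}

console.log(decodeEntities("Here&rsquo;s Why Actors Are So Worried about AI"));
```

Because everything is replaced in one pass, an input such as &amp;rsquo; is decoded only one level, to &rsquo;, which avoids accidentally double-unescaping text that was meant to show an entity literally.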
