Not all HTML Entities are unescaped from title and other metadata (when double-escaped by websites)

There is a function in the code for unescaping HTML entities:

https://github.com/mozilla/readability/blob/2524fe371da2356b0bb79e0d34b028fa23388cd3/Readability.js#L1353-L1365

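However, it only handles five named entities (`quot`, `amp`, `apos`, `lt`, `gt`) and numeric character references, so other named entities such as `&rsquo;` are left as-is.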

## Example

On the page https://www.scientificamerican.com/podcast/episode/heres-why-actors-are-so-worried-about-ai/

There is this meta tag: `<meta property="og:title" content="Here&amp;rsquo;s Why Actors Are So Worried about AI">`

The page title is extracted from it.

HTML entities in the title are supposed to be unescaped by this call, but they are not:

https://github.com/mozilla/readability/blob/2524fe371da2356b0bb79e0d34b028fa23388cd3/Readability.js#L1553

`metadata.title` is the same before and after calling `this._unescapeHtmlEntities`:

`Here&rsquo;s Why Actors Are So Worried about AI`
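The behavior can be reproduced outside Readability with a standalone copy of the function. This is only a sketch: the contents of `HTML_ESCAPE_MAP` are assumed here (the five entities named in the regex), and `unescapeHtmlEntities` is a free function rather than the library's method.

```js
// Assumed contents of this.HTML_ESCAPE_MAP: the five names the regex matches.
const HTML_ESCAPE_MAP = { lt: "<", gt: ">", amp: "&", quot: '"', apos: "'" };

// Standalone copy of _unescapeHtmlEntities, for reproduction only.
function unescapeHtmlEntities(str) {
  if (!str) {
    return str;
  }
  return str
    .replace(/&(quot|amp|apos|lt|gt);/g, (_, tag) => HTML_ESCAPE_MAP[tag])
    .replace(/&#(?:x([0-9a-z]{1,4})|([0-9]{1,4}));/gi, (_, hex, numStr) => {
      const num = parseInt(hex || numStr, hex ? 16 : 10);
      return String.fromCharCode(num);
    });
}

// The HTML parser has already turned "&amp;rsquo;" into "&rsquo;" at this point.
const title = "Here&rsquo;s Why Actors Are So Worried about AI";
console.log(unescapeHtmlEntities(title) === title); // true: nothing was unescaped
console.log(unescapeHtmlEntities("Tom &amp; Jerry")); // Tom & Jerry
```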

## Solution & Workaround

According to https://stackoverflow.com/a/34064434/8584605, a more effective (and still safe) way to unescape HTML entities would look like this:

```js
function htmlDecode(input) {
  const doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}
```

Until this bug is fixed, I will use the above function to post-process the title returned by Readability.js.
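A minimal sketch of that post-processing step follows. `decodeLeftoverEntities` and `ENTITY_PATTERN` are hypothetical names, not part of Readability. The guard is there because parsing a string as HTML also drops anything that looks like a tag, so it seems safer to decode only when an entity is actually present. `DOMParser` is assumed to be available (browser or a DOM shim).

```js
// Matches a named or numeric character reference, e.g. &rsquo; &#8217; &#x2019;
const ENTITY_PATTERN = /&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/i;

// Decoder from the Stack Overflow answer; requires a DOMParser implementation.
function htmlDecode(input) {
  const doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

// Hypothetical helper: decode only when the text still contains an entity.
// The decoder is injectable so the guard can be exercised without a DOM.
function decodeLeftoverEntities(text, decode = htmlDecode) {
  if (!text || !ENTITY_PATTERN.test(text)) {
    return text;
  }
  return decode(text);
}

// Usage, assuming `document` is the page being parsed:
//   const article = new Readability(document).parse();
//   article.title = decodeLeftoverEntities(article.title);
```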

Referenced snippet (`Readability.js`, lines 1353-1365 at the commit linked above):

	_unescapeHtmlEntities: function(str) {
	  if (!str) {
	    return str;
	  }

	  var htmlEscapeMap = this.HTML_ESCAPE_MAP;
	  return str.replace(/&(quot|amp|apos|lt|gt);/g, function(_, tag) {
	    return htmlEscapeMap[tag];
	  }).replace(/&#(?:x([0-9a-z]{1,4})|([0-9]{1,4}));/gi, function(_, hex, numStr) {
	    var num = parseInt(hex || numStr, hex ? 16 : 10);
	    return String.fromCharCode(num);
	  });
	},
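For environments without `DOMParser`, one possible direction is to widen the lookup used by `_unescapeHtmlEntities`. The sketch below is not the project's fix: `NAMED_ENTITIES` and `unescapeEntities` are hypothetical names, and the map lists only a handful of common typographic entities, not the full HTML set. Unknown names are left untouched.

```js
// Hypothetical, deliberately incomplete table of named entities.
const NAMED_ENTITIES = new Map([
  ["lt", "<"], ["gt", ">"], ["amp", "&"], ["quot", '"'], ["apos", "'"],
  ["lsquo", "\u2018"], ["rsquo", "\u2019"],
  ["ldquo", "\u201C"], ["rdquo", "\u201D"],
  ["ndash", "\u2013"], ["mdash", "\u2014"],
  ["hellip", "\u2026"], ["nbsp", "\u00A0"],
]);

// Decodes named and numeric references in a single pass, so the output of
// one replacement is never decoded a second time.
function unescapeEntities(str) {
  if (!str) {
    return str;
  }
  const pattern = /&(?:#x([0-9a-f]+)|#([0-9]+)|([a-z][a-z0-9]*));/gi;
  return str.replace(pattern, (match, hex, dec, name) => {
    if (name !== undefined) {
      // Entity names are case-sensitive; unknown names stay as written.
      return NAMED_ENTITIES.has(name) ? NAMED_ENTITIES.get(name) : match;
    }
    const codePoint = parseInt(hex || dec, hex ? 16 : 10);
    // Leave out-of-range references alone instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(unescapeEntities("Here&rsquo;s Why Actors Are So Worried about AI"));
// Here’s Why Actors Are So Worried about AI
```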
