Feed item content gets unescaped #209

@drweissbrot

I'm trying to parse a feed and render its contents on a website. The feed sometimes contains HTML code blocks (think tutorial posts explaining how to do something in HTML).

Take this example feed for instance:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
	<channel>
		<item>
			<content:encoded><![CDATA[
				<pre><code>&lt;div class="wrapper">Lorem ipsum dolor sit amet&lt;/div></code></pre>
			]]></content:encoded>
		</item>
	</channel>
</rss>

Intuitively, I expected that parseFeed(xml).items[0].content would return something like:

<pre><code>&lt;div class="wrapper">Lorem ipsum dolor sit amet&lt;/div></code></pre>

Instead, the text for content gets unescaped (in both the RSS and the Atom code paths), and this is returned:

<pre><code><div class="wrapper">Lorem ipsum dolor sit amet</div></code></pre>

While I do want the outer <pre> and <code> tags to be rendered as proper HTML tags on the final page, I want the inner div kept verbatim, i.e. as &lt;div class="wrapper">, so that it is rendered as text on the final website.

I made the changes to suit my needs in this commit, including some tests. I was unable to get most of the integration tests to actually pass, since the feedparser library (used to process feeds in tests) seems to unescape HTML in the same way, with no option to turn it off.

The way I did it would also be a breaking change. To avoid that, and assuming you even want to support this use case, perhaps we could add an options parameter to the parseFeed function to opt out of unescaping?
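
To make the proposal concrete, here is a rough, self-contained sketch of what an opt-out could look like. It is not the library's code: readContent, decodeEntities, and the unescapeContent option name are all hypothetical, and the entity table only covers the few entities needed for this example. Defaulting the option to true would keep the current behaviour, so the change would not be breaking.

```javascript
// Hypothetical sketch: none of these names come from the library itself.

// Only the handful of entities needed for this example.
const ENTITIES = {
	'&lt;': '<',
	'&gt;': '>',
	'&quot;': '"',
	'&#39;': "'",
	'&amp;': '&'
};

// Decode entities in a single pass, so "&amp;lt;" becomes "&lt;" and not "<".
function decodeEntities(text) {
	return text.replace(/&(?:lt|gt|quot|#39|amp);/g, (entity) => ENTITIES[entity]);
}

// `unescapeContent` is an assumed option name.
// It defaults to true, which mirrors the behaviour described in this issue.
function readContent(rawText, { unescapeContent = true } = {}) {
	const trimmed = rawText.trim();
	return unescapeContent ? decodeEntities(trimmed) : trimmed;
}

// The text inside the CDATA section of the example feed.
const raw = `
	<pre><code>&lt;div class="wrapper">Lorem ipsum dolor sit amet&lt;/div></code></pre>
`;

// Current behaviour: the inner div is turned into real markup.
console.log(readContent(raw));

// Opt-out: the inner div stays escaped and would render as text.
console.log(readContent(raw, { unescapeContent: false }));
```

With something along these lines, parseFeed(xml, { unescapeContent: false }) could return the content exactly as it appears in the feed, while callers who pass no options would see no change.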
