autolink for non-HTTP URIs, and other non-tag content, produces invalid XML #1244

Open
@nxg

Consider the following:

import markdown

md = '''
Here are some elements:

  * url <http://example.org>
  * repo url <ssh://example.org>, which is a non-HTTP URL
  * and <urn:foo> is something else
  * ssh url2 <ssh:me@example.org>, handled as an email address
  * misc element <em>boo!</em>
'''

converter = markdown.Markdown()
print(converter.convert(md))

This renders as

<p>Here are some elements:</p>
<ul>
<li>url <a href="http://example.org">http://example.org</a></li>
<li>repo url <ssh://example.org>, which is a non-HTTP URL</li>
<li>and <urn:foo> is something else</li>
<li>ssh url2 <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#115;&#115;&#104;&#58;&#109;&#101;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#111;&#114;&#103;">&#115;&#115;&#104;&#58;&#109;&#101;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#111;&#114;&#103;</a>, handled as an email address</li>
<li>misc element <em>boo!</em></li>
</ul>

I think items 2 and 3 are incorrect: (a) the behaviour doesn't match two significant Markdown specs, and (b) both are invalid XML (yes, <urn:foo> looks like an XML element with a namespace prefix; let's not go there...).

The autolink feature in the Daring Fireball spec is ‘for URLs and email addresses’ (though the only URL in that example is an HTTP URL). The corresponding section in the CommonMark spec says that the autolink should happen for an absolute URI. So the second case should be turned into <a href='ssh://example.org'>ssh://example.org</a>.

What appears to be happening, instead, is that this is being interpreted as literal HTML. The relevant section of Gruber's spec is rather vague, but the corresponding part of the CommonMark spec says that this should happen only to ‘[t]ext between < and > that looks like an HTML tag’, which of course <ssh://example.org> doesn't (CommonMark: ‘A tag name consists of an ASCII letter followed by zero or more ASCII letters, digits, or hyphens (-)’).

Independently of any spec, however, having <ssh://example.org> appear in the output means that the output is syntactically invalid, and I feel this shouldn't happen for any input, however insane.
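
The invalidity can be checked mechanically. Below is a minimal sketch, assuming that wrapping a fragment in a dummy root element and handing it to the standard-library XML parser is an acceptable well-formedness test; the helper name is_well_formed is made up for illustration and is not part of Python-Markdown.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML inside a dummy root element."""
    try:
        # The <root> wrapper is only there so the fragment has a single parent.
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


# Output lines taken from the rendering shown above.
item1 = '<li>url <a href="http://example.org">http://example.org</a></li>'
item2 = '<li>repo url <ssh://example.org>, which is a non-HTTP URL</li>'
item3 = '<li>and <urn:foo> is something else</li>'

for item in (item1, item2, item3):
    print(is_well_formed(item), item)
```

Item 1 parses, while items 2 and 3 are rejected by the parser.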

Suggestion:

  • When <starttag> consists of something other than [a-zA-Z][a-zA-Z0-9-]*, then it is either a URI, in which case it should be turned into an <a> element, or it is not, in which case it should be included literally in the output, as if the content were instead enclosed in backticks.

This would imply that item 3 should render as <code>urn:foo</code>.
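
The suggested rule can be sketched as follows. This is an illustration only, not Python-Markdown's actual inline processing: the function name render_angle_content is hypothetical, it only looks at a bare tag name (no attributes or closing tags), it leaves the email-address case out, and it assumes that "a URI" means a scheme followed by ://. That last assumption is narrower than CommonMark's absolute-URI definition, under which urn:foo would, as far as I can tell, also qualify as an autolink; it is chosen here so that item 3 comes out as proposed above.

```python
import re
from html import escape

# Tag-name pattern quoted from the CommonMark wording above.
TAG_NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9-]*")

# ASSUMPTION: a "URI" is a scheme followed by "://" and at least one
# character that is not whitespace, "<" or ">".
HIERARCHICAL_URI = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://[^\s<>]+")


def render_angle_content(content: str) -> str:
    """Render the text found between '<' and '>' according to the suggestion."""
    if TAG_NAME.fullmatch(content):
        # Looks like an HTML tag name: pass it through as raw HTML.
        return f"<{content}>"
    if HIERARCHICAL_URI.fullmatch(content):
        # Looks like a URI: turn it into a link.
        url = escape(content, quote=True)
        return f'<a href="{url}">{url}</a>'
    # Anything else: include it literally, as if it had been in backticks.
    return f"<code>{escape(content)}</code>"


for sample in ("http://example.org", "ssh://example.org", "urn:foo", "em"):
    print(render_angle_content(sample))
```

With this sketch, ssh://example.org becomes an <a> element, urn:foo becomes <code>urn:foo</code>, and em is passed through as a tag, so none of the three produces ill-formed output.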

Labels

  • confirmed: Confirmed bug report or approved feature request.
  • core: Related to the core parser code.
  • feature: Feature request.
  • someday-maybe: Approved low priority request.