Description
Describe the issue
XML allows Unicode letters in element and attribute names, but the xml.js language mode uses a regular expression that only checks for the ASCII letters A-Z.
As a result, XML with non-ASCII letters in element or attribute names does not get highlighted. For example, in <categoría>producto</categoría> the Spanish word categoría, which is a well-formed XML element name, is not recognized as such by the regular expression
const TAG_NAME_RE = regex.concat(/[A-Z_]/, regex.optional(/[A-Z0-9_.-]*:/), /[A-Z0-9_.-]*/);
in https://github.com/highlightjs/highlight.js/blob/main/src/languages/xml.js#L12
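The mismatch can be reproduced without highlight.js. The sketch below flattens TAG_NAME_RE into a single plain RegExp; the anchors, the `i` flag (assumed to mirror a case-insensitive grammar) and the helper name `isAsciiTagName` are illustrative assumptions, not part of xml.js.

```javascript
// Plain-RegExp approximation of TAG_NAME_RE from xml.js.
// Assumptions: ^...$ anchors and the `i` flag are added for this standalone test.
const asciiTagName = /^[A-Z_](?:[A-Z0-9_.-]*:)?[A-Z0-9_.-]*$/i;

// Hypothetical helper: true if the whole string matches the ASCII-only pattern.
function isAsciiTagName(name) {
  return asciiTagName.test(name);
}

console.log(isAsciiTagName('category'));       // true
console.log(isAsciiTagName('xsl:template'));   // true
console.log(isAsciiTagName('categor\u00EDa')); // false: "í" (U+00ED) is outside A-Z
```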
Which language seems to have the issue?
XML from https://github.com/highlightjs/highlight.js/blob/main/src/languages/xml.js
Are you using highlight or highlightAuto?
highlight
Sample Code to Reproduce
console.log(hljs.highlight(`
<root>
<categoría>test</categoría>
<category>test</category>
</root>`, {language: 'xml'}).value)
Expected behavior
The output for the XML markup <categoría>test</categoría> currently is
&lt;categoría&gt;test&lt;/categoría&gt;
while it should be
<span class="hljs-tag">&lt;<span class="hljs-name">categoría</span>&gt;</span>test<span class="hljs-tag">&lt;/<span class="hljs-name">categoría</span>&gt;</span>
Additional context
See the https://www.w3.org/TR/xml/#NT-NameStartChar and https://www.w3.org/TR/xml/#NT-NameChar definitions from the XML spec. I think it should be possible to fix the regular expressions used in xml.js, either by using the character ranges given in the XML spec or, if Unicode support in JavaScript regular expressions is used, by using e.g. \p{Letter} instead of A-Z.
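Both options can be sketched as standalone regular expressions. This is only an illustration under stated assumptions: the names `unicodeTagName`, `specTagName`, `NAME_START`, `NAME_CHAR` and the two helper functions are hypothetical, the anchors are added for testing, and neither pattern is what xml.js actually ships. The `\p{L}` variant is an approximation of the spec (it omits, for example, combining marks that NameChar permits), while the range variant transcribes NameStartChar and NameChar with ":" left out so that it can act as the prefix separator, as in TAG_NAME_RE.

```javascript
// Variant 1: Unicode property escapes (\p{L} is short for \p{Letter}); needs the `u` flag.
const unicodeTagName = /^[\p{L}_](?:[\p{L}0-9_.-]*:)?[\p{L}0-9_.-]*$/u;

// Variant 2: character ranges from the XML spec, ":" excluded (handled as the separator).
const NAME_START =
  'A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}' +
  '\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}' +
  '\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}';
const NAME_CHAR = NAME_START + '0-9.\\-\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}';
const specTagName = new RegExp(
  `^[${NAME_START}](?:[${NAME_CHAR}]*:)?[${NAME_CHAR}]*$`,
  'u'
);

// Hypothetical helpers wrapping the two patterns.
function isUnicodeTagName(name) {
  return unicodeTagName.test(name);
}
function isSpecTagName(name) {
  return specTagName.test(name);
}

console.log(isUnicodeTagName('categor\u00EDa')); // true
console.log(isSpecTagName('categor\u00EDa'));    // true
console.log(isSpecTagName('1abc'));              // false: a digit cannot start a name
```

The property-escape variant is much shorter but requires an engine that supports the `u` flag; the range variant follows the spec more closely at the cost of readability.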