enh(xml) Allow Unicode attribute and tag names #3257

martin-honnen · 2021-06-25T06:49:16Z

Fixing regular expressions TAG_NAME_RE and XML_IDENT_RE to include non-ASCII letters the XML spec allows as Unicode range \uDDDD-\uDDDD, as far as JavaScript allows that with four digit hex codes

This is my attempt to fix issue 3256 #3256, instead of using only ASCII letters I fixed the regular expressions to include non-ASCII letters from the XML spec, expressing them as Unicode range \uDDDD-\uDDDD

Changes

Included non-ASCII letters the XML spec allows as Unicode ranges in regular expressions TAG_NAME_RE and XML_IDENT_RE

Checklist

Added markup test non-ascii-element-name.txt
Updated the changelog at CHANGES.md

…llows in names. Adding test case.

…ML spec allows as Unicode ranges in the regular expression

…mented original regular expressions

martin-honnen · 2021-06-26T08:28:37Z

@joshgoebel , do I need to request a review from a certain person or team or will someone pick up this pull request automatically and do the required review?

joshgoebel · 2021-06-29T20:54:50Z

I'm worried this might fall into the same category as: #2756

I wonder if some of the newer \p{} UTF-8 features aren't the more correct way to handle this, but there are observed performance issues (see issue). There are also performance implications of making these rules more complex. I realize this may be "technically" correct, but it's hard to justify if it makes things slower for 99% of people not using UTF-8 in tagnames.

Before even considering something like this I'd want to consider alternative approaches... what about splitting the rule:

<A-Z (same rules we use now, relevance 1)
<[no spaces *][space] (relevance 0, but highlights ALL contiguous characters without resorting to complex UTF-8 regex)

Auto-detect will continue to work for ASCII HTML (which is most common) but highlighting would also still work on unicode.

Thoughts?

joshgoebel · 2021-06-29T20:55:42Z

What is an example where the XML_IDENT_RE change matters? I only see TAG_NAME tests in your test changes.

martin-honnen · 2021-06-29T21:27:40Z

The rule change for XML_IDENT_RE ensures attribute names using non-ASCII letters are marked up correctly, currently

var result = hljs.highlight('<producto categoría="mueble">mesa</producto>', {language: 'xml'}).value;

gives

<span class="hljs-tag">&lt;<span class="hljs-name">producto</span> <span class="hljs-attr">categor</span>í<span class="hljs-attr">a</span>=<span class="hljs-string">&quot;mueble&quot;</span>&gt;</span>mesa<span class="hljs-tag">&lt;/<span class="hljs-name">producto</span>&gt;</span>

joshgoebel · 2021-06-29T22:24:50Z

XML_IDENT_RE ensures attribute names

Ah yes, I'd imagine this could be handled the same way as I suggest for tag names...

[a-z]= simple rule for relevance (with ASCII attributes)
[not whitespace]= for 100% highlighting coverage, but no relevance

Probably should be using the new multi-scope stuff for attributes anyways. Until we have performance regressions tests for this sort of thing I'm much more comfortable finding ways to avoid all these UTF-8 variants in cases where that's simple to do, as it seems to be here.

joshgoebel · 2021-06-29T22:40:02Z

Example:

      {
        scope: 'tag',
        beginScope: {
          2: 'name'
        },
        variants: [
          {
            begin: [
              /<\//,
              TAG_NAME_RE, // handle ASCII tag names
              />/
            ],
          },
          {
            relevance: 0,
            begin: [
              /<\//,
              /\S+/, // handle non-ASCII tag names
              />/
            ],
          }
        ]
      }

Though I'm not 100% sure we need this dual-mode for closing tags. That's the idea though, and an example of multi-scope.

joshgoebel · 2021-07-03T01:16:28Z

@martin-honnen Would you want to make another pass at this going the route I suggested - or should we close this and leave the issue open until I or someone else can take a look?

martin-honnen · 2021-07-03T06:46:35Z

I haven't been able to think your suggestion through and test it as it tackles the problem from a quite different point of view, might be a pragmatic way. So if you don't want to take the pull request for performance problems the longish but precise regular expressions might cause, close it, I think. And see whether you or someone else can fix the issue using another route.

joshgoebel · 2021-07-26T21:24:59Z

This may want to wait until after we see how #3280 goes... perhaps once we have some benchmarks we can benchmark this before/after and see if the performance is comparable or not... if my fears of performance changes are unfounded then this could be fine as-is...

Indeed using regex (as in this PR) is a more "proper" approach - so if we can have our cake and eat it too we certainly will want to... :-)

joshgoebel · 2021-08-28T16:34:27Z

Ok, since #3280 was ~~merged~~ approved for merging... my first question here is can we use the new UTF-8 regex stuff instead of these super long regex forms?

martin-honnen · 2021-08-28T17:10:58Z

I think my initial attempt was along the lines of e.g. var xmlName = /^\p{L}[\p{L}0-9\-.]*(:[\p{L}0-9\-.]+)?$/u which should do more justice to non-ASCII letters than the existing regular expression while perhaps not being completely in sync with that the XML spec considers a name letter.

I have asked some other JavaScript/XML based implementers what they use and whether the \p{L} based approach would suffice, none was willing, as least as I recall, to judge the compliance issue, the current practice seems to be something like https://github.com/bwrrp/slimdom.js/blob/main/src/util/namespaceHelpers.ts#L132 which uses a long regex.

But I think for the syntax highlighter, if we can allow Unicode letters instead of ASCII letters, the detrimental effect current XML vocabularies with non-ASCII letters see, will be gone and the number of users that throw that exotic characters at the highlighter that the XML names and the Unicode letter categories distinguish them are probably very low.

Thus, I think avoiding the long regex but relying on Unicode categories and Javascript RegEx Unicode support, is a good compromise.

joshgoebel · 2021-08-28T17:32:16Z

I'll try to push #3280 to home plate soon... then we can switch this PR to UTF-8 regex and rebase it. The comments about the "official" specs can remain and we'll explain in comment that for simplicity we're using a much simpler regex that should work out the same in practice.

If this fails to be the case in the future we can re-review.

The test should be expanded though to include both tag names and attribute names.
An actual real example would be better than a random character inserted (can we just copy a UTF-8 Spanish word or something from another source?)

martin-honnen · 2021-08-28T17:41:33Z

I am not sure what you consider a real example or where you see a random character, categoría is the correct and official spelling of a real Spanish word: https://dle.rae.es/categor%C3%ADa?m=form.

joshgoebel · 2021-08-28T17:42:24Z

Oh, my apologies then. I guess I thought GitHub would do a better job of displaying it properly and was making assumptions based on it's ugly appearance in the review screen.

joshgoebel · 2021-09-01T04:34:09Z

#3280 has been merged so I think this could move forward now.

joshgoebel · 2021-09-29T14:00:07Z

Closing for inactivity. Happy to pursue this again in the future if you ever want to reopen and have time to work on.

martin-honnen added 5 commits June 24, 2021 19:17

test case with non-ascii character in XML element name

5abdcd1

Fixing TAG_NAME_RE in xml.js to include Unicode ranges the XML spec a…

686df4f

…llows in names. Adding test case.

Fixing XML_IDENT_RE in xml.js to also include non-ASCII letters the X…

9120095

…ML spec allows as Unicode ranges in the regular expression

Some clean-up of explaining comments and removal of previously outcom…

931aa55

…mented original regular expressions

Updating CHANGES.md

079105d

joshgoebel changed the title ~~Fix for issue https://github.com/highlightjs/highlight.js/issues/3256~~ enh(xml) Allow Unicode attribute and tag names Jul 26, 2021

austin-schick mentioned this pull request Jul 27, 2021

enh(python) Add support for unicode identifiers #3280

Merged

2 tasks

joshgoebel closed this Sep 29, 2021

joshgoebel mentioned this pull request Mar 1, 2022

(XML) Unicode letters should be recognized in element and attribute names #3256

Closed

enh(xml) Allow Unicode attribute and tag names #3257

enh(xml) Allow Unicode attribute and tag names #3257

Uh oh!

Conversation

martin-honnen commented Jun 25, 2021

Changes

Checklist

Uh oh!

martin-honnen commented Jun 26, 2021

Uh oh!

joshgoebel commented Jun 29, 2021

Uh oh!

joshgoebel commented Jun 29, 2021

Uh oh!

martin-honnen commented Jun 29, 2021

Uh oh!

joshgoebel commented Jun 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joshgoebel commented Jun 29, 2021

Uh oh!

joshgoebel commented Jul 3, 2021

Uh oh!

martin-honnen commented Jul 3, 2021

Uh oh!

joshgoebel commented Jul 26, 2021

Uh oh!

joshgoebel commented Aug 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-honnen commented Aug 28, 2021

Uh oh!

joshgoebel commented Aug 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-honnen commented Aug 28, 2021

Uh oh!

joshgoebel commented Aug 28, 2021

Uh oh!

joshgoebel commented Sep 1, 2021

Uh oh!

joshgoebel commented Sep 29, 2021

Uh oh!

Uh oh!

joshgoebel commented Jun 29, 2021 •

edited

Loading

joshgoebel commented Aug 28, 2021 •

edited

Loading

joshgoebel commented Aug 28, 2021 •

edited

Loading