Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh(xml) Allow Unicode attribute and tag names #3257

Closed
wants to merge 5 commits into from

Conversation

martin-honnen
Copy link
Contributor

Fixing regular expressions TAG_NAME_RE and XML_IDENT_RE to include non-ASCII letters the XML spec allows as Unicode range \uDDDD-\uDDDD, as far as JavaScript allows that with four digit hex codes

This is my attempt to fix issue 3256 #3256, instead of using only ASCII letters I fixed the regular expressions to include non-ASCII letters from the XML spec, expressing them as Unicode range \uDDDD-\uDDDD

Changes

Included non-ASCII letters the XML spec allows as Unicode ranges in regular expressions TAG_NAME_RE and XML_IDENT_RE

Checklist

  • Added markup test non-ascii-element-name.txt
  • Updated the changelog at CHANGES.md

@martin-honnen
Copy link
Contributor Author

@joshgoebel , do I need to request a review from a certain person or team or will someone pick up this pull request automatically and do the required review?

@joshgoebel
Copy link
Member

I'm worried this might fall into the same category as: #2756

I wonder if some of the newer \p{} UTF-8 features aren't the more correct way to handle this, but there are observed performance issues (see issue). There are also performance implications of making these rules more complex. I realize this may be "technically" correct, but it's hard to justify if it makes things slower for 99% of people not using UTF-8 in tagnames.

Before even considering something like this I'd want to consider alternative approaches... what about splitting the rule:

<A-Z (same rules we use now, relevance 1)
<[no spaces *][space] (relevance 0, but highlights ALL contiguous characters without resorting to complex UTF-8 regex)

Auto-detect will continue to work for ASCII HTML (which is most common) but highlighting would also still work on unicode.

Thoughts?

@joshgoebel
Copy link
Member

What is an example where the XML_IDENT_RE change matters? I only see TAG_NAME tests in your test changes.

@martin-honnen
Copy link
Contributor Author

The rule change for XML_IDENT_RE ensures attribute names using non-ASCII letters are marked up correctly, currently

var result = hljs.highlight('<producto categoría="mueble">mesa</producto>', {language: 'xml'}).value;

gives

<span class="hljs-tag">&lt;<span class="hljs-name">producto</span> <span class="hljs-attr">categor</span>í<span class="hljs-attr">a</span>=<span class="hljs-string">&quot;mueble&quot;</span>&gt;</span>mesa<span class="hljs-tag">&lt;/<span class="hljs-name">producto</span>&gt;</span>

@joshgoebel
Copy link
Member

joshgoebel commented Jun 29, 2021

XML_IDENT_RE ensures attribute names

Ah yes, I'd imagine this could be handled the same way as I suggest for tag names...

  • [a-z]= simple rule for relevance (with ASCII attributes)
  • [not whitespace]= for 100% highlighting coverage, but no relevance

Probably should be using the new multi-scope stuff for attributes anyways. Until we have performance regressions tests for this sort of thing I'm much more comfortable finding ways to avoid all these UTF-8 variants in cases where that's simple to do, as it seems to be here.

@joshgoebel
Copy link
Member

Example:

      {
        scope: 'tag',
        beginScope: {
          2: 'name'
        },
        variants: [
          {
            begin: [
              /<\//,
              TAG_NAME_RE, // handle ASCII tag names
              />/
            ],
          },
          {
            relevance: 0,
            begin: [
              /<\//,
              /\S+/, // handle non-ASCII tag names
              />/
            ],
          }
        ]
      }

Though I'm not 100% sure we need this dual-mode for closing tags. That's the idea though, and an example of multi-scope.

@joshgoebel
Copy link
Member

@martin-honnen Would you want to make another pass at this going the route I suggested - or should we close this and leave the issue open until I or someone else can take a look?

@martin-honnen
Copy link
Contributor Author

I haven't been able to think your suggestion through and test it as it tackles the problem from a quite different point of view, might be a pragmatic way. So if you don't want to take the pull request for performance problems the longish but precise regular expressions might cause, close it, I think. And see whether you or someone else can fix the issue using another route.

@joshgoebel
Copy link
Member

This may want to wait until after we see how #3280 goes... perhaps once we have some benchmarks we can benchmark this before/after and see if the performance is comparable or not... if my fears of performance changes are unfounded then this could be fine as-is...

Indeed using regex (as in this PR) is a more "proper" approach - so if we can have our cake and eat it too we certainly will want to... :-)

@joshgoebel joshgoebel changed the title Fix for issue https://github.com/highlightjs/highlight.js/issues/3256 enh(xml) Allow Unicode attribute and tag names Jul 26, 2021
@joshgoebel
Copy link
Member

joshgoebel commented Aug 28, 2021

Ok, since #3280 was merged approved for merging... my first question here is can we use the new UTF-8 regex stuff instead of these super long regex forms?

@martin-honnen
Copy link
Contributor Author

I think my initial attempt was along the lines of e.g. var xmlName = /^\p{L}[\p{L}0-9\-.]*(:[\p{L}0-9\-.]+)?$/u which should do more justice to non-ASCII letters than the existing regular expression while perhaps not being completely in sync with that the XML spec considers a name letter.

I have asked some other JavaScript/XML based implementers what they use and whether the \p{L} based approach would suffice, none was willing, as least as I recall, to judge the compliance issue, the current practice seems to be something like https://github.com/bwrrp/slimdom.js/blob/main/src/util/namespaceHelpers.ts#L132 which uses a long regex.

But I think for the syntax highlighter, if we can allow Unicode letters instead of ASCII letters, the detrimental effect current XML vocabularies with non-ASCII letters see, will be gone and the number of users that throw that exotic characters at the highlighter that the XML names and the Unicode letter categories distinguish them are probably very low.

Thus, I think avoiding the long regex but relying on Unicode categories and Javascript RegEx Unicode support, is a good compromise.

@joshgoebel
Copy link
Member

joshgoebel commented Aug 28, 2021

I'll try to push #3280 to home plate soon... then we can switch this PR to UTF-8 regex and rebase it. The comments about the "official" specs can remain and we'll explain in comment that for simplicity we're using a much simpler regex that should work out the same in practice.

If this fails to be the case in the future we can re-review.

  • The test should be expanded though to include both tag names and attribute names.
  • An actual real example would be better than a random character inserted (can we just copy a UTF-8 Spanish word or something from another source?)

@martin-honnen
Copy link
Contributor Author

I am not sure what you consider a real example or where you see a random character, categoría is the correct and official spelling of a real Spanish word: https://dle.rae.es/categor%C3%ADa?m=form.

@joshgoebel
Copy link
Member

Oh, my apologies then. I guess I thought GitHub would do a better job of displaying it properly and was making assumptions based on it's ugly appearance in the review screen.

@joshgoebel
Copy link
Member

#3280 has been merged so I think this could move forward now.

@joshgoebel
Copy link
Member

Closing for inactivity. Happy to pursue this again in the future if you ever want to reopen and have time to work on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants