-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enh(xml) Allow Unicode attribute and tag names #3257
Conversation
…llows in names. Adding test case.
…ML spec allows as Unicode ranges in the regular expression
…mented original regular expressions
@joshgoebel , do I need to request a review from a certain person or team or will someone pick up this pull request automatically and do the required review? |
I'm worried this might fall into the same category as: #2756 I wonder if some of the newer Before even considering something like this I'd want to consider alternative approaches... what about splitting the rule:
Auto-detect will continue to work for ASCII HTML (which is most common) but highlighting would also still work on unicode. Thoughts? |
What is an example where the XML_IDENT_RE change matters? I only see TAG_NAME tests in your test changes. |
The rule change for XML_IDENT_RE ensures attribute names using non-ASCII letters are marked up correctly, currently
gives
|
Ah yes, I'd imagine this could be handled the same way as I suggest for tag names...
Probably should be using the new multi-scope stuff for attributes anyways. Until we have performance regressions tests for this sort of thing I'm much more comfortable finding ways to avoid all these UTF-8 variants in cases where that's simple to do, as it seems to be here. |
Example:
Though I'm not 100% sure we need this dual-mode for closing tags. That's the idea though, and an example of multi-scope. |
@martin-honnen Would you want to make another pass at this going the route I suggested - or should we close this and leave the issue open until I or someone else can take a look? |
I haven't been able to think your suggestion through and test it as it tackles the problem from a quite different point of view, might be a pragmatic way. So if you don't want to take the pull request for performance problems the longish but precise regular expressions might cause, close it, I think. And see whether you or someone else can fix the issue using another route. |
This may want to wait until after we see how #3280 goes... perhaps once we have some benchmarks we can benchmark this before/after and see if the performance is comparable or not... if my fears of performance changes are unfounded then this could be fine as-is... Indeed using regex (as in this PR) is a more "proper" approach - so if we can have our cake and eat it too we certainly will want to... :-) |
Ok, since #3280 was |
I think my initial attempt was along the lines of e.g. I have asked some other JavaScript/XML based implementers what they use and whether the But I think for the syntax highlighter, if we can allow Unicode letters instead of ASCII letters, the detrimental effect current XML vocabularies with non-ASCII letters see, will be gone and the number of users that throw that exotic characters at the highlighter that the XML names and the Unicode letter categories distinguish them are probably very low. Thus, I think avoiding the long regex but relying on Unicode categories and Javascript RegEx Unicode support, is a good compromise. |
I'll try to push #3280 to home plate soon... then we can switch this PR to UTF-8 regex and rebase it. The comments about the "official" specs can remain and we'll explain in comment that for simplicity we're using a much simpler regex that should work out the same in practice. If this fails to be the case in the future we can re-review.
|
I am not sure what you consider a real example or where you see a random character, |
Oh, my apologies then. I guess I thought GitHub would do a better job of displaying it properly and was making assumptions based on it's ugly appearance in the review screen. |
#3280 has been merged so I think this could move forward now. |
Closing for inactivity. Happy to pursue this again in the future if you ever want to reopen and have time to work on. |
Fixing regular expressions
TAG_NAME_RE
andXML_IDENT_RE
to include non-ASCII letters the XML spec allows as Unicode range\uDDDD-\uDDDD
, as far as JavaScript allows that with four digit hex codesThis is my attempt to fix issue 3256 #3256, instead of using only ASCII letters I fixed the regular expressions to include non-ASCII letters from the XML spec, expressing them as Unicode range
\uDDDD-\uDDDD
Changes
Included non-ASCII letters the XML spec allows as Unicode ranges in regular expressions
TAG_NAME_RE
andXML_IDENT_RE
Checklist
CHANGES.md