Skip to content

Conversation

ggrossetie
Copy link
Contributor

@ggrossetie ggrossetie commented Nov 15, 2020

Changes

resolves #2869

Checklist

  • Added markup tests
  • Updated the changelog at CHANGES.md
  • Added myself to AUTHORS.txt, under Contributors

// inline unconstrained strong
{
className: 'strong',
begin: /\*{2}/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to escape an * here?

Copy link
Contributor Author

@ggrossetie ggrossetie Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can escape but I'm not sure I quite understand the existing rule:

// allow escaped asterisk followed by word char
contains: [
  {
    begin: '\\\\*\\w',
    relevance: 0
  }
]

The following should not be highlighted:

this is a \*strong* word.
this is a str*\*on**g word.

But the last example is pretty uncommon (and fragile). In this case, it would be better to use substitution:

:stars: **
this is a str{stars}on{stars}g word.

Since the AsciiDoc specification is underway, I would ignore these edge cases for now because I'm not even sure what is the expected result.

Here's a few examples, with the input and the rendered result:

escaped

more

As you can see it won't be easy to implement the current behavior...

@mojavelinux thoughts? maybe that's the reason why you didn't include unconstrained syntax in the first place 😉

Copy link
Member

@joshgoebel joshgoebel Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I quite understand the existing rule:

It's simply allowing \* inside of bold so that it doesn't trigger "end bold". I'm not sure why the requirement of "one character after"... that might be specific to asciidoc. I brought it up since we already handle this for unconstrained so it seems we should also for constrained unless for some reason it's only really used in one type but not the other.

with the input and the rendered result:

We're less interested in the rendered result. We are not a rendering engine. We're trying to highlight the semantics of the language. So strong might not be strong at all, it might be red or blue. We aren't trying to accurately flag what should be strong though, but it's best to remember we'r'e highlighting, not rendering - we're never adding or removing any text.

this is a str**o**n**g word.

This would be broken with your current PR as the **g word whatever would be bold (if we can find two linebreaks anywhere else). I wonder if perhaps we don't need to handle \n and \n\n separately if they have entirely different behavior?

How does it know everything after the 3rd ** isn't bold?

this is a \*strong* word.

We'd need to add a general rule to eat a \* sequence then when it's found to avoid triggering any of the other rules. This should be a test case no doubt.


We should really start here with which cases are reasonable to handle, add markup tests. Then see if we can make those tests green. :)

Copy link
Member

@joshgoebel joshgoebel Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The markup tests should include the negative/edge cases too, to make sure we properly handle them, etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really start here with which cases are reasonable to handle, add markup tests. Then see if we can make those tests green. :)

💯 I will add tests.

We're less interested in the rendered result. We are not a rendering engine.

I know, it was just easier to share every variations.

The markup tests should include the negative/edge cases too, to make sure we properly handle them, etc...

👍

Comment on lines 108 to 116
{
className: 'strong',
begin: /\*{2}/,
end: /(\n{2}|\*{2})/
},
Copy link
Member

@joshgoebel joshgoebel Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think we're going to need a variant here for the \n possibility, it's too "wide"... it can match too much, making it not worthy of relevance. ** is used in many languages as a math operation.

Suggested change
{
className: 'strong',
begin: /\*{2}/,
end: /(\n{2}|\*{2})/
},
{
className: 'strong',
begin: /\*{2}/,
variants: [
{ end: /\n{2}/, relevance: 0 },
{ end: /\*{2}/ }
]
},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, although now that I think of it ** anything ** is also VERY broad... can these be used without text? ALl your examples show them surrounding text immediately if so we should add that as a condition here with a look ahead:

begin: /\*{2}(?=\w+)/

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it's unconstrained, you can have spaces, a single newline...:

A sentence ** with words, and more words!  
using bold style :)**.

result

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I glanced at your link now and roughly understand they are different. Here I'm speaking only of auto-detect... we don't just need to know how to highlight Asciidoc but we also have to avoid confusing it with say Javascript... so every rule (that's too generic) needs to have it's relevance tuned or eliminated to avoid false positives. For example a JS snippet with tons of ** math formulas might be flagged as Asciidoc... and that should be avoided if possible.

But lets focus on getting them working first I guess. :)

@ggrossetie ggrossetie force-pushed the asciidoc-unconstrained-quote branch from 545a1b5 to 3be39c8 Compare November 17, 2020 10:44
begin: '(?<!\\\\)\\B\\*(?![\\*\\s])',
end: '(\\n{2}|\\*)',
begin: /(?<!\\)\B\*\w/,
end: /\*/,
Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, can't use look-behind, not supported by all browsers yet. Sorry. You could look at making this a chain (with starts) or rebase onto #2834 (beforeMatch) and use the syntactic sugar to the same end. But none of this is true look-behind... we can only look-forward at this time. So in order to do something like this you'll need to match/consume the prior character (ie. another mode can't have already consumed it).

Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or rebase onto #2834 and use the syntactic sugar to the same end.

That's what I'd recommend if you know how. I don't think that stuff is going to make it into 10.4, but it should make it into 10.5 and will make a lot of things like this a lot nicer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or like I suggested earlier capture the \* pattern as it's own previous rule (to prevent this rule from firing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or like I suggested earlier capture the * pattern as it's own previous rule (to prevent this rule from firing).

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that I'm only updating strong rules.
Monospaced and emphasis are quite similar but first I want to make strong rules right (then apply the same rules to other formatting marks).

@ggrossetie ggrossetie force-pushed the asciidoc-unconstrained-quote branch from 36caecf to 1a36281 Compare November 17, 2020 13:56
begin: '\\B\\*(?![\\*\\s])',
end: '(\\n{2}|\\*)',
// must not be preceded by the escape character \
begin: /\B\*(\w\n?\w)+(?!\n\n)/,
Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks strange. What is the exact intentional of this rule and the negative look-ahead? Trying to prevent *...* from wrapping across paragraphs? IE:

*bob*

vs

*bob was

a wild man*

Copy link
Contributor Author

@ggrossetie ggrossetie Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, an empty line (two consecutive newline characters) indicates a new paragraph.

not bold

*bob was

a wild man

bold

*bob was
a wild man*

The last example is equivalent to:

*bob was a wild man*

Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note here then I'll shut up until you ping me. We also have a Slack now if that might interest you. (see top of README).

       begin: /\B\*(\w\n?\w)+(?!\n\n)/,

This is probably way to narrow (and on the other end grabbing way to much). I presume that other things can exist inside bold - like links, italics, other inline parts of the grammar. So this rule would fail to take any of that into account (from what I can see). Plus if you keep extending it to match more and more you run the risk of it matching things that aren't really the "begin" and "end" - which means because this rule ate them, nothing else can match them. IE, optimally begin and end should "eat"/match as little as possible. Usually (for begin) this is accomplished with look-ahead.

What you really want here (I think) is:

begin: /\B\*/,
unlessMiddleContains: /\n\n/ // sadly, this doesn't exist
contains: [ // ... ],
end : /\*/

We have illegal, but that will just abort. We can't "backtrack". So usually what you do (in simpler cases is try to cover the WHOLE expression with a lookahead):

begin: concat(/\B\*/, lookahead(contentWithoutAsteriskOrPairOfNewlinesEndingInAsterisk)),
contains: [ //... ],
end : /\*/

IE, your begin does all the work... I'm just not sure what that regex might look like off the top of my head. :-)

Maybe something like (multi-line):

/\*(([^*\n\\]|\\[^\n])+\n)+([^*\n\\]|\\[^\n])*\*/

It might be simple to have a separate rule for single-line vs trying to cram it into a single regex.

Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 begin: /\B\*(\w\n?\w)+(?!\n\n)/,

Also since you have no way to pin this I'm pretty sure it will just rewind:

*this is bold
but you'd think

that it should not be...

[I believe] the regex will rush forward to think, see the new paragraph (not a match!) and then just rewind to bold\n, which is a match, so the bold rule is a match.

@ggrossetie
Copy link
Contributor Author

Hey @joshgoebel thanks for your prompt reviews!
I'm still experimenting and trying various regular expressions and I don't want to bother you.
Maybe I should mark this pull request as draft for now?

@joshgoebel
Copy link
Member

I was hoping to maybe prevent you from traveling down some dead-end roads but if you just want to explore on your own and then ping me later, that's fine. One concern I always have is that it's possible to do many things with our engine that we don't actually encourage and there is a certain line where you start "parsing" code instead of merely pattern matching it. And we try not to cross into that territory.

I worry getting the rules perfect around a single * might be hard and that makes me wonder how badly we were doing it before. :-)

You picked a non-trivial problem for starters. :-)

@ggrossetie
Copy link
Contributor Author

ggrossetie commented Nov 17, 2020

I was hoping to maybe prevent you from traveling down some dead-end roads but if you just want to explore on your own and then ping me later, that's fine.
One concern I always have is that it's possible to do many things with our engine that we don't actually encourage and there is a certain line where you start "parsing" code instead of merely pattern matching it. And we try not to cross into that territory.

You are probably right.
I will keep the pull request active and explicitly request your review when there's something interesting. Are you OK with that?

I worry getting the rules perfect around a single * might be hard

I adopted a new strategy!
I've implemented basic rules without all the subtle variations. I think the result is quite good 😄
The regular expressions are relatively simple to read and all the tests are passing.

You picked a non-trivial problem for starters. :-)

I like challenges 🥇

that makes me wonder how badly we were doing it before. :-)

At least now we have tests so we can quickly assert what is working and what is not 😆

{
className: 'strong',
// must not precede or follow a word character
begin: /\B\*\w+\*\B/,
Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too tight. " is not \w.

I know they say that *"bob loves jemma"*.

Why not [^\n]*?

Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure these single expressions aren't going to cut it... we'll need separate begin and end tags to handle things like (emphasis inside bold):

Does *Bob _love_ Gemma*?

Which we should handle (we do for Markdown). Or even the proper handling of \* inside bold, which we supported previously. Generally we try not to merge a PR that removes features unless there are no other options.

Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In generally anytime you try and match the middle it's hard. In case like this you'd want to be super flexible, like I suggest earlier... for example if we're matching a single line bold:

// doesn't match all sorts of things, `"` just to name one.
begin: /\B\*\w+\*\B/,
// matches everything except a new line, the one thing we know isn't allowed
begin: /\B\*[^\n]+?\*\B/,

But again you'd have to split this into begin end still:

begin: /\B\*(?=[^\n]+?\*\B)/, // look-ahead to make sure it closes
contains: OTHER_STUFF,
end: /\*\B/

Of course this gets harder since the lookahead regex also has to correctly handle escapes, like \*...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course this gets harder since the lookahead regex also has to correctly handle escapes, like *...

Yes that's why I thought it would be simpler to match everything in begin... 😒

we'll need separate begin and end tags to handle things like (emphasis inside bold)

Unless I missed something, it currently does not work.
It would be nice to support it but we should probably to do it in another pull request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even the proper handling of * inside bold, which we supported previously.

I removed it because I don't see why we need it:

This stuff sure is *gr*e*e*dy*!

This stuff sure is *gr*e*e*dy*!

I'm not convinced yet we can solve the * problem 100%,

I don't think it will be possible unless we create a parser, that's why I'm pruning uncommon syntax and/or syntax with undefined behavior.

I think it's important that This is not \*strong*! works.
Please note that This is *str\*ong*! will produce: This is *str\*ong*!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably port tests from Asciidoctor: https://github.com/asciidoctor/asciidoctor/blob/master/test/substitutions_test.rb

\**Git**Hub

Result: *Git*Hub

\*a few strong words*

Result: *a few strong words*

As you can see \* inside bold is not tested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even the proper handling of * inside bold, which we supported previously.

I was referring to \*, sorry for not being clearer. Someone went to the trouble of supporting it before making me reason perhaps it's more important than you think. :-) Obviously * inside bold doesn't require special treatment if it isn't triggering the end of the bold rule or causing any issues.

Yes that's why I thought it would be simpler to match everything in begin...

Which is the right idea sort of but you have to cover the whole expression and do it with a look-ahead... so that the actual MATCH portion of begin and end is "" and "". But that's not the largest issue - see my points below on regressions.

I don't think it will be possible unless we create a parser

The regex I gave you was working for simple cases, did you try it or play around with it in regex101? See the attached snap...

It would be nice to support it but we should probably to do it in another pull request.

Oh I'm not trying to say accomplish everything in this PR (far from it!), but I'm also not wanting to merge anything that I know is still broken or is going to make future work harder. So regarding broken lets go back to my prior example:

Does *Bob _love_ Gemma*?

This probably sort of works thanks to _ being in \w... So my apologies for not using a better example:

Does *Bob -love- Gemma*? (broken -)
Does *Bob "love" Gemma*? (broken ")
Does *Bob love Gemma? Truly?* (broken ?)

Your PR would break all these that worked previously, would it not? It's possible that I could be persuaded to drop \* for the short-term, but even if that were true I still think there is work to be done here before this would be mergeable. Go back and look at my comments regarding [^\n] vs \w, etc... if we resolved these type issues we might have something workable here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to *, sorry for not being clearer. Someone went to the trouble of supporting it before making me reason perhaps it's more important than you think. :-) Obviously * inside bold doesn't require special treatment if it isn't triggering the end of the bold rule or causing any issues.

Let me check with Dan (who wrote the regular expression 8 years ago). Maybe at the time constrained formatting was not greedy... or maybe I'm missing something!

The regex I gave you was working for simple cases, did you try it or play around with it in regex101? See the attached snap...

Not yet, I will give it a try tonight.

So regarding broken lets go back to my prior example...
This probably sort of works thanks to _ being in \w... So my apologies for not using a better example...
Your PR would break all these that worked previously, would it not?

Let me try these examples on master but I'm pretty sure they do not work as expected today.
But rest assured, I plan to make them work but one step at a time 😉

Go back and look at my comments regarding [^\n] vs \w, etc... if we resolved these type issues we might have something workable here.

👍

{
className: 'strong',
// must not precede or follow a word character
begin: /\B\*\w+\*\B/,
Copy link
Member

@joshgoebel joshgoebel Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In generally anytime you try and match the middle it's hard. In case like this you'd want to be super flexible, like I suggest earlier... for example if we're matching a single line bold:

// doesn't match all sorts of things, `"` just to name one.
begin: /\B\*\w+\*\B/,
// matches everything except a new line, the one thing we know isn't allowed
begin: /\B\*[^\n]+?\*\B/,

But again you'd have to split this into begin end still:

begin: /\B\*(?=[^\n]+?\*\B)/, // look-ahead to make sure it closes
contains: OTHER_STUFF,
end: /\*\B/

Of course this gets harder since the lookahead regex also has to correctly handle escapes, like \*...

// inline unconstrained strong (multi-line)
{
className: 'strong',
begin: /\*{2}(\s\S+)*\*{2}/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This of course as the same issues with begin/end... we need access to contains so we can properly highlight markup that appears inside of bold.

@joshgoebel
Copy link
Member

joshgoebel commented Nov 17, 2020

The regular expressions are relatively simple to read and all the tests are passing.

Simple is good, but I think it only works so well because you've just covered the very basic cases. I'm not convinced yet we can solve the * problem 100%, but if we can it's going to be with something gnarly like I mentioned earlier:

/\*(([^*\n\\]|\\[^\n])+\n)+([^*\n\\]|\\[^\n])*\*/

IE matching:

CONTENT = [ content on a line (and handling escapes), excluding line endings]
LINE = CONTENT "\n"

"*"LINE
LINE(possibly repeated multiple time)
CONTENT"*"

And a second variant for:

"*"CONTENT"*"

ie:

begin: concat(/\*/, lookahead(gnarly_regex)),
contains: [
  escape,  emphasis, etc.
]
end: /\*/

The begin look-ahead does the validation, but yet begin only actually matches/returns * (as it should), then contains has to handle any escapes, etc (in addition to other rules that could come in there)... so that when your find end rule sees a * you know it's a good one.

You can play around on https://regex101.com/. It's super useful for stuff like this.

@joshgoebel
Copy link
Member

Made it a bit easier to see with named groups:

Screen Shot 2020-11-17 at 2 59 50 PM

@joshgoebel
Copy link
Member

joshgoebel commented Nov 18, 2020

So if we just want to focus on minimum to merge I think we need to focus on:

      // inline contrained string (single line)
      {
        className: 'strong',
        // must not precede or follow a word character
        begin: /\B\*\w+\*\B/,
      },
      // inline constrained strong (multi-line)
      {
        className: 'strong',
        // must not precede or follow a word character
        begin: /\B\*(\w\n?)*\*/,
      },

We need more tests cases and those rules need to be widened so as to not fail with simple textual content in between the ** pair (examples above in my most recent responses). Once that's working there are some minor consistency issues we could go over or I can maybe just clean up myself.

If you only want to exclude one or two thigns focus on negative matching, not positive so for example I think the first rule would be solved with:

begin: /\B\*[^\n]+\*\B/

Would it not?

Comment on lines 18 to 28
function unpack_hex_range(str) {
return str.replace(HEX_RANGE_RX, function (_, p1, p2) {
return `\\u${p1}${p2 ? `-\\u${p2}` : ''}`
})
}

// Unicode property aliases (P_<name> constant is equivalent to \p{<name>})
const P_L = 'A-Za-z' + (unpack_hex_range('00AA00B500BA00C0-00D600D8-00F600F8-02C102C6-02D102E0-02E402EC02EE0370-037403760377037A-037D037F03860388-038A038C038E-03A103A3-03F503F7-0481048A-052F0531-055605590561-058705D0-05EA05F0-05F20620-064A066E066F0671-06D306D506E506E606EE06EF06FA-06FC06FF07100712-072F074D-07A507B107CA-07EA07F407F507FA0800-0815081A082408280840-085808A0-08B20904-0939093D09500958-09610971-09800985-098C098F09900993-09A809AA-09B009B209B6-09B909BD09CE09DC09DD09DF-09E109F009F10A05-0A0A0A0F0A100A13-0A280A2A-0A300A320A330A350A360A380A390A59-0A5C0A5E0A72-0A740A85-0A8D0A8F-0A910A93-0AA80AAA-0AB00AB20AB30AB5-0AB90ABD0AD00AE00AE10B05-0B0C0B0F0B100B13-0B280B2A-0B300B320B330B35-0B390B3D0B5C0B5D0B5F-0B610B710B830B85-0B8A0B8E-0B900B92-0B950B990B9A0B9C0B9E0B9F0BA30BA40BA8-0BAA0BAE-0BB90BD00C05-0C0C0C0E-0C100C12-0C280C2A-0C390C3D0C580C590C600C610C85-0C8C0C8E-0C900C92-0CA80CAA-0CB30CB5-0CB90CBD0CDE0CE00CE10CF10CF20D05-0D0C0D0E-0D100D12-0D3A0D3D0D4E0D600D610D7A-0D7F0D85-0D960D9A-0DB10DB3-0DBB0DBD0DC0-0DC60E01-0E300E320E330E40-0E460E810E820E840E870E880E8A0E8D0E94-0E970E99-0E9F0EA1-0EA30EA50EA70EAA0EAB0EAD-0EB00EB20EB30EBD0EC0-0EC40EC60EDC-0EDF0F000F40-0F470F49-0F6C0F88-0F8C1000-102A103F1050-1055105A-105D106110651066106E-10701075-1081108E10A0-10C510C710CD10D0-10FA10FC-1248124A-124D1250-12561258125A-125D1260-1288128A-128D1290-12B012B2-12B512B8-12BE12C012C2-12C512C8-12D612D8-13101312-13151318-135A1380-138F13A0-13F41401-166C166F-167F1681-169A16A0-16EA16F1-16F81700-170C170E-17111720-17311740-17511760-176C176E-17701780-17B317D717DC1820-18771880-18A818AA18B0-18F51900-191E1950-196D1970-19741980-19AB19C1-19C71A00-1A161A20-1A541AA71B05-1B331B45-1B4B1B83-1BA01BAE1BAF1BBA-1BE51C00-1C231C4D-1C4F1C5A-1C7D1CE9-1CEC1CEE-1CF11CF51CF61D00-1DBF1E00-1F151F18-1F1D1F20-1F451F48-1F4D1F50-1F571F591F5B1F5D1F5F-1F7D1F80-1FB41FB6-1FBC1FBE1FC2-1FC41FC6-1FCC1FD0-1FD31FD6-1FDB1FE0-1FEC1FF2-1FF41FF6-1FFC2071207F2090-209C21022107210A-211321152119-211D212421262128212A-212D212F-2139213C-213F2145-2149214E218321842C00-2C2E2C30-2C5E2C60-2CE42CEB-2CEE2CF22CF32D00-2D252D272D2D2D30-2D672D6F2D80-2D962DA0-2DA62DA8-2DAE2DB0-2DB62DB8-2DBE2DC0-2DC62DC8-2DCE2DD0-2DD62DD8-2DDE2E2F300530063031-3035303B303C3041-3096309D-309F30A1-30FA30FC-30FF3105-312D3131-318E31A0-31BA31F0-31FF3400-4DB54E00-9FCCA000-A48CA4D0-A4FDA500-A60CA610-A61FA62AA62BA640-A66EA67F-A69DA6A0-A6E5A717-A71FA722-A788A78B-A78EA790-A7ADA7B0A7B1A7F7-A801A803-A805A807-A80AA80C-A822A840-A873A882-A8B3A8F2-A8F7A8FBA90A-A925A930-A946A960-A97CA984-A9B2A9CFA9E0-A9E4A9E6-A9EFA9FA-A9FEAA00-AA28AA40-AA42AA44-AA4BAA60-AA76AA7AAA7E-AAAFAAB1AAB5AAB6AAB9-AABDAAC0AAC2AADB-AADDAAE0-AAEAAAF2-AAF4AB01-AB06AB09-AB0EAB11-AB16AB20-AB26AB28-AB2EAB30-AB5AAB5C-AB5FAB64AB65ABC0-ABE2AC00-D7A3D7B0-D7C6D7CB-D7FBF900-FA6DFA70-FAD9FB00-FB06FB13-FB17FB1DFB1F-FB28FB2A-FB36FB38-FB3CFB3EFB40FB41FB43FB44FB46-FBB1FBD3-FD3DFD50-FD8FFD92-FDC7FDF0-FDFBFE70-FE74FE76-FEFCFF21-FF3AFF41-FF5AFF66-FFBEFFC2-FFC7FFCA-FFCFFFD2-FFD7FFDA-FFDC'))
const P_Nl = unpack_hex_range('16EE-16F02160-21822185-218830073021-30293038-303AA6E6-A6EF')
const P_Nd = '0-9' + (unpack_hex_range('0660-066906F0-06F907C0-07C90966-096F09E6-09EF0A66-0A6F0AE6-0AEF0B66-0B6F0BE6-0BEF0C66-0C6F0CE6-0CEF0D66-0D6F0DE6-0DEF0E50-0E590ED0-0ED90F20-0F291040-10491090-109917E0-17E91810-18191946-194F19D0-19D91A80-1A891A90-1A991B50-1B591BB0-1BB91C40-1C491C50-1C59A620-A629A8D0-A8D9A900-A909A9D0-A9D9A9F0-A9F9AA50-AA59ABF0-ABF9FF10-FF19'))
const P_Pc = unpack_hex_range('005F203F20402054FE33FE34FE4D-FE4FFF3F')
Copy link
Member

@joshgoebel joshgoebel Nov 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow. Two big steps forward one small step back. :-) This might be a bit much. :-)

Context: #2756

We're still up in the air on UTF8 but if we're headed there we should do it properly, not with this kind of hack.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could leave a TODO comment if you wanted.

Copy link
Contributor Author

@ggrossetie ggrossetie Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will use \w and remove the associated test case.
We are using these unicode ranges in Asciidoctor.js to translate \p{Word} (Ruby).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we have access to the /u flag in JS and if we're going to start matching \p stuff that's what we should be using, but right now there are some performance concerns. If you know anything about it and wanted to contribute to that thread help would be welcome. :-)

className: 'strong',
// must not precede or follow a word character
begin: /\B\*\w+\*\B/,
begin: `\\B\\*(\\S|\\S${CC_ANY}*?\\S)\\*(?!${CG_WORD})`,
Copy link
Member

@joshgoebel joshgoebel Nov 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pull out all the UTF-8 and just let this be \W for now? (vs ! CG_WORD)

// must ignore until the next formatting marks
// this rule is not 100% compliant with Asciidoctor but we are entering undefined behavior territory...
{
begin: /\\\\\*{2}[^\n]*\*{2}/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait does this syntax literally mean "it's marked as bold, but ignore it and don't make it bold"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also shouldn't this be CC_ANY?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and you don't need to turn them all into strings either, just use regex.concat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait does this syntax literally mean "it's marked as bold, but ignore it and don't make it bold"?

It's another guard when unconstrained bold is escaped using \\. Take a look at the test case:

\\**--anything goes **

I've also added an even more complex test case:

\\**--anything goes ** with * a **lot** of stars**

Also shouldn't this be CC_ANY?

Yes

Oh and you don't need to turn them all into strings either, just use regex.concat.

Oh really? Could you please provide an example, I've never seen regex.concat 🤯

Copy link
Member

@joshgoebel joshgoebel Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in lib/regex.js... just utility functions. There are a few.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's another guard when unconstrained bold is escaped using \. Take a look at the test case:

OK if really it's like a paired "don't apply any highlighting on the inside", ie "anything goes"... (kind of like a comment in a programming language) but why are there 3 variants of it? :-) Asciidoc is strange. :)

@joshgoebel joshgoebel force-pushed the asciidoc-unconstrained-quote branch from 2b13bf1 to 0f1f719 Compare November 18, 2020 22:04
{
className: 'strong',
// must not precede or follow a word character
begin: /\B\*(\w\n?)*\*(?!\w)/,
Copy link
Member

@joshgoebel joshgoebel Nov 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so confused about constrains vs unconstrained... shouldn't this work:

*does bob
-really-
know best?*

I'm wondering if perhaps this is really just the same as the ** except for the before/end conditions? If so then that regex would probably work here also I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so confused about constrains vs unconstrained... shouldn't this work:

*does bob
-really-
know best?*

The above should produce bold.

I'm wondering if perhaps this is really just the same as the ** except for the before/end conditions? If so then that regex would probably work here also I think?

Close but not quite the same: https://asciidoctor.org/docs/user-manual/#when-should-i-use-unconstrained-quotes

Consider the following questions:
Is there a letter, number, underscore directly outside the formatting marks (on either side)?
Is there a colon, semi-colon, or closing curly bracket directly before the starting formatting mark?
Is there a space directly inside of the formatting mark?
If you answered “yes” to any of these questions, you need to switch to unconstrained (double formatting) quotes.

See: https://github.com/asciidoctor/asciidoctor/blob/dd843d59b1e49f29d68386bfdf0438ce11b42577/lib/asciidoctor.rb#L447-L449
Please keep in mind that we are also doing some work in https://github.com/asciidoctor/asciidoctor/blob/dd843d59b1e49f29d68386bfdf0438ce11b42577/lib/asciidoctor/substitutors.rb#L75

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also using a less greedy regular expression than Asciidoctor because we want to stop when the paragraph ends.

Copy link
Member

@joshgoebel joshgoebel Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above should produce bold.

But your rule doesn't match the -, or am I missing something?

Close but not quite the same

Unless I'm missing something all I read there relates directly to what qualifies the beginning and ending of the match though, yes? Matching the "middle" content could use the same (or similar) wider rule so that it wouldn't get hung up on details like - or punctuation, etc. So really we're trying to structure the regex into 3 segments:

  • beginning
  • middle (it seems this should be as WIDE as possible)
  • end

It seems for both * and ** the middle rules are the same, no? Only the begin and end conditions change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But your rule doesn't match the -, or am I missing something?

Oh yes, you're right, sorry I didn't understand.
I've added this case and updated the regular expression accordingly.

Unless I'm missing something all I read there relates directly to what qualifies the beginning and ending of the match though, yes?

No, this rule is about what is inside the formatting marks: "Is there a space directly inside of the formatting mark?"

this is * a bold move*

this is ** a bold move**

Only the second sentence will produce bold.

So really we're trying to structure the regex into 3 segments:

I guess since the formatting marks are delimiters, so:

  • beginning: delimiter
  • middle: some content
  • end: delimiter

In theory, that should be easy but in practice it's not because "*" can be used in different context:

****
sidebar
****

* foo
* bar
** baz

Read the footnotes*.

E = mc*2*

E = *mc*2

Use `/*` and `*/` for multiline comments.

The last example is not correctly interpreted by Asciidoctor 😅

Anyway, we should be careful about "as WIDE as possible".

@ggrossetie ggrossetie force-pushed the asciidoc-unconstrained-quote branch from 23060fe to a5c1993 Compare November 19, 2020 11:10
@ggrossetie ggrossetie force-pushed the asciidoc-unconstrained-quote branch from c12d1c2 to 069e63e Compare November 19, 2020 12:40
@joshgoebel joshgoebel added this to the 10.5 milestone Nov 20, 2020
@joshgoebel joshgoebel force-pushed the asciidoc-unconstrained-quote branch from c3a1a4f to 632195d Compare December 10, 2020 21:34
Copy link
Member

@joshgoebel joshgoebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add changelog entry please.

@ggrossetie
Copy link
Contributor Author

Add changelog entry please.

@joshgoebel Done.

Once it's merged, I will also open a follow-up pull request to apply the same unconstrained rules to emphasis and monospaced.

@joshgoebel joshgoebel merged commit 0f5f591 into highlightjs:master Dec 11, 2020
@ggrossetie ggrossetie deleted the asciidoc-unconstrained-quote branch December 11, 2020 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(asciidoc) Unconstrained text formatting is not correctly highlighted
2 participants