schtandard
Contributor

@schtandard schtandard commented Oct 20, 2020

On Mode Chaining

There are two obvious ways of chaining modes, i.e. putting them together in such a way that they are applied consecutively, with no mode being applied more than once.

  1. The first option has the following structure:
    const CHAIN = {
      begin: /(?=A)/,
      contains: [{begin: /A/, endsParent: true}],
      starts: {
        contains: [{begin: /B/, endsParent: true}],
        starts: {
          contains: [{begin: /C/, endsParent: true}],
          starts: {}
        }
      }
    };
    • The outer-most mode could have been started differently. I chose a lookahead for the first contained mode here, because that permits all links in the chain to have the same structure. This also eases the implementation.
    • The trailing starts: {} could have been omitted, but was added because it is present in the simplest implementation.
    • If any link in such a chain is omitted, the next link is applied regardless. That is, AC would be completely matched by this chain (but only the A in AXBC).
  2. The second option has the following structure:
    const CHAIN = {
      begin: /(?=A)/,
      contains: [{
        begin: /A/,
        starts: {
          endsParent: true,
          contains: [{
            begin: /B/,
            starts: {
              endsParent: true,
              contains: [{
                begin: /C/,
                starts: {
                  endsParent: true,
                  contains: []
                }
              }]
            }
          }]
        }
      }]
    };
    • A similar reasoning as above applies to the outermost and innermost layers of this chain.
    • If any link of such a chain is omitted, the entire chain is terminated. That is, only the A in AC would be matched by this chain (as well as in AXBC).
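Both structures above can be generated mechanically from a flat list of modes. The following is a minimal sketch, not the code from this pull request: the names outerChain and innerChain are made up for illustration, and it assumes every mode has a RegExp in begin and does not itself use starts or endsParent.

```javascript
// Hypothetical helpers (not part of highlight.js or of this PR).

// First kind: every link contains one mode that ends its parent,
// and `starts` hands over to the next link.
function outerChain(modes) {
  const links = modes.reduceRight(
    (next, mode) => ({
      contains: [{ ...mode, endsParent: true }],
      starts: next
    }),
    {} // the trailing `starts: {}`
  );
  // Enter the chain through a lookahead for the first mode.
  return { begin: new RegExp(`(?=${modes[0].begin.source})`), ...links };
}

// Second kind: every mode starts a parent-ending mode that
// contains only the next mode.
function innerChain(modes) {
  const contained = modes.reduceRight(
    (inner, mode) => [
      { ...mode, starts: { endsParent: true, contains: inner } }
    ],
    [] // the innermost `contains: []`
  );
  return {
    begin: new RegExp(`(?=${modes[0].begin.source})`),
    contains: contained
  };
}

const LINKS = [{ begin: /A/ }, { begin: /B/ }, { begin: /C/ }];
const FIRST_KIND = outerChain(LINKS);
const SECOND_KIND = innerChain(LINKS);
```

Building from the tail with reduceRight keeps both helpers free of recursion and leaves the input modes untouched.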

These two types of chain can be combined: The second kind can be contained in the first kind. Motivated by this, I shall call the first kind an outer chain and the second kind an inner chain hereafter. Indicating the desired input syntax, I shall denote the first chain above as [A, B, C] and the second as [[A, B, C]]. A chain containing two (inner) subchains could then be [A, [B, C], [D, E, F]].

Mode Chaining in LaTeX

In (La)TeX, there is often a need for sequential modes. For example, the macro

\section[short form]{long form}

has an optional argument (in brackets: short form) and a mandatory argument (in braces: long form). Also, spaces and a single linebreak (but not two) are allowed before each argument.
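The whitespace rule can be approximated with a single regular expression. This is only a sketch: the names SPACES and skipSpaces are invented here, and treating "spaces" as blanks and tabs is an assumption.

```javascript
// Blanks/tabs, then at most one line break, then more blanks/tabs.
// The sticky flag (y) makes the match start exactly at lastIndex.
const SPACES = /[ \t]*(?:\r?\n[ \t]*)?/y;

// Returns the position after the allowed whitespace starting at `pos`.
function skipSpaces(text, pos) {
  SPACES.lastIndex = pos;
  const match = SPACES.exec(text);
  return match ? pos + match[0].length : pos;
}

const oneBreak = "\\section \n [short]";
console.log(oneBreak[skipSpaces(oneBreak, 8)]); // "["
```

A second consecutive line break is not consumed, so a blank line (a paragraph break in TeX) stops the skipping.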

This could easily be matched using a chain:

[['\section', spaces], ['[', shortform, ']', spaces], ['{', longform, '}']]

An easy way to input chains like this would provide a simple way of highlighting every part of such a chain.

Changes

I added functions to automatically chain modes in the way outlined above, as well as some utility functions for applying them to highlighting LaTeX. I replaced the verbatim rules, which already used chaining but in a somewhat unclear manner, with equivalent ones using the new chaining functions.

The functions were designed to provide a simple interface, so that the following works as expected:

const A = {begin: /A/};
const B = {begin: /B/};
const C = {begin: /C/};
const D = {begin: /D/};
const E = {begin: /E/};
const F = {begin: /F/};
const CHAIN = MODE_CHAIN([A, [B, C], [D, E, F]]);

I believe that this change makes the rules easier to understand and maintain (once the concepts described above are known). It also greatly simplifies adding new rules for highlighting certain elements like sections, environments, document classes or used packages, as was contemplated in #2726.

Checklist

  • Added markup tests, or they don't apply here because...

    The behavior of the grammar is not changed by this pull request.

  • Updated the changelog at CHANGES.md

    Should I do this now or when this is finalized? Or are you taking care of this? I read something about constant merge conflicts..

  • Added myself to AUTHORS.txt, under Contributors

    Same question. Should I already have done this last time around?

@joshgoebel
Member

This is a lot to digest. There are a few high level issues here that I think about from the perspective of the core library. First, easy chaining is definitely something we're lacking. Yet I'm also hesitant to pursue a solution that isn't something that I could imagine potentially being rolled forward into a more generic solution that the entire library could use. IE, each grammar (or even a few) building their own quite complex meta-constructs isn't something I want to encourage. And complex chaining definitely crosses over into that territory. It's also quite likely that something of this complexity needs its own specific tests apart from grammar tests.

I'd rather we perhaps try and first imagine the good abstraction (ie, how to describe chains in the grammar) [looking across multiple languages]... did you read the other issue thread on this topic? And then decide if that's best done with meta-modes or if we need to build some new functionality into the parser itself. starts can be one of the more confusing aspects of our rules language and it's only used for chaining because currently it's really the only way possible - and where there is a will people will find a way.

Now I'm not in a hurry to add more and more features to the core parser, but this may be an area where it's really worth looking at doing so - or if there are better building blocks that should exist to support this than starts.

I wonder how much of this could be solved if we simply had a way to color different regex groups differently? Would you need mode chaining at all then?

These two types of chain can be combined: The second kind can be contained in the first kind. Motivated by this, I shall call the first kind an outer chain and the second kind an inner chain hereafter. Indicating the desired input syntax, I shall denote the first chain above as [A, B, C] and the second as [[A, B, C]]. A chain containing two (inner) subchains could then be [A, [B, C], [D, E, F]].

You completely lost me here. Are we saying that "inner" chains must always require each element (or terminate) but that in "outer" chains any element is optional? Or was it the reverse? The inner/outer naming doesn't help in understanding I don't think.

That is, AC would be completely matched by this chain (but only the A in AXBC).

You forgot to mention that in the first form matches can easily repeat also:

ABBCCC

Since contains will just keep looking for the next match until no more can be found.

@joshgoebel
Member

joshgoebel commented Oct 20, 2020

I also feel at a gut level that perhaps the chaining implementation should not deal with recursive chaining on its own... if one wants a chain inside a chain, then write it as such:

var MODE_C = CHAIN(X,Y,Z)
var MATCH = CHAIN(MODE_A, MODE_B, MODE_C)

If there can be different types of chains then each chain should probably express its own type. Just spitballing:

var MODE_C = CHAIN(X,Y,Z, { type: "keep_going"})
var MATCH = CHAIN(MODE_A, MODE_B, MODE_C, { type: "terminate_early"})

I almost always prefer explicit over implicit...

@joshgoebel
Member

I also wonder if we're not using the existing grammar to its full potential... if your rule is only matching a single regex then you don't need a "chain" to say "A, then spaces" (or "spaces then A"), you can do that all in a single rule:

{
  begin: /my expression/,  // the match
  end: /\s*/, // space eater expression,
  excludeEnd: true
}

@joshgoebel
Member

Are you sure the ends parents are even required in the second form? The starts chain should really unravel itself when done... the only reason endsParent is needed for contains is because you're using the child AS the parent in the way...

@joshgoebel
Member

joshgoebel commented Oct 20, 2020

If we're sticking with what the parser currently has implemented I'm far more partial to the second form - and having a single top level form that's used for chaining... as it's more explicit... every item leads into the next... and then if you want to deal with optional items or repeatable items you do that within the framework of a single form of chaining:

let CHAIN = [A, B, optional(C), repeatable(D)]

Where you'd embed this complexity in the modes themselves. So instead of:

// non repeatable (required, or sequence will short-circuit)
      contains: [{
        begin: /D/,
        starts: {

You'd end up generating something like:

// repeatable (0-x times)
      contains: [{
        begin: /\b|\B/,  // immediate match
        contains: [ { begin: /D/ }],
        starts: {

But this is all encapsulated in the repeatable helper, keeping the chain code itself very, very simple. And of course you could imagine an "at least once" variant as well, though that might be harder to do since you could have an end and you don't have use of starts... that one would require some thought.

@schtandard
Contributor Author

This is a lot to digest.

True.

I'd rather we perhaps try and first imagine the good abstraction (ie, how to describe chains in the grammar) [looking across multiple languages]... did you read the other issue thread on this topic?

You mean #1140? I had not, but I did now. Haven't quite wrapped my head around the implications of that proposal, though. I don't think I'm familiar enough with the internals of the parser to understand your suggestion of hacking it in by making highlight take a MODE.

What I did here was just make the best of what's already in the parser (using starts and endsParent).

Now I'm not in a hurry to add more and more features to the core parser, but this may be an area where it's really worth looking at doing so - or if there are better building blocks that should exist to support this than starts.

That's true. I can't think of anything off the top of my head, maybe I'll think of something in time..

I wonder how much of this could be solved if we simply had a way to color different regex groups differently? Would you need mode chaining at all then?

This would indeed greatly reduce the need for chaining, but not eliminate it. Regardless, I would say that this is something you should really think about adding to the parser. To me, it was really surprising that it was not already in there. All the information is there (the regex match groups), one just needs to add a way of assigning them classNames.

It would not totally eliminate the need for chaining, though, as some patterns cannot (easily) be matched using regexes. One example is groups of matched braces. Another example is the need for further highlighting inside one of the chain links. For example, in \section[short]{something}, both short and something may contain elements that require highlighting, like control sequences.

These two types of chain can be combined: The second kind can be contained in the first kind. Motivated by this, I shall call the first kind an outer chain and the second kind an inner chain hereafter. Indicating the desired input syntax, I shall denote the first chain above as [A, B, C] and the second as [[A, B, C]]. A chain containing two (inner) subchains could then be [A, [B, C], [D, E, F]].

You completely lost me here. Are we saying that "inner" chains must always require each element (or terminate) but that in "outer" chains any element is optional? Or was it the reverse? The inner/outer naming doesn't help in understanding I don't think.

Yes, in outer chains each link is optional and may be skipped, while inner chains can only be terminated early, but require each link apart from that. It's true that the inner/outer naming doesn't help here. I chose it because inner chains can be inside outer chains but not the other way around.

That is, AC would be completely matched by this chain (but only the A in AXBC).

You forgot to mention that in the first form matches can easily repeat also:

ABBCCC

Since contains will just keep looking for the next match until no more can be found.

No, that's what the endsParents are for. No link in the chain can occur more than once.

To make it explicit, the chain [A, [B, C], [D, E, F]] mentioned above can match any line of the form XYZ where

  • X is A,
  • Y is absent, B or BC and
  • Z is absent, D, DE or DEF.

X cannot be absent, assuming that the chain is started using a lookahead for A, as in the examples above.
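The set of accepted inputs can be enumerated from the nested-array notation. The sketch below is only a model of the rules just listed (it does not build highlight.js modes), and acceptedSequences is a hypothetical name; links are represented by plain strings.

```javascript
// Enumerate every sequence a chain like ["A", ["B", "C"], ["D", "E", "F"]]
// accepts, following the rules stated above.
function acceptedSequences(chain) {
  const optionsFor = (item, isFirst) => {
    const links = Array.isArray(item) ? item : [item];
    // An inner chain may stop after any link: "B", "BC", ...
    const prefixes = links.map((_, i) => links.slice(0, i + 1).join(""));
    // Later items may be skipped entirely; the first one may not,
    // because the chain is entered through a lookahead for it.
    return isFirst ? prefixes : ["", ...prefixes];
  };
  return chain.reduce(
    (heads, item, i) =>
      heads.flatMap((head) =>
        optionsFor(item, i === 0).map((tail) => head + tail)
      ),
    [""]
  );
}

const ACCEPTED = acceptedSequences(["A", ["B", "C"], ["D", "E", "F"]]);
console.log(ACCEPTED.length); // 12 (1 choice for X, 3 for Y, 4 for Z)
```

AC is not in the list: C can only follow B, because [B, C] is an inner chain.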

@schtandard
Contributor Author

I also feel at a gut level that perhaps the chaining implementation should not deal with recursive chaining on its own... if one wants a chain inside a chain, then write it as such:

var MODE_C = CHAIN(X,Y,Z)
var MATCH = CHAIN(MODE_A, MODE_B, MODE_C)

If there can be different types of chains then each chain should probably express its own type. Just spitballing:

var MODE_C = CHAIN(X,Y,Z, { type: "keep_going"})
var MATCH = CHAIN(MODE_A, MODE_B, MODE_C, { type: "terminate_early"})

I almost always prefer explicit over implicit...

Sure, this could be done, but it would require much more overhead, as modes that are already stacked together would need to be handled. One would also need to test for correct application of the different kinds of chains (a terminate_early chain can be inserted into a keep_going chain but not the other way around). Also, putting a chain into another chain really just means concatenating them, so the overall structure remains flat (or 2-dimensional, if one considers both kinds of chain). This nesting of calls suggests otherwise.

If you are looking to put this into core, this might be reasonable, but I'm not sure. I feel that the current structure is much easier to understand: Modes can be input in the form of 2-dimensional arrays, leading to the equivalent chain. Only modes that do not use starts are safe, unless one knows exactly what one is doing.

@schtandard
Contributor Author

I also wonder if we're not using the existing grammar to its full potential... if your rule is only matching a single regex then you don't need a "chain" to say "A, then spaces" (or "spaces then A"), you can do that all in a single rule:

{
  begin: /my expression/,  // the match
  end: /\s*/, // space eater expression,
  excludeEnd: true
}

That's true in the example I gave above. (Maybe I should have chosen a more complicated one.) It's only a special case, though. Really the example should have been

\section [short] {long}

where both short and long may contain other highlighted material as well as balanced brace groups and the spaces may be many spaces/tabs and single newlines. Sure, the spaces after \section could still be included in the same regex, but not the other ones. Even the closing bracket can't be found using purely regexes. (And I think the issue of balanced delimiters is a pretty common one in many languages.)

@schtandard
Contributor Author

Are you sure the ends parents are even required in the second form? The starts chain should really unravel itself when done... the only reason endsParent is needed for contains is because you're using the child AS the parent in the way...

Not sure what you mean here. But yes, the endsParents are necessary in order to avoid repeating chain links.

@schtandard
Contributor Author

If we're sticking with what the parser currently has implemented I'm far more partial to the second form - and having a single top level form that's used for chaining... as it's more explicit... every item leads into the next... and then if you want to deal with optional items or repeatable items you do that within the framework of a single form of chaining:

let CHAIN = [A, B, optional(C), repeatable(D)]

Where you'd embed this complexity in the modes themselves. So instead of:

// non repeatable (required, or sequence will short-circuit)
      contains: [{
        begin: /D/,
        starts: {

You'd end up generating something like:

// repeatable (0-x times)
      contains: [{
        begin: /\b|\B/,  // immediate match
        contains: [ { begin: /D/ }],
        starts: {

But this is all encapsulated in the repeatable helper, keeping the chain code itself very, very simple. And of course you could imagine an "at least once" variant as well, though that might be harder to do since you could have an end and you don't have use of starts... that one would require some thought.

That would be a nice syntax, yes. I see some difficulties with this, though (not sure how easy they are to overcome):

  • How would you avoid zero-width matches with optional chain links? In the second chaining mode, each mode has to be traversed, so making one of them optional would mean making it zero-width, right? (This also applies to your proposed pattern for repeatable modes.)
  • If we solve that problem, we could make modes repeat between m and n times pretty easily, by just chaining m mandatory and n − m optional ones in succession. Arbitrary repetition would not work that way, though.

@joshgoebel
Member

All the information is there (the regex match groups), one just needs to add a way of assigning them classNames.

Except it's not because JS regex match doesn't give you positioning data for submatches... so if you have a string with groups separated by non-groups:

(\w)xxx(\w)yyy(\w)

It can be very hard to piece it back together later since all we get are the matches not the non-matches... so now you'd have to rewrite the regex to make everything a match group and then add a layer of abstraction... and now you're getting into performance concerns maybe.
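The limitation and the "make everything a match group" workaround can be illustrated as follows. This is a sketch under stated assumptions: groupOffsets is a hypothetical helper, and it only works when the groups are non-nested and cover the whole match without gaps.

```javascript
// A plain exec() result carries the offset of the whole match only;
// the captured groups are bare strings.
const sparse = /(\w)xxx(\w)yyy(\w)/.exec("--axxxbyyyc");
console.log(sparse.index, sparse[1], sparse[2], sparse[3]); // 2 a b c

// Workaround: capture every part, then derive each group's offsets by
// accumulating lengths from the start of the match.
function groupOffsets(regex, text) {
  const match = regex.exec(text);
  if (!match) return null;
  let position = match.index;
  return match.slice(1).map((group) => {
    const value = group || ""; // unmatched optional groups are undefined
    const start = position;
    position += value.length;
    return { text: value, start, end: position };
  });
}

const parts = groupOffsets(/(\w)(xxx)(\w)(yyy)(\w)/, "--axxxbyyyc");
console.log(parts[2]); // { text: 'b', start: 6, end: 7 }
```

Newer JavaScript engines also provide the d (hasIndices) flag, which reports group offsets directly; it was not generally available when this thread was written.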

No, that's what the endsParents are for. No link in the chain can occur more than once.

Ah, yes. Correct.

@joshgoebel
Member

Sure, this could be done, but it would require much more overhead, as modes that are already stacked together would need to be handled. One would also need to test for correct application of the different kinds of chains (a terminate_early chain can be inserted into a keep_going chain but not the other way around).

I'm not sure this is true... can you give me an example of what you don't think is possible with my suggestion?

Also, putting a chain into another chain really just means concatenating them

Pretty sure this is not true once you get into optionals/repeats because they need to be wrapped differently, they are truly "submodes" of the chain, not just items in it.

If you are looking to put this into core, this might be reasonable, but I'm not sure. I feel that the current structure is much easier to understand

Nested chains with one having entirely different semantics than the other doesn't seem simple to me. :-)

@joshgoebel
Member

But yes, the endsParents are necessary in order to avoid repeating chain links.

It's working here for me (in the 2nd case) without any issue... what are we worried about repeated, the individual items? I don't see how that's possible in the 2nd form, but I'll take a second look.

@joshgoebel
Member

How would you avoid zero-width matches with optional chain links? In the second chaining mode, each mode has to be traversed, so making one of them optional would mean making it zero-width, right? (This also applies to your proposed pattern for repeatable modes.)

This can be fixed in the parser... the real issue is with an infinite loop of 0-width matches, not just 0-width in general... so that requirement could be relaxed.

If we solve that problem, we could make modes repeat between m and n times pretty easily, by just chaining m mandatory and n − m optional ones in succession. Arbitrary repetition would not work that way, though.

Wait, what? I just gave an example of 0-many... it's just a contains inside a mode... I think what we need is an abstract test suite with a bunch of examples... I think I might try and go tackle a few right this second then post back.

@joshgoebel
Member

joshgoebel commented Oct 22, 2020

Regardless, I would say that this is something you should really think about adding to the parser. To me, it was really surprising that it was not already in there.

I'm not sure I disagree, but I'm also not 100% sure this isn't or wasn't intentional. Again we've always been a pattern matcher + a bit of context... starts is useful to build some very simple A + B chaining... but the kind of things you're trying to build here are really pushing the boundaries of what we've traditionally tried to do with most languages in the past. So if this is a pattern we're going to start to encourage in the future I want to make sure that it's a pattern I like and makes sense in the grand scheme of where we want to go (not just "what works today").

Also worth noting the 0-width is only a hard error in debug mode... in production mode it's a soft error and the parser just advances a token... which might not help this case (it can skip things), but it's just something worth knowing.

I just had a fresh thought... how much of this would a sequenced contains solve? This feels a lot more like the type of things I'm very often trying to describe in other grammars...

{
  // you'd probably want a look ahead for your primary match "\section"
  containsInOrder: [
    {...}, //  \section
    {..., optional: true }, //  [short]
    {...}, //  {long}
    {..., repeatable: true }
  ]
}

An "at least one" could be done with just a non-optional and then a repeatable (however it was abstracted)... The pattern would terminate whenever it could not find the next non-optional.

You could abstract it a bit nicer if you wanted but an embedded chain would be just another containsInOrder:

{
  containsInOrder: [
    { containsInOrder: [SECTION_MODE, GOBBLE_SPACE] }, //  \section
    {..., optional: true }, //  [short]
    {...}, //  {long}
    {..., repeatable: true }
  ]
}

I'm not sure we should support or encourage this (vs flat chaining which should work for MANY things)... but if we did I imagine you'd want the "can't move forward" to cascade up the chain... so if GOBBLE_SPACE was mandatory and then not found it would not just terminate the section mode, but the entire parent mode would end. And if containsInOrder was truly a parser level feature and didn't support nesting then of course someone could use starts to build some sort of simple manual nested if it truly proved to be necessary.
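The flat (non-nested) semantics proposed here can be modelled outside the parser. The following toy matcher is purely illustrative: containsInOrder does not exist in highlight.js, matchInOrder and the name field are invented for this sketch, and each step is reduced to a single regex in begin.

```javascript
// Toy model of "contains, but in order": steps are tried one after the
// other at the current position. `repeatable` means zero or more times,
// `optional` means zero or one; a missing mandatory step ends the sequence.
function matchInOrder(steps, text) {
  let position = 0;
  const found = [];
  for (const step of steps) {
    const sticky = new RegExp(step.begin.source, "y");
    let count = 0;
    for (;;) {
      sticky.lastIndex = position;
      const match = sticky.exec(text);
      if (!match || match[0].length === 0) break; // never loop on 0-width
      found.push({ name: step.name, text: match[0] });
      position += match[0].length;
      count += 1;
      if (!step.repeatable) break;
    }
    if (count === 0 && !step.optional && !step.repeatable) break;
  }
  return found;
}

const SECTION_STEPS = [
  { name: "section", begin: /\\section/ },
  { name: "short", begin: /\[[^\]]*\]/, optional: true },
  { name: "long", begin: /\{[^}]*\}/ },
  { name: "extra", begin: /\{[^}]*\}/, repeatable: true }
];

console.log(matchInOrder(SECTION_STEPS, "\\section{long}").map((f) => f.name));
// [ 'section', 'long' ]
```

An "at least one" step is expressed exactly as described: a mandatory step followed by a repeatable one with the same pattern.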

@schtandard
Contributor Author

Except it's not because JS regex match doesn't give you positioning data for submatches... so if you have a string with groups separated by non-groups:

Damn, didn't know that. Now that feature's absence makes sense..

@schtandard
Contributor Author

Sure, this could be done, but it would require much more overhead, as modes that are already stacked together would need to be handled. One would also need to test for correct application of the different kinds of chains (a terminate_early chain can be inserted into a keep_going chain but not the other way around).

I'm not sure this is true... can you give me an example of what you don't think is possible with my suggestion?

To be clear: I was still referring to the two ways of chaining modes outlined in my original post. With those, there is just no space in the syntax for putting the first kind of chain into the second. If you try

starts: {
  endsParent: true,
  contains: [{...}],
  starts: {...}
}

that starts: {...} mode will never be entered, because the endsParent kills the parent mode at the moment when it would be entered.

Leaving out the endsParent is not an option, though, because that would allow going back in the chain (even skipping modes).

Also, putting a chain into another chain really just means concatenating them

Pretty sure this is not true once you get into optionals/repeats because they need to be wrapped differently, they are truly "submodes" of the chain, not just items in it.

Are you still talking about my second kind of chain here? In those, I don't see any way to "wrap" anything up into groups, since everything is just nesting. And putting something nested into another nesting (of the same kind) is indistinguishable from a single nesting.

If we are talking about your other suggestion on how this could maybe be realized, I'm not sure. It would have to become more specific, I guess. If new features are added to the parser, there may be a way, but right now I don't see any other possibility of combining starts and endsParent in such a way that one will not go backwards in the chain than the two kinds I described. If you know better, a concrete example would certainly help me.


I added a language with some (working and non-working) sample chains here, in case you want to play around with it a bit.

@schtandard
Contributor Author

Regardless, I would say that this is something you should really think about adding to the parser. To me, it was really surprising that it was not already in there.

I'm not sure I disagree, [...] but the kind of things you're trying to build here are really pushing the boundaries of what we've traditionally tried to do with most languages in the past.

I think you misunderstood. I was referring to being able to assign classNames to regex match groups. But if JavaScript doesn't really offer that possibility, then it's a moot point.

I just had a fresh thought... how much of this would a sequenced contains solve? This feels a lot more like the type of things I'm very often trying to describe in other grammars...

{ 
  // you'd probably want a look ahead for your primary match "\section"
  containsInOrder: [
    {...}, //  \section 
    {... optional: true }, //  [short]
    {...}, //  {long}
    {..., repeatable: true }
  ]

An "at least one" could be done with just a non-optional and then a repeatable (however it was abstracted)... The pattern would terminate whenever it could not find the next non-optional.

You could abstract it a bit nicer if you wanted but an embedded chain would be just another containsInOrder:

{ 
  containsInOrder: [
    { containsInOrder: [SECTION_MODE, GOBBLE_SPACE] }, //  \section 
    {... optional: true }, //  [short]
    {...}, //  {long}
    {..., repeatable: true }
  ]

I'm not sure we should support or encourage this (vs flat chaining which should work for MANY things)... but if we did I imagine you'd want the "can't move forward" to cascade up the chain... so if GOBBLE_SPACE was mandatory and then not found it would not just terminate the section mode, but the entire parent mode would end. And if containsInOrder was truly a parser level feature and didn't support nesting then of course someone could use starts to build some sort of simple manual nested if it truly proved to be necessary.

This would be very nice (and obviously much better than my hack here). I'm not sure I understand how this could interact with competing fields like starts and contains, but that could probably be figured out.

@schtandard
Contributor Author

I wonder: If this containsInOrder idea is something you want to pursue (which would presumably be a longer term project), would it still be possible to include this workaround in this grammar? Right now, it just creates the same starts chains that are already in the language in a different way (and to me at least, this chaining is easier to follow than the ARGUMENTS_AND_THEN stuff from before). If we want to highlight things like sectioning commands, environments, etc., similar chains are necessary (until a better alternative is ready).

Or would you prefer halting the work on this grammar until the chaining question has been resolved?

@joshgoebel
Member

joshgoebel commented Oct 25, 2020

that starts: {...} mode will never be entered, because the endsParent kills the parent mode at the moment when it would be entered.

Yeah, this stuff is super confusing and annoying. :)

If we are talking about your other suggestion on how this could maybe be realized, I'm not sure. It would have to become more specific, I guess.

I went and tried to implement it (with optional trees), it was gnarly, and didn't work because eventually you'd get stuck in an infinite loop bouncing from subtree back to the parent then matching the optional missing again and then starts tosses you down a layer again... there is no way to "escape". Not super excited about building any type of complex chaining with these building blocks. Maybe there is still a way, but it was giving me a headache.

...would it still be possible to include this workaround in this grammar? ... this chaining is easier to follow than the ARGUMENTS_AND_THEN stuff from before). If we want to highlight things like sectioning commands, environments, etc., similar chains are necessary (until a better alternative is ready).

I've given this thought and as-is, no - but I think we can likely find a reasonable compromise - let me show you what I propose. I think what you have violates some philosophies of grammar complexity. It's too complex and introduces new concepts that I don't think are a good general abstraction. I don't think mixing the concepts of the two different types of chaining in one nested structure is good. Implicit bad, explicit good. I'm not sure we even need two different types of chaining at all to solve this in a reasonable fashion. (or maybe you do but ARGUMENT_AND_THEN is hiding it, if so that's ok because ARGUMENT_AND_THEN can be understood at a quick glance.)

Grammars should strive to be simple, and modes (for the most part) should be embeddable, interchangeable, etc... if I have an array of modes its semantics shouldn't suddenly be different because suddenly it's nested vs top-level. This makes it impossible to know what any individual array does at the moment I'm reading it in source without knowing the FULL context - which may or may not be available at any given point... requiring one to read (and understand) the whole grammar to make sure they understand all the context... preferably the reader should need to read as little as possible.

So your helpers should strive to be useful, but not overly clever. So let's go back to what you had before and try to instead expand on that slightly. It seems the useful thing here is avoiding writing silly things like deeply nested calls to ARGUMENT_AND_THEN. ARGUMENT_AND_THEN seems a fine helper in its own right. It's simple, easy to follow, and easily composable. (a great helper)

So the call to minted:

    BEGIN_ENV('minted', ARGS(ARGUMENT_O, ARGUMENT_M, VERBATIM_DELIMITED_ENV('minted')))

ARGS takes a list of arguments and simply transforms it into an ARGUMENT_AND_THEN chain. Another tiny, modular helper.

  const ARGS = (...modes) => {
    const core = modes.pop()
    return modes.reverse().reduce((acc, n) => {
      return ARGUMENT_AND_THEN(n, acc);
    }, core);
  };
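To see the nesting that ARGS produces, it can be run against a stand-in. In this sketch ARGUMENT_AND_THEN is a stub that only records its two arguments; the real helper from the grammar is not shown in this thread, and the string arguments are placeholders for modes.

```javascript
// Stub, NOT the grammar's real ARGUMENT_AND_THEN: it just records nesting.
const ARGUMENT_AND_THEN = (arg, then) => ({ arg, then });

// ARGS as proposed above: the last mode is the core, every earlier mode
// wraps what follows it.
const ARGS = (...modes) => {
  const core = modes.pop();
  return modes.reverse().reduce((acc, n) => {
    return ARGUMENT_AND_THEN(n, acc);
  }, core);
};

console.log(JSON.stringify(ARGS("O", "M", "BODY")));
// {"arg":"O","then":{"arg":"M","then":"BODY"}}
```

So ARGS(a, b, core) is equivalent to ARGUMENT_AND_THEN(a, ARGUMENT_AND_THEN(b, core)), and a single argument is returned unchanged.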

And for mint... a simple flat chain:

// I was just hacking and ARGUMENT_M was wrapped in an array but I don't need the array, hence the [0]
// I assume this would change for the final PR
CSNAME('mint', FLAT_CHAIN(GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL())),

This is the "optional" variety of chaining since that's more in line with how the parser typically works. We should trust that the latex file makes some sort of sense, so the optionals should fire or not fire in reasonable fashion. Again, we shouldn't be trying hard to parse the latex. If some pre-fire sanity check needs to be done it should probably be done in the parent rule via a rough approximation look-ahead.

IE:

  const FLAT_CHAIN = (...modes) => {
    const core = { contains: [modes.pop()] };
    return modes.reverse().reduce((acc, mode) => {
      return {
        relevance: 0,
        contains: [mode],
        starts : acc
      };
    }, core);
  };

Of course you can merge an endsParent into mode there if necessary.
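A possible cleaned-up variant, sketched under assumptions: the name flatChain and the endsParent option are hypothetical, and throwing on an empty list is just one way to handle that edge case. Unlike the version above, it does not mutate the array it is given.

```javascript
// Builds the same starts-chain as FLAT_CHAIN, optionally merging
// endsParent into every link.
function flatChain(modes, { endsParent = false } = {}) {
  if (modes.length === 0) {
    throw new Error("flatChain needs at least one mode");
  }
  // Copy the mode instead of modifying the caller's object.
  const wrap = (mode) => (endsParent ? { ...mode, endsParent: true } : mode);
  const last = modes[modes.length - 1];
  return modes.slice(0, -1).reduceRight(
    (next, mode) => ({ relevance: 0, contains: [wrap(mode)], starts: next }),
    { contains: [wrap(last)] }
  );
}

const MODE_A = { begin: /A/ };
const MODE_B = { begin: /B/ };
const MODE_C = { begin: /C/ };
const SKIPPING_CHAIN = flatChain([MODE_A, MODE_B, MODE_C]);
const STRICT_ORDER_CHAIN = flatChain([MODE_A, MODE_B, MODE_C], { endsParent: true });
```

With a single mode the result is simply { contains: [mode] }, so callers do not need a special case.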


If that works well I might take FLAT_CHAIN and clean it up a bit (deal with the edge cases, empty array, single item, etc) and that seems like something that would be useful for many grammars... maybe even abstracting the API a bit:

// returns a singular mode representing the chain
// modes of course not being allowed to use `endsParent` or `starts`
MODES.chain(
  { sequence: [mode1, mode2, mode3], // will match one after the other (skipping any missing)
     lookahead: /.../ }  // optional, must match for the chain to even begin
)

If the modes were more of the "simple" variety (regex, no submodes, etc.) then chain could even construct the look-ahead itself... though I'm not sure if the functionality should suddenly change (required vs optional) just based on the types of modes, that would seem to violate principle of least surprise. One could imagine another optional flag though to make it explicit (but only supported with simple modes):

// requireAll only supported for simple modes
MODES.chain({ sequence: [mode1, mode2, mode3], requireAll: true })
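
As a rough sketch of how a chain helper could construct the look-ahead itself for simple, regex-only modes: the function below joins each mode's begin pattern into one zero-width look-ahead in which every link is required. requiredLookahead is a hypothetical name following the proposal above, not an existing highlight.js API, and the sketch ignores regex flags and numbered backreferences.

```javascript
// Hypothetical helper: combine the `begin` patterns of simple modes into a
// single zero-width look-ahead that only succeeds if all links are present.
const requiredLookahead = (modes) => {
  const parts = modes.map((mode) => `(?:${mode.begin.source})`);
  return new RegExp(`(?=${parts.join("")})`);
};

// Made-up simple modes for \section[short]{long}.
const simpleModes = [
  { begin: /\\section/ },
  { begin: /\[[^\]]*\]/ },
  { begin: /\{[^}]*\}/ },
];
const strict = requiredLookahead(simpleModes);

console.log(strict.test("\\section[short]{long}")); // true
console.log(strict.test("\\section{long}"));        // false, the [..] link is missing
```

Because the check happens before the chain begins, the chain itself can stay of the optional kind.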

@joshgoebel
Member

joshgoebel commented Oct 25, 2020

And personally I think the added context at the top layer helps, ie:

// explicit better
CSNAME('mint', FLAT_CHAIN(GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL())),
// or
CSNAME('mint', { chain: [GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL()] }),

// vs hiding the context (implicit)
CSNAME('mint', [GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL()]),

@joshgoebel joshgoebel added the WIP label Oct 31, 2020
@schtandard
Contributor Author

Thanks for the thorough notes and sorry for the long silence; busy times. I'll try and do this the way you outlined this weekend.

Do I understand you correctly that, if and when I need the non-optional kind of chain, I should just make them manually (like in the current version of the grammar) instead of writing a helper function (even if it's separate from the optional-chain helper)?

@joshgoebel
Member

joshgoebel commented Nov 8, 2020

Do I understand you correctly that, if and when I need the non-optional kind of chain, I should just make them manually (like in the current version of the grammar) instead of writing a helper function (even if it's separate from the optional-chain helper)?

I'm pretty sure I wasn't saying no helpers, only simpler, more modular helpers vs a single large chain helper with different semantics for nested vs non-nested arrays, etc., as in my example for arguments. You already have this "argument and then" concept, which works well, so chain it by wrapping the concept of an argument list:

  const ARGS = (...modes) => {
    const core = modes.pop();
    return modes.reverse().reduce((acc, n) => {
      return ARGUMENT_AND_THEN(n, acc);
    }, core);
  };

I'm not sure if you're calling this "manual"... I'd call it helpers all the way down, just using two smaller helpers vs one super "smart" helper. And please use array helpers (reduce, etc.) whenever possible rather than recursion. Recursion is much harder for many to reason about than the more familiar built-in array helpers.

So both my ARGS and FLAT_CHAIN were optional chains, I believe. I'm still questioning the need for these complex chains. I feel like your first urge is to try to parse the language rather than doing a simpler thing. For example, in a perfect world we'd ignore the full sequences completely (not trying to understand them at all) and just highlight \command, {.*}, [.*], i.e. simple match expressions. I think I understand why that's problematic for LaTeX, but even so we still want to stick as close to the "intention" of that ideal as possible.

In this case (mandatory chain) I feel like maybe you're trying to figure out what is valid or invalid LaTeX, which is not our job. If we see a \widget "command" and that should mean \widget[optional]{required}{required}, then a simple rule that does:

  • match \widget (to start the chain)
  • optional match [optional]
  • optional match {required}
  • optional match {required}

should get the job done for valid LaTeX (correct me if I'm wrong), no? And remember that this is already much, much more complex than what we'd typically prefer to see...
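
To illustrate what "optional" means for such a rule, here is a toy model that walks a list of patterns over the input and simply skips any link that does not match at the current position (only the first link is mandatory, since it starts the chain). This is a simulation of the intended behavior, not how the highlight.js engine is implemented; matchOptionalChain and the patterns are made up for this example.

```javascript
// Toy model: try each link at the current position; skip links that are absent.
const matchOptionalChain = (patterns, input) => {
  let pos = 0;
  const matched = [];
  for (const [index, pattern] of patterns.entries()) {
    const sticky = new RegExp(pattern.source, "y"); // "y" anchors at lastIndex
    sticky.lastIndex = pos;
    const m = sticky.exec(input);
    if (m && m[0].length > 0) {
      matched.push(m[0]);
      pos += m[0].length;
    } else if (index === 0) {
      return []; // the chain never started
    }
  }
  return matched;
};

// \widget[optional]{required}{required}
const WIDGET = [/\\widget/, /\[[^\]]*\]/, /\{[^}]*\}/, /\{[^}]*\}/];

console.log(matchOptionalChain(WIDGET, "\\widget[opt]{a}{b}")); // all four links
console.log(matchOptionalChain(WIDGET, "\\widget{a}{b}"));      // [optional] skipped
```

For valid input both calls consume the whole command, which is the behavior the optional chain is meant to provide.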

Anytime I see anything like "a MUST follow B which MUST follow C which MUST follow D" [that can't be written in its entirety as a simple regex] I get very nervous. This is not the type of complexity we generally have or desire in grammars. Honestly, what you'd probably want at that point is a back-tracking parser, i.e. if C and D are missing then really the entire ABCD chain becomes invalid. But that's not possible if you can't cover the whole expression with a regex (i.e. making sure it's all present before you even begin the chain).

We'd much prefer to use simple heuristics (whenever possible) rather than super-complex nested rule trees. Is there a common real life scenario where an optional chain wouldn't get the job done adequately? At a first glance I thought what I was proposing would solve the existing problems we have in the grammar.

IE, if you took my ARGS and FLAT_CHAIN functions as-is and used them (I already did this, but only for 2-3 rules)... what problems are remaining that are unsolved? I'll find my branch and push it.

https://github.com/joshgoebel/highlight.js/tree/latex_suggestion

@joshgoebel
Member

Related:

The work over in #2834 may be of interest also. Pretty much syntactic sugar around helpers, but useful helpers would find their way into the core parser... so we'd have an easy path for "omg, this grammar totally nailed this concept => add to parser" (no such path exists for helpers now). Not everything could be accomplished this way, but many things we already model (just with difficulty) can be.

I wonder if this would be much simplified with something like beforeMatch [as seen in #2824] (and loosening of the 0-width rule, which is also in the works). IE, I think your SPACE concerns could probably then be handled with beforeMatch, yes? That would simplify ARGUMENT_AND_THEN. I also wonder if loosening the 0-width match rule wouldn't simplify all this... i.e. entire chains could share the same structure rather than necessitating mixed chaining types.

So let's imagine we live in a world where 0-width rules are permitted (so long as they don't result in infinite loops)... does that make everything here easier to accomplish? And perhaps right there is actually an argument for mandatory chains with optional (as in 0-width) components...

@joshgoebel
Member

joshgoebel commented Nov 8, 2020

This is not the type of complexity we generally have or desire in grammars.

It's also certain that this is a moving target, i.e. as we find the right abstractions and get them into the core parser (and they are well understood), we become more comfortable with grammars doing crazier things (like we're already more comfortable with certain ugly patterns, because they are common patterns). Definitely likely. And it may even be possible that what lands in the core parser isn't that far from your original idea... though I think you'll find key differences/simplifications.

Like conceptually beforeMatch presents as a single rule... and for all intents and purposes would behave as such. This is very different conceptually than a chain: [SPACE, rule]... So something you'd represent as:

MODE_CHAIN([A, [SPC, C], [SPC, E]]) // nested chains

Becomes merely:

MODE_CHAIN(A, B, C) // just a simple chain (with space concerns appearing just to be part of B & C's rules)

And if we fix 0-width rules then something you'd have written as: (all required)

// [['\section', spaces], ['[', shortform, ']', spaces], ['{', longform, '}']]
MODE_CHAIN([[A, A2], [B, C, D], [E, F, G]])

Would simplify to:

// either (with A expressed with sugar)
FLAT_CHAIN(A,FLAT_CHAIN(B,C,D),FLAT_CHAIN(E,F,G))
// nested chains
FLAT_CHAIN(FLAT_CHAIN(A,A2),FLAT_CHAIN(B,C,D),FLAT_CHAIN(E,F,G))

The key difference being that there is nothing magical about nested chains, so they could just be flattened easily (because nested arrays have no different semantics than un-nested ones):

FLAT_CHAIN(A, A2, B, C, D, E, F, G)

Though you'd probably want to consider DEF being covered by an over-arching regex/rule (if possible), so then we'd be right back to a simple chain:

FLAT_CHAIN(A, B, DEF)

Just more of a high level overview of how I think I see the optimal abstraction here off the top of my head.
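
The point about nesting carrying no extra semantics can be sketched as follows: if nested chains are written as nested arrays, flattening them before building the chain yields the same sequence of links. flattenChain is a hypothetical helper and the strings stand in for real modes.

```javascript
// Hypothetical helper: nested arrays mean nothing special, so flatten them.
const flattenChain = (...links) => links.flat(Infinity);

const grouped = flattenChain(["A", "A2"], ["B", "C", "D"], ["E", "F", "G"]);
const mixed = flattenChain("A", ["A2", ["B", "C", "D"]], "E", "F", "G");

console.log(grouped.join(", ")); // A, A2, B, C, D, E, F, G
console.log(mixed.join(", "));   // A, A2, B, C, D, E, F, G
```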

@joshgoebel
Member

joshgoebel commented Nov 8, 2020

IE something like this:

let rule = {
  chain: [
    {
      className: "highlight",
      match: /\\section/,
      followedBy: optional(WHITESPACE)
    },
    {
      chain: [ /\[/, SHORT_FORM, /\]/ ],
      followedBy: optional(WHITESPACE)
    },
    {
      chain: [ /\{/, LONG_FORM, /\}/ ]
    },
  ]
}

Would just collapse to a flat chain with some potential 0-width matchers (whitespace)... and if a single complex element becomes optional, then the right way to represent it might just be multiple trees, not a single crazy tree trying to do it all.

IE:

let rule = {
  chain: [
    {
      className: "highlight",
      match: /\\section/,
      followedBy: optional(WHITESPACE)
    },
    {
      optional: true,
      chain: [ /\[/, SHORT_FORM, /\]/ ],
      followedBy: optional(WHITESPACE)
    },
    {
      chain: [ /\{/, LONG_FORM, /\}/ ]
    },
  ]
}

The proper way to compile this may be [very abstractly]:

{
  variants: [
    { chain: [ /\\section/, [ /\[/, SHORT_FORM, /\]/ ], [ /\{/, LONG_FORM, /\}/ ] ] },
    { chain: [ /\\section/, [ /\{/, LONG_FORM, /\}/ ] ] },
  ]
}

Internally you'd end up with a tree... That would be easier to debug since at any given branch you could figure out WHICH tree you were inside of rather than staring at a SINGLE massive mode trying to cover all possibilities. And within each chain you're still dealing with just a singular sequence of the same type of chaining. Maybe it's a crazy idea, I dunno. :) It also feels a lot more like real parsing as well. :-/

Yes, that would get gnarly for a long list of optionals, but again to that I'd say: we're not a parser... and at the point we're jumping thru too many hoops we may need to step back and simplify things or move the goal posts.
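
A sketch of that compilation step, assuming chain elements are plain objects that may carry an `optional: true` flag as in the example above. expandOptionals is a hypothetical name and the `name` fields are placeholders for real modes. It returns one all-required sequence per combination, which also shows why a long list of optionals gets gnarly: k optional links yield 2^k variants.

```javascript
// Hypothetical compile step: turn one chain with optional links into a list
// of variants in which every remaining link is required.
const expandOptionals = (chain) =>
  chain.reduce(
    (variants, link) => {
      const { optional, ...required } = link;
      const withLink = variants.map((variant) => [...variant, required]);
      // An optional link doubles the variants: one set with it, one without.
      return optional ? [...withLink, ...variants] : withLink;
    },
    [[]]
  );

const sectionChain = [
  { name: "csname" },
  { name: "shortForm", optional: true },
  { name: "longForm" },
];

const variantNames = expandOptionals(sectionChain).map((variant) =>
  variant.map((link) => link.name).join(" ")
);
console.log(variantNames);
// two variants: "csname shortForm longForm" and "csname longForm"
```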

@joshgoebel
Member

joshgoebel commented Nov 8, 2020

Just because that was quite a lot to read... if you want to move forward with utmost speed (without waiting or exploring the use of any of the new things in the works) then I'd suggest:

  • IE, if you took my ARGS and FLAT_CHAIN functions as-is and used them (I already did this, but only for 2-3 rules)... what problems are remaining that are unsolved? Specific examples would help.

Also, perhaps reconsider whether ALL this complexity is really needed for MOST common cases. If there are 2-3 weird contextual cases, could those be matched with an EXPLICIT rule and then just have a list of generic rules to handle all the remaining more boring cases?

For example, it seems the real "odd duck" here is VERBATIM_DELIMITED_EQUAL... (which could even be { I'm presuming)... but once you've handled those 4 cases (verb, lstinline, mint, mintinline), could a super simple ruleset highlight other arguments individually without any context at all? Or perhaps only within the context of \something[arguments here][delimiter][content here]?


On an entirely related/unrelated note... if you had free rein to just write a [small] JS parser to parse LaTeX however you pleased and then spit out a token list/tree at the end... would that make this problem easier? Harder?

@joshgoebel
Member

joshgoebel commented Nov 8, 2020

Would there be a single real mid-size document/manuscript somewhere that used 95% of common LaTeX that I could use to play around? It would really help to have something large to go back to, to perhaps answer some of my own questions. The tests aren't very good at this because I feel they have a lot of hidden assumptions baked in (and often don't represent real content, i.e. LaTeX PLUS text content, etc.).

@joshgoebel
Member

joshgoebel commented Nov 8, 2020

Let's take one simple example of simplifying:

\href[options]{url}{text}

If we don't care about options (we don't color them or give them a class) why do we need to match it at all? Why not just match:

\href ... {url} 

But I assume it's about their submodes (they could contain a control sequence, etc.), so then why not:

\href [optional]* {url} 

Why MUST the highlighter KNOW that href has a single optional arg? Why can't optional arguments be handled the way you handle whitespace in many places? Consume it if it's there just to get onto the next thing we truly care about...
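
A minimal sketch of that idea as a plain regular expression: skip any number of bracketed optional arguments after the control sequence and capture the first braced argument. The pattern is an illustration only (it does not handle nested brackets or braces) and is not taken from the grammar; firstBracedArgument is a made-up name.

```javascript
// \href, then zero or more [..] groups (with optional whitespace), then {url}.
const HREF = /\\href(?:\s*\[[^\]]*\])*\s*\{([^}]*)\}/;

const firstBracedArgument = (source) => {
  const m = HREF.exec(source);
  return m ? m[1] : null;
};

console.log(firstBracedArgument("\\href[pdfnewwindow]{https://example.org}{text}")); // https://example.org
console.log(firstBracedArgument("\\href{https://example.org}{text}"));               // https://example.org
console.log(firstBracedArgument("\\emph{text}"));                                    // null
```

The rule does not need to know how many optional arguments \href takes; it just consumes whatever is there on the way to the part it cares about.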

@joshgoebel
Member

Also relevant: #2838

@joshgoebel joshgoebel added the autoclose Flag things to future autoclose. label Jan 29, 2021
@joshgoebel
Member

Are there any good small, light-weight LaTeX parsers/lexers already written in JS that could do all the tokenization heavy lifting for us, etc.?

Context: my last comment here highlightjs/highlightjs-rdflang#2

@joshgoebel
Member

joshgoebel commented Mar 17, 2021

Adding for reference since the linked branch above may not live forever. Auto-closing.

commit fe01f54b13d54d67bdb85d73376d3ea7d54391df
Author: Josh Goebel <me@joshgoebel.com>
Date:   Sun Oct 25 13:39:39 2020 -0400

    revert/wip

diff --git a/src/languages/latex.js b/src/languages/latex.js
index 0cadd1a0..9eea8093 100644
--- a/src/languages/latex.js
+++ b/src/languages/latex.js
@@ -124,6 +124,18 @@ export default function(hljs) {
     MAGIC_COMMENT,
     COMMENT
   ];
+  const GOBBLE_SPACES = {
+    begin: /[ \t]+(?:\r?\n[ \t]*)?|\r?\n[ \t]*/,
+    relevance: 0
+  };
+  const GOBBLE_SPACES_NO_NEWLINE = {
+    begin: /[ \t]+/,
+    relevance: 0
+  };
+  const GOBBLE_NEWLINE = {
+    begin: /\r?\n/,
+    relevance: 0
+  };
   const BRACE_GROUP_NO_VERBATIM = {
     begin: /\{/, end: /\}/,
     relevance: 0,
@@ -220,9 +232,26 @@ export default function(hljs) {
       }
     };
   };
+  const ARGS = (...modes) => {
+    const core = modes.pop()
+    return modes.reverse().reduce((acc, n) => {
+      return ARGUMENT_AND_THEN(n, acc);
+    }, core);
+  };
+  const FLAT_CHAIN = (...modes) => {
+    const core = { contains: [modes.pop()] };
+    return modes.reverse().reduce((acc, n) => {
+      return {
+        relevance: 0,
+        contains: [n],
+        starts : acc
+      };
+    }, core);
+  };
   const VERBATIM = [
     ...['verb', 'lstinline'].map(csname => CSNAME(csname, {contains: [VERBATIM_DELIMITED_EQUAL()]})),
-    CSNAME('mint', ARGUMENT_AND_THEN(ARGUMENT_M, {contains: [VERBATIM_DELIMITED_EQUAL()]})),
+    // CSNAME('mint', ARGUMENT_AND_THEN(ARGUMENT_M, {contains: [VERBATIM_DELIMITED_EQUAL()]})),
+    CSNAME('mint', FLAT_CHAIN(GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL())),
     CSNAME('mintinline', ARGUMENT_AND_THEN(ARGUMENT_M, {contains: [VERBATIM_DELIMITED_BRACES(), VERBATIM_DELIMITED_EQUAL()]})),
     CSNAME('url', {contains: [VERBATIM_DELIMITED_BRACES("link"), VERBATIM_DELIMITED_BRACES("link")]}),
     CSNAME('hyperref', {contains: [VERBATIM_DELIMITED_BRACES("link")]}),
@@ -234,7 +263,8 @@ export default function(hljs) {
         BEGIN_ENV(prefix + 'Verbatim' + suffix, ARGUMENT_AND_THEN(ARGUMENT_O, VERBATIM_DELIMITED_ENV(prefix + 'Verbatim' + suffix)))
       )
     ])),
-    BEGIN_ENV('minted', ARGUMENT_AND_THEN(ARGUMENT_O, ARGUMENT_AND_THEN(ARGUMENT_M, VERBATIM_DELIMITED_ENV('minted')))),
+    BEGIN_ENV('minted', ARGS(ARGUMENT_O, ARGUMENT_M, VERBATIM_DELIMITED_ENV('minted')))
+    // BEGIN_ENV('minted', ARGUMENT_AND_THEN(ARGUMENT_O, ARGUMENT_AND_THEN(ARGUMENT_M, VERBATIM_DELIMITED_ENV('minted')))),
   ];
 
   return {

@joshgoebel joshgoebel closed this Mar 17, 2021