enh(latex): Implement an easy to use chaining mechanism #2776
This is a lot to digest. There are a few high level issues here that I think about from the perspective of the core library. First, easy chaining is definitely something we're lacking. Yet I'm also hesitant to pursue a solution that isn't something that I could imagine potentially being rolled forward into a more generic solution that the entire library could use. IE, each grammar (or even a few) building their own quite complex meta-constructs isn't something I want to encourage. And complex chaining definitely crosses over into that territory. It's also quite likely that something of this complexity needs its own specific tests apart from grammar tests. I'd rather we perhaps try and first imagine the good abstraction (ie, how to describe chains in the grammar) [looking across multiple languages]... did you read the other issue thread on this topic? And then decide if that's best done with meta-modes or if we need to build some new functionality into the parser itself. Now I'm not in a hurry to add more and more features to the core parser, but this may be an area where it's really worth looking at doing so - or if there are better building blocks that should exist to support this than I wonder how much of this could be solved if we simply had a way to color different regex groups differently? Would you need mode chaining at all then?
You completely lost me here. Are we saying that "inner" chains must always require each element (or terminate) but that in "outer" chains any element is optional? Or was it the reverse? The inner/outer naming doesn't help in understanding I don't think.
You forgot to mention that in the first form matches can easily repeat also:
Since contains will just keep looking for the next match until no more can be found. |
I also feel at a gut level that perhaps the chaining implementation should not deal with recursive chaining on its own... if one wants a chain inside a chain, then write it as such:
var MODE_C = CHAIN(X,Y,Z)
var MATCH = CHAIN(MODE_A, MODE_B, MODE_C)
If there can be different types of chains then each chain should probably express its own type. Just spitballing:
var MODE_C = CHAIN(X,Y,Z, { type: "keep_going"})
var MATCH = CHAIN(MODE_A, MODE_B, MODE_C, { type: "terminate_early"})
I almost always prefer explicit over implicit... |
I also wonder if we're not using the existing grammar to its full potential... if your rule is only matching a single regex then you don't need a "chain" to say "A, then spaces" (or "spaces then A"), you can do that all in a single rule:
{
  begin: /my expression/, // the match
  end: /\s*/, // space eater expression
  excludeEnd: true
} |
Are you sure the ends parents are even required in the second form? The starts chain should really unravel itself when done... the only reason endsParent is needed for contains is because you're using the child AS the parent in the way... |
If we're sticking with what the parser currently has implemented I'm far more partial to the second form - and having a single top level form that's used for chaining... as it's more explicit... every item leads into the next... and then if you want to deal with optional items or repeatable items you do that within the framework of a single form of chaining:
Where you'd embed this complexity in the mode themselves. So instead of: // non repeatable (required, or sequence will short-circuit)
contains: [{
begin: /D/,
starts: { You'd end up generating something like: // repeatable (0-x times)
contains: [{
begin: /\b|\B/, // immediate match
contains: [ { begin: /D/ }]
starts: { But this is all encapsulated in the |
True.
You mean #1140? I had not, but I did now. Haven't quite wrapped my head around the implications of that proposal, though. I don't think I'm familiar enough with the internals of the parser to understand your suggestion of hacking it in by making What I did here was just make the best of what's already in the parser (using
That's true. I can't imagine anything now from the top of my head, maybe I'll think of something in time..
This would indeed greatly reduce the need for chaining, but not eliminate it. Regardless, I would say that this is something you should really think about adding to the parser. To me, it was really surprising that it was not already in there. All the information is there (the regex match groups), one just needs to add a way of assigning them It would not totally eliminate the need for chaining, though, as some patterns cannot (easily) be matched using regexes. One example are groups of matched braces. Another example is the need for further highlighting inside one of the chain links. For example, in
Yes, in outer chains each link is optional and may be skipped, while inner chains can only be terminated early, but require each link apart from that. It's true that the inner/outer naming doesn't help here. I chose it because inner chains can be inside outer chains but not the other way around.
No, that's what the To make it explicit, the chain
|
Sure, this could be done, but it would require much more overhead, as modes that are already stacked together would need to be handled. One would also need to test for correct application of the different kinds of chains (a If you are looking to put this into core, this might be reasonable, but I'm not sure. I feel that the current structure is much easier to understand: Modes can be input in the form of 2-dimensional arrays, leading to the equivalent chain. Only modes that do not use |
That's true in the example I gave above. (Maybe I should have chosen a more complicated one.) It's only a special case, though. Really the example should have been \section [short] {long} where both short and long may contain other highlighted material as well as balanced brace groups and the spaces may be many spaces/tabs and single newlines. Sure, the spaces after |
Not sure what you mean here. But yes, the |
That would be a nice syntax, yes. I see some difficulties with this, though (not sure how easy they are to overcome):
|
Except it's not because JS regex match doesn't give you positioning data for submatches... so if you have a string with groups separated by non-groups:
It can be very hard to piece it back together later since all we get are the matches not the non-matches... so now you'd have to rewrite the regex to make everything a match group and then add a layer of abstraction... and now you're getting into performance concerns maybe.
Ah, yes. Correct. |
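To illustrate the point about submatch positions: a plain `exec` result carries the index of the whole match only, so one workaround is the one described above, wrapping every part of the pattern (separators included) in its own group and summing lengths. This is only a sketch in plain JavaScript; `groupOffsets` and the sample pattern are made up for illustration and are not part of highlight.js.

```javascript
// Every part of the pattern is a capture group, including the whitespace
// separators, so the groups tile the whole match from left to right.
const pattern = /(\\section)(\s*)(\[[^\]]*\])?(\s*)(\{[^}]*\})/;

// Hypothetical helper: rebuilds start/end offsets for each group by
// accumulating group lengths from the overall match index.
function groupOffsets(re, text) {
  const match = re.exec(text);
  if (match === null) return null;
  let cursor = match.index;
  const spans = [];
  for (let i = 1; i < match.length; i++) {
    const piece = match[i];
    if (piece === undefined) {
      spans.push(null); // optional group that did not participate
      continue;
    }
    spans.push({ group: i, start: cursor, end: cursor + piece.length, text: piece });
    cursor += piece.length;
  }
  return spans;
}
```

This only works when the groups are contiguous and unnested, which is the extra layer of abstraction (and potential performance cost) mentioned above.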
I'm not sure this is true... can you give me an example of what you don't think is possible with my suggestion?
Pretty sure this is not true once you get into optionals/repeats because they need to be wrapped differently, they are truly "submodes" of the chain, not just items in it.
Nested chains with one having entirely different semantics than the other doesn't seem simple to me. :-) |
It's working here for me (in the 2nd case) without any issue... what are we worried about repeated, the individual items? I don't see how that's possible in the 2nd form, but I'll take a second look. |
This can be fixed in the parser... the real issue is with an infinite loop of 0-width, not just 0-width in general... so that requirement could be relaxed.
Wait, what? I just gave an example of 0-many... it's just a |
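For readers unfamiliar with the 0-width problem: the same hazard exists with plain JavaScript regexes, independent of highlight.js. A pattern that can match the empty string never moves the cursor, so a scanning loop has to advance by hand. The sketch below is not the parser's code; `scanAll` is a hypothetical name used only to show the loop and the guard.

```javascript
// Collects all non-empty matches of `re` in `text`.
// Without the manual lastIndex bump, an empty match would be returned at
// the same position forever and the loop would never terminate.
function scanAll(re, text) {
  const scanner = new RegExp(re.source, "g");
  const found = [];
  let m;
  while ((m = scanner.exec(text)) !== null) {
    if (m[0].length === 0) {
      scanner.lastIndex++; // zero-width match: step forward one character
      continue;
    }
    found.push(m[0]);
  }
  return found;
}
```

A single 0-width match is harmless here; only the repeated 0-width match at the same position is a problem, which is the distinction drawn above.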
I'm not sure I disagree, but I'm also not 100% sure this isn't or wasn't intentional. Again we've always been a pattern matcher + a bit of context... Also worth noting the 0-width is only a hard error in debug mode... in production mode it's a soft error and the parser just advances a token... which might not help this case (it can skip things), but it's just something worth knowing. I just had a fresh thought... how much of this would a sequenced {
// you'd probably want a look ahead for your primary match "\section"
containsInOrder: [
{...}, // \section
{... optional: true }, // [short]
{...}, // {long}
{..., repeatable: true }
] An "at least one" could be done with just a non-optional and then a repeatable (however it was abstracted)... The pattern would terminate whenever it could not find the next non-optional. You could abstract it a bit nicer if you wanted but an embedded chain would be just another containsInOrder: {
containsInOrder: [
{ containsInOrder: [SECTION_MODE, GOBBLE_SPACE] }, // \section
{... optional: true }, // [short]
{...}, // {long}
{..., repeatable: true }
] I'm not sure we should support or encourage this (vs flat chaining which should work for MANY things)... but if we did I imagine you'd want the "can't move forward" to cascade up the chain... so if GOBBLE_SPACE was mandatory and then not found it would not just terminate the section mode, but the entire parent mode would end. And if |
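The sequencing rules floated above (match in order, skip optionals, repeat repeatables, stop at the first missing non-optional) can be sketched outside the parser with sticky regexes. Everything here is hypothetical: `matchInOrder`, the step shape, and the sample patterns are illustration only, and `containsInOrder` itself is just an idea in this thread, not an existing highlight.js feature.

```javascript
// Each step: { name, match: RegExp, optional?: boolean, repeatable?: boolean }.
// A repeatable step is treated as "zero or more", as suggested above.
function matchInOrder(steps, text, start = 0) {
  let pos = start;
  const matched = [];
  for (const step of steps) {
    const re = new RegExp(step.match.source, "y"); // sticky: match only at pos
    let count = 0;
    while (true) {
      re.lastIndex = pos;
      const m = re.exec(text);
      // Stop on no match, and also on a 0-width match so we cannot loop forever.
      if (m === null || m[0].length === 0) break;
      matched.push({ name: step.name, text: m[0] });
      pos += m[0].length;
      count++;
      if (!step.repeatable) break;
    }
    // A required step that is missing terminates the whole sequence.
    if (count === 0 && !step.optional && !step.repeatable) break;
  }
  return { matched, end: pos };
}

const SECTION_STEPS = [
  { name: "command", match: /\\section\s*/ },
  { name: "short", match: /\[[^\]]*\]\s*/, optional: true },
  { name: "long", match: /\{[^}]*\}/ },
];
```

Calling `matchInOrder(SECTION_STEPS, "\\section [a] {b}")` walks all three steps, while an input without the bracketed part simply skips the optional step.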
Damn, didn't know that. Now that feature's absence makes sense.. |
To be clear: I was still referring to the two ways of chaining modes outlined in my original post. With those, there is just no space in the syntax for putting the first kind of chain into the second. If you try
that Leaving out the
Are you still talking about my second kind of chain here? In those, I don't see any way to "wrap" anything up into groups, since everything is just nesting. And putting something nested into another nesting (of the same kind) is indistinguishable from a single nesting. If we are talking about your other suggestion on how this could maybe be realized, I'm not sure. It would have to become more specific, I guess. If new features are added to the parser, there may be a way, but right now I don't see any other possibility of combining I added a language with some (working and non-working) sample chains here, in case you want to play around with it a bit. |
I think you misunderstood. I was referring to being able to assign
This would be very nice (and obviously much better than my hack here). I'm not sure I understand how this could interact with competing fields like |
I wonder: If this Or would you prefer halting the work on this grammar until the chaining question has been resolved? |
Yeah, this stuff is super confusing and annoying. :)
I went and tried to implement it (with optional trees), it was gnarly, and didn't work because eventually you'd get stuck in an infinite loop bouncing from subtree back to the parent then matching the optional missing again and then starts tosses you down a layer again... there's no way to "escape". Not super excited about building any type of complex chaining with these building blocks. Maybe there is still a way, but it was giving me a headache.
I've given this thought and as-is, no - but I think we can likely find a reasonable compromise - let me show you what I propose. I think what you have violates some philosophies of grammar complexity. It's too complex and introduces new concepts that I don't think are a good general abstraction. I don't think mixing the concepts of the two different types of chaining in one nested structure is good. Implicit bad, explicit good. I'm not sure we even need two different types of chaining at all to solve this in a reasonable fashion. (or maybe you do but ARGUMENT_AND_THEN is hiding it, if so that's ok because ARGUMENT_AND_THEN can be understood at a quick glance.)
Grammars should strive to be simple, and modes (for the most part) should be embeddable, interchangeable, etc... if I have an array of modes its semantics shouldn't suddenly be different because suddenly it's nested vs top-level. This makes it impossible to know what any individual array does at the moment I'm reading it in source without knowing the FULL context - which may or may not be available at any given point... requiring one to read (and understand) the whole grammar to make sure they understand all the context... preferably the reader should need to read as little as possible. So your helpers should strive to be useful, but not overly clever.
So let's go back to what you had before and try to instead expand on that slightly. It seems the useful thing here is avoiding writing silly things like deeply nested calls to ARGUMENT_AND_THEN. ARGUMENT_AND_THEN seems a fine helper in its own right. It's simple, easy to follow, and easily composable. (a great helper) So the call to
BEGIN_ENV('minted', ARGS(ARGUMENT_O, ARGUMENT_M, VERBATIM_DELIMITED_ENV('minted')))
const ARGS = (...modes) => {
const core = modes.pop()
return modes.reverse().reduce((acc, n) => {
return ARGUMENT_AND_THEN(n, acc);
}, core);
}; And for // I was just hacking and ARGUMENT_M was wrapped in an array but I don't need the array, hence the [0]
// I assume this would change for the final PR
CSNAME('mint', FLAT_CHAIN(GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL())), This is the "optional" variety of chaining since that's more in line with how the parser typically works. We should trust that the latex file makes some sort of sense, so the optionals should fire or not fire in reasonable fashion. Again, we shouldn't be trying hard to parse the latex. If some pre-fire sanity check needs to be done it should probably be done in the parent rule via a rough approximation look-ahead. IE: const FLAT_CHAIN = (...modes) => {
const core = { contains: [modes.pop()] };
return modes.reverse().reduce((acc, mode) => {
return {
relevance: 0,
contains: [mode],
starts : acc
};
}, core);
}; Of course you can merge an If that works well I might take // returns a singular mode representing the chain
// modes of course not being allowed to use `endsParent` or `starts`
MODES.chain(
{ sequence: [mode1, mode2, mode3], // will match one after the other (skipping any missing)
lookahead: /.../ } // optional, must match for the chain to even begin
) If the modes were more of the "simple" variety (regex, no submodes, etc.) then // required only supported for simple modes
MODES.chain({ sequence: [mode1, mode2, mode3], requireAll: true }
) |
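To make the shape that `FLAT_CHAIN` produces easier to see, here is the helper from the comment above applied to three placeholder modes, plus a small walker that follows the `starts` links. The placeholder modes and the `chainOrder` function are assumptions for illustration; the real grammar would pass modes such as GOBBLE_SPACES or ARGUMENT_M.

```javascript
// FLAT_CHAIN as proposed in the comment above.
const FLAT_CHAIN = (...modes) => {
  const core = { contains: [modes.pop()] };
  return modes.reverse().reduce((acc, mode) => {
    return {
      relevance: 0,
      contains: [mode],
      starts: acc
    };
  }, core);
};

// Placeholder modes, purely illustrative.
const A = { className: "a", begin: /A/ };
const B = { className: "b", begin: /B/ };
const C = { className: "c", begin: /C/ };

// Hypothetical inspector: follows `starts` and lists each level's mode.
const chainOrder = (mode) => {
  const names = [];
  for (let m = mode; m; m = m.starts) names.push(m.contains[0].className);
  return names;
};

const chained = FLAT_CHAIN(A, B, C);
```

Each level contains exactly one mode and hands over to the next level through `starts`, so the first mode passed in ends up outermost and the last one sits in the innermost `contains`.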
And personally I think the added context at the top layer helps, ie: // explicit better
CSNAME('mint', FLAT_CHAIN(GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL())),
// or
CSNAME('mint', { chain: [GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL()] }),
// vs hiding the context (implicit)
CSNAME('mint', [GOBBLE_SPACES, ARGUMENT_M[0], GOBBLE_NEWLINE, VERBATIM_DELIMITED_EQUAL()]), |
Thanks for the thorough notes and sorry for the long silence; busy times. I'll try and do this the way you outlined this weekend. Do I understand you correctly that, if and when I need the non-optional kind of chain, I should just make them manually (like in the current version of the grammar) instead of writing a helper function (even if it's separate from the optional-chain helper)? |
I'm pretty sure I wasn't saying no helpers, but only simpler more modular helpers vs a single large chain helper with different semantics for nested vs non-nested arrays, etc... as my example for arguments. You already have this "argument and then" concept, which works well... so chain it by wrapping the concept of an argument list: const ARGS = (...modes) => {
const core = modes.pop()
return modes.reverse().reduce((acc, n) => {
return ARGUMENT_AND_THEN(n, acc);
}, core);
}; I'm not sure if you're calling this "manual"... I'd call it helpers all the way down, just using two smaller helpers vs one super "smart" helper. And please use array helpers ( So both my In this case (mandatory chain) I feel like maybe you're trying to figure out what is valid or invalid LaTeX, which is not our job. If we see a
Should get the job done for valid LaTeX (correct me if I'm wrong), no? And remember that this already is much much more complex than what we'd typically prefer to see... Anytime I see anything like "a MUST follow B which MUST follow C which MUST follow D" [that can't be written in its entirety as a simple regex] I get very nervous. This is not the type of complexity we generally have or desire in grammars. Honestly what you'd probably want at that point is a back-tracking parser... ie, if C and D are missing then really the entire ABCD chain becomes invalid... but that's not possible (if you can't cover the whole expression with a regex expression - ie making sure it's all present before you even begin the chain). We'd much prefer to use simple heuristics (whenever possible) rather than super-complex nested rule trees. Is there a common real life scenario where an optional chain wouldn't get the job done adequately? At a first glance I thought what I was proposing would solve the existing problems we have in the grammar. IE, if you took my https://github.com/joshgoebel/highlight.js/tree/latex_suggestion |
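As a quick check of how the `ARGS` helper from this comment nests its arguments, the sketch below swaps in a stub for `ARGUMENT_AND_THEN` that just records its two inputs. The stub and the string placeholders are assumptions; the grammar's real helper builds modes. `reduceRight` is used here instead of `reverse().reduce(...)`, which yields the same nesting.

```javascript
// Stub standing in for the grammar's ARGUMENT_AND_THEN (assumption:
// it takes an argument mode and the mode that should follow it).
const ARGUMENT_AND_THEN = (arg, then) => ({ arg, then });

// Same idea as ARGS above: the last mode is the core, and every earlier
// mode is wrapped around it from right to left.
const ARGS = (...modes) => {
  const core = modes.pop();
  return modes.reduceRight((acc, n) => ARGUMENT_AND_THEN(n, acc), core);
};

const nested = ARGS("ARGUMENT_O", "ARGUMENT_M", "VERBATIM");
```

So `ARGS(O, M, V)` is equivalent to writing `ARGUMENT_AND_THEN(O, ARGUMENT_AND_THEN(M, V))` by hand.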
Related: The work over in #2834 may be of interest also. Pretty much syntactic sugar around helpers, but useful helpers would find their way into the core parser... so we'd have an easy path for "omg, this grammar totally nailed this concept => add to parser" (no such path exists for helpers now). Not everything could be accomplished this way, but many things we already model (just with difficulty) can be. I wonder if this would be much simplified with something like So let's imagine we live in a world where 0-width rules are permitted (so long as they don't result in infinite loops)... does that make everything here easier to accomplish? And perhaps right there is actually an argument for mandatory chains with optional (as in 0-width) components... |
It's also certain that this is a moving target, ie as we find the right abstractions and get them in the core parser (and they are well understood) that we become more comfortable with grammars doing crazier things (like we're already more comfortable with certain ugly patterns, because they are common patterns). Definitely likely. And it may even be possible that what lands in the core parser isn't that far from your original idea... though I think you'll find key differences/simplifications. Like conceptually MODE_CHAIN([A, [SPC, C], [SPC, E]]) // nested chains Becomes merely: MODE_CHAIN(A, B, C) // just a simple chain (with space concerns appearing just to be part of B & C's rules) And if we fix 0 width rules then something you'd have written as: (all required) // [['\section', spaces], ['[', shortform, ']', spaces], ['{', longform, '}']]
MODE_CHAIN([[A, A2], [B, C, D], [E, F, G]]) Would simplify to: // either (with A expressed with sugar)
FLAT_CHAIN(A,FLAT_CHAIN(B,C,D),FLAT_CHAIN(E,F,G))
// nested chains
FLAT_CHAIN(FLAT_CHAIN(A,A2),FLAT_CHAIN(B,C,D),FLAT_CHAIN(E,F,G)) The key difference being there is nothing magical about nested chains so that could just be flattened easily (because nested arrays have no different semantics than un-nested): FLAT_CHAIN(A, A2, B, C, D, E, F, G) Though if possible you'd probably want to consider DEF being covered by an over-arching regex/rule (if possible), so then we'd be right back to a simple chain: FLAT_CHAIN(A, B, DEF) Just more of a high level overview of how I think I see the optimal abstraction here off the top of my head. |
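The "nothing magical about nesting" point can be stated in one line of JavaScript: if grouping carries no semantics, any nested list of chain items is interchangeable with its flattened form. `flattenChain` is a hypothetical name for this sketch, and strings stand in for modes.

```javascript
// Hypothetical helper: nested arrays are treated as plain grouping,
// so the chain is just the items in left-to-right order.
const flattenChain = (...items) => items.flat(Infinity);

const grouped = flattenChain(["A", "A2"], ["B", "C", "D"], ["E", "F", "G"]);
const plain = flattenChain("A", "A2", "B", "C", "D", "E", "F", "G");
```

Both calls describe the same sequence, which is what makes the grouped form safe to read without knowing how deeply it is nested.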
IE something like this:
Would just collapse to a flat chain with some potential 0 width matchers (whitespace)... and if a single complex element becomes optional then the right way to represent might just be multiple trees, not a single crazy tree trying to do it all. IE: let rule = {
chain: [
{
className: "highlight",
match: /\\section/,
followedBy: optional(WHITESPACE)
},
{
optional: true,
chain: [ /\[/, SHORT_FORM, /\]/ ],
followedBy: optional(WHITESPACE)
},
{
chain: [ /\{/, LONG_FORM, /\}/ ]
},
]
} The proper way to compile this may be [very abstractly]: {
variants: [
{ chain: [ /\\section/, [ /\[/, SHORT_FORM, /\]/ ], [ /\{/, LONG_FORM, /\}/ ] ] },
{ chain: [ /\\section/, [ /\{/, LONG_FORM, /\}/ ] ] },
]
} Internally you'd end up being a tree... That would be easier to debug since at any given branch you could figure out WHICH tree you were inside of rather than staring at a SINGLE massive mode trying to cover all possibilities. And within each chain you're still dealing with just a singular sequence of the same type of chaining. Maybe it's a crazy idea, I dunno. :) It also feels a lot more like real parsing as well. :-/ Yes, that would get gnarly for a long list of optionals, but again to that I'd say: we're not a parser... and at the point we're jumping thru too many hoops we may need to step back and simplify things or move the goal posts. |
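The "compile optionals into variants" idea can be sketched as a small expansion step: every optional item doubles the list of variants, once with the item and once without. `expandOptionals` and the item shape are hypothetical; this is not how highlight.js compiles modes today.

```javascript
// Expands a chain with `optional: true` items into a list of plain chains.
// Variants that include an optional item come before those that skip it.
function expandOptionals(chain) {
  return chain.reduce(
    (variants, item) => {
      const { optional, ...rest } = item;
      const withItem = variants.map((v) => [...v, rest]);
      return optional ? [...withItem, ...variants] : withItem;
    },
    [[]]
  );
}

const variants = expandOptionals([
  { name: "section" },
  { name: "short", optional: true },
  { name: "long" },
]);
```

With n optional items this produces 2^n variants, which is the "gnarly for a long list of optionals" cost mentioned above.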
Just because that was quite a lot to read... if you want to move forward with utmost speed (without waiting or exploring the use of any of the new things in the works) then I'd suggest:
Also, perhaps reconsider: is ALL this complexity really needed for MOST common cases? If there are 2-3 weird contextual cases could those be matched with an EXPLICIT rule and then just have a list of generic rules to handle all the remaining more boring cases? For example, it seems the real "odd duck" here is VERBATIM_DELIMITED_EQUAL... (which could even be On an entirely related/unrelated note... if you had free rein to just write a [small] JS parser to just parse LaTeX however you pleased and then spit out a token list/tree at the end... would that make this problem easier? Harder? |
Would there be a single real mid-size document/manuscript somewhere that used 95% of common LaTeX that I could use to play around? It would really help to have something large to go back to to perhaps answer some of my own questions. The tests aren't very good at this because I feel they have a lot of hidden assumptions baked in. (and often don't represent real content - ie, LaTeX PLUS text content, etc) |
Let's take one simple example of simplifying:
If we don't care about options (we don't color them or give them a class) why do we need to match it at all? Why not just match:
But I assume it's about their submodes (they could contain a control seq, etc) so then why not:
Why MUST the highlighter KNOW that href has a single optional arg? Why can't optional arguments be handled the way you handle whitespace in many places? Consume it if it's there just to get onto the next thing we truly care about... |
Also relevant: #2838 |
Are there any good small, light-weight LaTeX parsers/lexers already written in JS that could do all the tokenization heavy lifting for us, etc? Context: my last comment here highlightjs/highlightjs-rdflang#2 |
Adding for reference since the linked branch above may not live forever. Auto-closing.
|
On Mode Chaining
There are two obvious ways of chaining modes, i.e. putting them together in such a way that they are applied consecutively, with no mode being applied more than once.
starts: {} could have been omitted, but was added because it is present in the simplest implementation.
AC would be completely matched by this chain (but only the A in AXBC).
A in AC would be matched by this chain (as well as in AXBC).
These two types of chain can be combined: The second kind can be contained in the first kind. Motivated by this, I shall call the first kind an outer chain and the second kind an inner chain hereafter. Indicating the desired input syntax, I shall denote the first chain above as [A, B, C] and the second as [[A, B, C]]. A chain containing two (inner) subchains could then be [A, [B, C], [D, E, F]].
Mode Chaining in LaTeX
In (La)TeX, there is often a need for sequential modes. For example, the macro
has an optional argument (in brackets: short form) and a mandatory argument (in braces: long form). Also, spaces and a single linebreak (but not two) are allowed before each argument.
This could easily be matched using a chain:
An easy way to input chains like this would provide a simple way of highlighting every part of such a chain.
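The difference between the two kinds of chain described under "On Mode Chaining" can be mimicked with a toy matcher over literal strings. This is only a sketch of the semantics as stated in this description (outer chains skip missing links, inner chains stop at the first missing link); `runChain` and its `skipMissing` option are hypothetical and unrelated to the actual helper functions in the pull request.

```javascript
// Walks `links` in order from the start of `text`.
// skipMissing = true  -> outer chain: a link that does not match is skipped.
// skipMissing = false -> inner chain: the first missing link ends the chain.
function runChain(links, text, { skipMissing }) {
  let pos = 0;
  const matched = [];
  for (const link of links) {
    if (text.startsWith(link, pos)) {
      matched.push(link);
      pos += link.length;
    } else if (!skipMissing) {
      break;
    }
  }
  return matched.join("");
}

const outer = (text) => runChain(["A", "B", "C"], text, { skipMissing: true });
const inner = (text) => runChain(["A", "B", "C"], text, { skipMissing: false });
```

Under these assumptions `outer` consumes all of `AC` but only the `A` of `AXBC`, while `inner` stops after the `A` in both inputs.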
Changes
I added functions to automatically chain modes in the way outlined above, as well as some utility functions for applying them to highlighting LaTeX. I replaced the verbatim rules, that already used chaining but in a somewhat unclear manner, with equivalent ones using the new chaining functions.
The functions were designed to provide a simple interface, so that the following works as expected:
I believe that this change makes the rules easier to understand and maintain (once the concepts described above are known). It also greatly simplifies adding new rules for highlighting certain elements like sections, environments, document classes or used packages, as was contemplated in #2726.
Checklist
Added markup tests, or they don't apply here because... The behavior of the grammar is not changed by this pull request.
Updated the changelog at CHANGES.md. Should I do this now or when this is finalized? Or are you taking care of this? I read something about constant merge conflicts..
Added myself to AUTHORS.txt, under Contributors. Same question. Should I already have done this last time around?