Description
Context: Shiki offers a JavaScript regex engine that includes highly accurate Oniguruma emulation, and supports the vast majority of TextMate grammars. The JS engine allows lightweight use of TM grammars without requiring a large Oniguruma WASM binary. This is especially beneficial in browsers, for better load/start time. The JS engine is also faster to execute some grammars since the regexes run with native JS performance.
In the rare cases where a grammar (like C#) isn't supported, it's due to one of the following reasons:
- Bugs in the TM grammar (invalid Oniguruma regexes that fail silently in
vscode-textmate
, but the error is not ignored by the JS engine). - Bugs in Oniguruma that the JS engine doesn't reproduce (leading to the JS engine rather than Oniguruma providing the intended highlighting).
- Bugs in
oniguruma-to-es
which I maintain and is used under the hood by the JS engine. A lot of work has gone into making the emulation extremely accurate, so this is rare but possible. - Use of a small number of rare regex features that are not yet emulated by
oniguruma-to-es
, or are not possible/feasible to emulate.
The C# grammar falls under number 4, due to one regex using multiple overlapping recursions. If this regex in the C# grammar was refactored, the grammar would be supported (as would the Razor grammar, since Razor's one erroring regex comes from embedding the C# grammar).
The problematic regex
Here's the regex in question.
And here's the problematic portion:
<
(?<type_args>
[^<>()]++|
<\g<type_args>*+>|
\(\g<type_args>*+\)
)*+
>
Potential solution
Perhaps this can be refactored to simply establish the boundaries of the type args, balancing the angled brackets but not the parentheses, like so:
<
(?<type_args>
[^<>]++|
<\g<type_args>*+>
)*+
>
Then, optionally, a subsequent regex could be used to check for and highlight the balanced parentheses inside.
Alternatively, a depth limit could be imposed for either parens or brackets. For example:
<
(?<type_args>
[^<>()] |
\( (?: [^()<>] | < [^()<>]* > | \( [^()<>]* \) )* \) |
< \g<type_args>* >
)*
>
The above is very similar to the original in what it matches. It doesn't allow anything unbalanced, and it allows any depth of nested brackets (well, actually a depth limit of 20 since that's Oniguruma's recursion limit). But whenever it encounters a new set of parens, it only allows one level of nested parens or brackets inside it. So, for example, it continues to accurately match e.g. <>
, <a>
, <a<>>
, <<a>>
, <()>
, <(())>
, <a(a)>
, <(<>)>
, <<>()>
, and <(())<<<<(<>())>>>>>
, and it continues to accurately not match e.g. <a(>
, <)>
, <a(><)>
, <a<)>>
, <<(<<<)>>
, <<(>>
, <<)>>
, and <<(>)>
. The difference is that it doesn't match <((()))>
, though the current regex does. However, it could be edited to allow more levels of nesting within parens (I'd be happy to do so if that helps!) -- it would just be a bit verbose.
Overlapping recursions
The reason oniguruma-to-es
doesn't allow emulating overlapping recursions is because it can get unmanageable very fast. Oniguruma uses a depth limit of 20, and oniguruma-to-es
uses 20 as the default/max limit. Expanding a regex pattern in JS RegExp
source with up to 20×20 expansions (for a basic case of overlapping recursion) is maybe on the edge of manageable, but as soon as you do something like (20×20)×(20×20) it seems like a very bad idea. In this case, the C# grammar is doing something that looks so simple 😊 but I believe would be 20 factorial times 2 (20!×2) expansions, which of course is unworkable.
What do you think? Are you open to refactoring/removing the overlapping recursion in the grammar?
Although this would currently only add support for your grammar to Shiki's JS engine, the JS engine (which is new) has seen significant adoption as it has continued to improve to its current robust state. I also expect more libraries that work with TM grammars to adopt oniguruma-to-es
in the future. Removing this one case of overlapping recursion would mean that the C# and Razor grammars are supported by all of them.