Update nu to v0.0.5 by andrewthauer · Pull Request #2068 · zed-industries/extensions

andrewthauer · 2025-02-17T16:37:23Z

Includes:

andrewthauer · 2025-02-18T14:13:39Z

Not exactly sure why this is failing CI. My previous update to this extension was essentially the exact same. The only difference is this update has a few more changes to the extensions grammar, etc.

maxdeviant · 2025-02-20T21:11:17Z

Not exactly sure why this is failing CI. My previous update to this extension was essentially the exact same. The only difference is this update has a few more changes to the extensions grammar, etc.

I suspect it is a problem with building the Tree-sitter grammar.

We have had some grammars that take too long to build on our GitHub Actions runners.

maxdeviant · 2025-02-20T21:21:51Z

Not exactly sure why this is failing CI. My previous update to this extension was essentially the exact same. The only difference is this update has a few more changes to the extensions grammar, etc.

I suspect it is a problem with building the Tree-sitter grammar.

We have had some grammars that take too long to build on our GitHub Actions runners.

Even on my local machine (MacBook Pro M3 Max) it seems to get stuck compiling the Nu parser:

2025-02-20T16:17:47.650669-05:00 [INFO] compiling Rust extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:47.729619-05:00 [INFO] compiling Rust crate for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.862076-05:00 [INFO] compiled Rust crate for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.86706-05:00 [INFO] encoding wasm component for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.87095-05:00 [INFO] extension /Users/maxdeviant/projects/zed-nu written to /Users/maxdeviant/projects/zed-nu/extension.wasm
2025-02-20T16:17:56.871224-05:00 [INFO] compiled Rust extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.871285-05:00 [INFO] compiling grammar nu for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.872012-05:00 [INFO] checking out nu parser
2025-02-20T16:17:57.555452-05:00 [INFO] compiling nu parser

I've been sitting here for several minutes and it still hasn't completed.

andrewthauer · 2025-02-21T11:05:57Z

I've been sitting here for several minutes and it still hasn't completed.

Interesting, that explains why I was having trouble testing this. I looked at the logs and assumed it just didn't show when it completed compilng. I couldn't explain why sometimes the language syntax wasn't registered, then later on it showed as registered and finally worked. I guess it just took a long time to build. Is there any way to get more detailed diagnostics as to what is going on under the covers? I searched and searched but can't see any verbose logging options for dev extensions.

fdncred · 2025-02-21T12:27:31Z

Not sure what's going on. On my mac I can easil build tree-sitter-nu in about 10 seconds. 7 seconds for tree-sitter generate and 3 seconds for tree-sitter build.

Maybe @JosephTLyons knows?

maxbrunsfeld · 2025-02-21T17:16:26Z

I remember we had problems building the tree-sitter-nu parser in the past. We made some optimizations to both Tree-sitter's parser generation and the Nu grammar, and it resolved the issue.

It seems that some of the Tree-sitter-nu changes that have happened since then have slowed down the compile again.

fdncred · 2025-02-21T19:48:11Z

Yup, I remember those changes fondly @maxbrunsfeld nushell/tree-sitter-nu#85. That was 100 PRs ago, wow time flies, lol. @blindFS has made quite a few changes recently, maybe they can help get the WASM build performance tolerable?

maxbrunsfeld · 2025-02-21T20:15:40Z

@fdncred I'm looking at the Nu grammar, and I see it does something very odd where it loads this very big file of specific shell commands, and builds them directly in the grammar, as a token called cmd_identifier.

This results in the lexer function having thousands and thousands of states for matching all of these specific strings. I don't think this is good. IMO, this code should be removed. It's not efficient to have a lexer match all of this long specific strings. I don't know why things like rustup toolchain uninstall or gh repo fork should be built into the deepest level of the language syntax.

It looks like this code was introduced in nushell/tree-sitter-nu#161. That PR looks like at added about 1200 lex states. That's not the only reason the Nu lexer is huge, but is one big piece.

maxbrunsfeld · 2025-02-21T20:36:45Z

To investigate, I'd look at the logs since nushell/tree-sitter-nu#85, and see which PRs caused the greatest increase in the number of cases in the generated ts_lex(). It used to be in the 2000s (which is still way larger than most grammars), but now it's > 5700.

I think the fix is going to involve simplifying the tokens by making things more permissive, or by extracting common tokens instead of having many very-similar tokens.

fdncred · 2025-02-21T21:34:24Z

Thanks for the investigation @maxbrunsfeld. The key to the text file was to get tree-sitter-nu to recognize the nushell commands with spaces in them. Each of those commands in the text file is a nushell command with spaces in them. We were having issues where "str substring" wouldn't be recognized properly and then also highlighting was broken where str was one color and substring was another.

We're all for simplifying things but it would be nice be able to recognize commands with spaces in them as one command. That's all we could really figure out.

There was also an attempt here nushell/tree-sitter-nu#160 and also an attempt here nushell/tree-sitter-nu#156 to get the highlighting working right.

maxbrunsfeld · 2025-02-21T23:20:47Z

I think you can probably recognize commands with spaces in them in a generic way, as opposed to hard-coding a large set of specific commands. I'm guessing that Nu's production parser doesn't hard code the commands.

blindFS · 2025-02-22T00:16:21Z

@fdncred I'm looking at the Nu grammar, and I see it does something very odd where it loads this very big file of specific shell commands, and builds them directly in the grammar, as a token called cmd_identifier.

This results in the lexer function having thousands and thousands of states for matching all of these specific strings. I don't think this is good. IMO, this code should be removed. It's not efficient to have a lexer match all of this long specific strings. I don't know why things like rustup toolchain uninstall or gh repo fork should be built into the deepest level of the language syntax.

It looks like this code was introduced in nushell/tree-sitter-nu#161. That PR looks like at added about 1200 lex states. That's not the only reason the Nu lexer is huge, but is one big piece.

That was a questionable PR, and I have tested the performance lost, the parsing speed dropped 7-8% on my machine. And we thought it was an acceptable trade off, so let that in.

As for the command recognition, AFAIK, it's not possible to tell whether foo bar is a single command or a foo command with bar as its unquoted string argument in lexing stage.

Anyway, I generally agree with you that generating a hard coded list of commands as rules is too much hack. Moreover the unquoted rules in the parser is also deeply involved, I once tried to migrate that to the external scanner, but that's a very heavy task and I gave up. You probably know some tricks to simplify that piece too.

maxbrunsfeld · 2025-02-22T00:34:54Z

I guess I don't understand the desired behavior. cargo build is a command and an argument, no different from my-cargo my-build. Why would you want to treat it specially?

fdncred · 2025-02-22T00:37:10Z

I guess I don't understand the desired behavior. cargo build is a command and an argument, no different from my-cargo my-build. Why would you want to treat it specially?

Because the command name is str substring. substring isn't a parameter/argument of str.

maxbrunsfeld · 2025-02-22T00:42:33Z

This shouldn't be happening in the lexer/parser. There can be arbitrarily many commands, that take arbitrary subcommands. They should all just be parsed as commands with arguments.

In particular, str is just a command that happens to take subcommands normally. Here are the docs for str

https://www.nushell.sh/commands/docs/str.html

Notes

You must use one of the following subcommands. Using this command as-is will only produce this help message.

fdncred · 2025-02-22T00:58:05Z

At the end of the day, I don't object to what we call it, cmd + args, cmd only or whatever. However, I do care about how it works.

I want to be able to click on str or substring and be able to get help on the "str substring" command.
If it's a custom command, I'd want to be able click on either part and be able to go to the definition.
I also want the syntax highlighting to match nushell as much as possible, not the colors necessarily but the types, and color the command properly like it is in nushell.

As an example, you can see the highlighting in the repl. Below you see the command in cyan, the argument range in magenta and the flags in blue. Again, I don't care about colors because that can be changed but you'll notice that the command is all the same color which is differently from the other parameters.

In particular, str is just a command that happens to take subcommands normally. Here are the docs for str

That's kind of a facade to help users understand better. str is a helper command that lists all the commands under the str prefix, but the commands are named like str substring. There really is no such thing as a subcommand, even though we talk about it that way. You can see that here https://github.com/nushell/nushell/blob/main/crates/nu-command/src/strings/str_/substring.rs#L26

BTW - I'm not arguing about things, just trying to help understand how nushell works. I'm up for changing whatever we need to in order to make things work better.

maxbrunsfeld · 2025-02-22T01:28:27Z

Yeah, all of that can be done at a higher level than the parser, and that would better match what nushell is doing. Their parser does not have a hard coded list of commands like this, from what I can tell reading the code.

I think we should look for ‘str replace’ and friends in the highlight query.

the go-to-def is gonna be implemented in the language server, so I don’t think it’s affected by this. If it were, I think we could add semantic hand-written logic that treats certain known built-in commands specially.

blindFS · 2025-02-22T02:09:14Z

To investigate, I'd look at the logs since nushell/tree-sitter-nu#85, and see which PRs caused the greatest increase in the number of cases in the generated ts_lex(). It used to be in the 2000s (which is still way larger than most grammars), but now it's > 5700.

I think the fix is going to involve simplifying the tokens by making things more permissive, or by extracting common tokens instead of having many very-similar tokens.

I just checked nushell/tree-sitter-nu#85 to learn how to optimize the rules. Seems to me the key is to reduce conflicts?

Then this PR: nushell/tree-sitter-nu#139 introduced many new conflicts regarding common newline patterns. It might have something to do with the slow down of WASM compilation.

It would be super helpful if you take some time to demonstrate how optimization could be done on current tree-sitter-nu (other than the hard-coded command list). And I'll try to clean the rest up if it's not too hard.

In case we might want to revert #161 as [discussed](zed-industries/extensions#2068 (comment))

maxbrunsfeld · 2025-02-23T01:19:55Z

@blindFS I think that the biggest issue for using Tree-sitter-nu in Zed is the time it takes to compile to WASM. For whatever reason, Clang's wasm backend takes much longer to compile the ts_lex function when there are a large number of states. So the most important thing is to reduce the state count (number of case statements) in that function. Unfortunately, the number is not reported right at the top of the file. It's actually a bit different from reducing the parse state count.

In order to reduce the lexer state count, the biggest opportunities I saw when I last looked were removing the many different copies of the unquoted rules and some of the multiple different identifier rules.

In particular:

can any of the different identifier rules share a common token that matches the actual word part, and then match the differing pieces (optional ? ! etc) as a separate token?
can we consolidate some of the unquoted rules (it looks like there are a bunch) into one?

blindFS · 2025-02-23T01:47:54Z

@blindFS I think that the biggest issue for using Tree-sitter-nu in Zed is the time it takes to compile to WASM. For whatever reason, Clang's wasm backend takes much longer to compile the ts_lex function when there are a large number of states. So the most important thing is to reduce the state count (number of case statements) in that function. Unfortunately, the number is not reported right at the top of the file. It's actually a bit different from reducing the parse state count.

Is there any way to check the actual state count that directly affects the WASM compilation time? Or is directly testing build -wasm easier?

In order to reduce the lexer state count, the biggest opportunities I saw when I last looked were removing the many different copies of the unquoted rules and some of the multiple different identifier rules.

In particular:

can any of the different identifier rules share a common token that matches the actual word part, and then match the differing pieces (optional ? ! etc) as a separate token?

can we consolidate some of the unquoted rules (it looks like there are a bunch) into one?

There are nuances of sets of allowed characters in them. For example, : is allowed in unquoted_in_list but not in unquoted_in_record. We can make them all fall back to the most restricted version, but that will (theoretically) diverge from the behavior of the official nu parser.

I'll try to test it on real-world nu scripts to see whether keeping those nuances is worth the cost.

maxbrunsfeld · 2025-02-24T20:04:20Z

I think it is possible to unify these rules without introducing any parse errors in valid code, but I think it would take some experimentation.

blindFS · 2025-02-24T21:56:01Z

I think it is possible to unify these rules without introducing any parse errors in valid code, but I think it would take some experimentation.

That is at the end of my wits. Would you please lend a hand?

notpeter · 2025-02-27T18:24:26Z

I'm going to go ahead and close this PR as it sounds like there is some improvements to tree-sitter-nu which need to happen as a pre-requisite. Feel free to open a subsequent PR in the future when it's ready to go.

As always, thanks @maxbrunsfeld for your help with tree-sitter internals. Learning every day!

maxbrunsfeld · 2025-02-27T19:31:12Z

Sorry @blindFS , it might be a little while before I can hack on Tree-sitter-nu, just because there is Zed stuff that we're launching that I need to focus on.

For now, all I'd suggest is trying to make the grammar simpler by making it less restrictive. For example, maybe you don't need to explicitly forbid : in unquoted_in_records: you just need to make sure that when : is used as a key-value separator, you parse it correctly. It seems like all you would need is lexical precedence: put a low precedence in unquoted. But I don't know for sure.

Is there any way to check the actual state count that directly affects the WASM compilation time?

Yeah, it's the number of case statements in ts_lex. That's what we need to reduce. Right now, Tree-sitter-nu has over 5000, compared to 194 in Tree-sitter-rust. But calling build --wasm is the ultimate test. That's basically what Zed is running in our extension CI pipeline.

blindFS · 2025-02-28T00:50:47Z

Sorry @blindFS , it might be a little while before I can hack on Tree-sitter-nu, just because there is Zed stuff that we're launching that I need to focus on.

Understood, I'll see what I can do before you have the leisure.

For now, all I'd suggest is trying to make the grammar simpler by making it less restrictive. For example, maybe you don't need to explicitly forbid : in unquoted_in_records: you just need to make sure that when : is used as a key-value separator, you parse it correctly. It seems like all you would need is lexical precedence: put a low precedence in unquoted. But I don't know for sure.

Someone already put lowest lexical precedence to the unquoted rules. But when it comes to, e.g. immediate : following some ascii text, it still prefer unquoted over (identifier . ":"). I think this is a blurred area of mixed lexical precedence and parse precedence.

Is there any way to check the actual state count that directly affects the WASM compilation time?

Yeah, it's the number of case statements in ts_lex. That's what we need to reduce. Right now, Tree-sitter-nu has over 5000, compared to 194 in Tree-sitter-rust. But calling build --wasm is the ultimate test. That's basically what Zed is running in our extension CI pipeline.

There's no way for the nu parser to get any close to the size of tree-sitter-rust, considering how dynamic and ambiguous its syntax is. A more practical goal would be on-par compiling time of tree-sitter-bash (900+ cases), still a long way to go...

maxbrunsfeld · 2025-02-28T17:15:15Z

Yeah, an external scanner is an option too, although I wouldn't think it would be needed, just to match unquoted arguments.

fdncred · 2025-03-01T12:23:59Z

The latest changes that @blindFS has made, wasm is building in 2min 7sec, down from 6min 49sec originally. Is 2min fast enough or still too slow to pass this CI?

maxdeviant · 2025-03-01T13:10:06Z

The latest changes that @blindFS has made, wasm is building in 2min 7sec, down from 6min 49sec originally. Is 2min fast enough or still too slow to pass this CI?

Try it out in CI and we can see.

fdncred · 2025-03-01T15:13:00Z

#2157

maxdeviant · 2025-03-01T15:53:39Z

#2157

Looks like it compiled!

fdncred · 2025-03-01T16:03:21Z

yay! all thanks to @blindFS

cla-bot bot added the cla-signed label Feb 17, 2025

chore: bump nu extension to v0.0.5

0a705ca

andrewthauer force-pushed the bump-nu branch from a466ae5 to 0a705ca Compare February 17, 2025 16:38

andrewthauer changed the title ~~chore(nu): bump to v0.0.5~~ chore: bump nu extension to v0.0.5 Feb 17, 2025

notpeter changed the title ~~chore: bump nu extension to v0.0.5~~ Update nu to v0.0.5 Feb 17, 2025

notpeter and others added 2 commits February 20, 2025 10:36

Merge branch 'main' into bump-nu

e8749c1

Merge branch 'main' into pr/2068

2599c86

andrewthauer mentioned this pull request Feb 21, 2025

chore: update treesitter to latest zed-extensions/nu#7

Merged

blindFS mentioned this pull request Feb 22, 2025

revert: internal commands with spaces nushell/tree-sitter-nu#184

Merged

blindFS mentioned this pull request Feb 22, 2025

Tracking issue: performance optimization nushell/tree-sitter-nu#185

Open

fdncred pushed a commit to nushell/tree-sitter-nu that referenced this pull request Feb 22, 2025

revert: internal commands with spaces (#184)

ba6db53

In case we might want to revert #161 as [discussed](zed-industries/extensions#2068 (comment))

notpeter closed this Feb 27, 2025

Conversation

andrewthauer commented Feb 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrewthauer commented Feb 18, 2025

Uh oh!

maxdeviant commented Feb 20, 2025

Uh oh!

maxdeviant commented Feb 20, 2025

Uh oh!

andrewthauer commented Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fdncred commented Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxbrunsfeld commented Feb 21, 2025

Uh oh!

fdncred commented Feb 21, 2025

Uh oh!

maxbrunsfeld commented Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxbrunsfeld commented Feb 21, 2025

Uh oh!

fdncred commented Feb 21, 2025

Uh oh!

maxbrunsfeld commented Feb 21, 2025

Uh oh!

blindFS commented Feb 22, 2025

Uh oh!

maxbrunsfeld commented Feb 22, 2025

Uh oh!

fdncred commented Feb 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxbrunsfeld commented Feb 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fdncred commented Feb 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxbrunsfeld commented Feb 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

blindFS commented Feb 22, 2025

Uh oh!

maxbrunsfeld commented Feb 23, 2025

Uh oh!

blindFS commented Feb 23, 2025

Uh oh!

maxbrunsfeld commented Feb 24, 2025

Uh oh!

blindFS commented Feb 24, 2025

Uh oh!

notpeter commented Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxbrunsfeld commented Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

blindFS commented Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxbrunsfeld commented Feb 28, 2025

Uh oh!

fdncred commented Mar 1, 2025

Uh oh!

maxdeviant commented Mar 1, 2025

Uh oh!

fdncred commented Mar 1, 2025

Uh oh!

maxdeviant commented Mar 1, 2025

Uh oh!

fdncred commented Mar 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

andrewthauer commented Feb 17, 2025 •

edited

Loading

andrewthauer commented Feb 21, 2025 •

edited

Loading

fdncred commented Feb 21, 2025 •

edited

Loading

maxbrunsfeld commented Feb 21, 2025 •

edited

Loading

fdncred commented Feb 22, 2025 •

edited

Loading

maxbrunsfeld commented Feb 22, 2025 •

edited

Loading

fdncred commented Feb 22, 2025 •

edited

Loading

maxbrunsfeld commented Feb 22, 2025 •

edited

Loading

notpeter commented Feb 27, 2025 •

edited

Loading

maxbrunsfeld commented Feb 27, 2025 •

edited

Loading

blindFS commented Feb 28, 2025 •

edited

Loading