Skip to content

Update nu to v0.0.5#2068

Closed
andrewthauer wants to merge 3 commits intozed-industries:mainfrom
andrewthauer:bump-nu
Closed

Update nu to v0.0.5#2068
andrewthauer wants to merge 3 commits intozed-industries:mainfrom
andrewthauer:bump-nu

Conversation

@andrewthauer
Copy link
Copy Markdown
Contributor

@andrewthauer andrewthauer commented Feb 17, 2025

@cla-bot cla-bot bot added the cla-signed label Feb 17, 2025
@andrewthauer andrewthauer changed the title chore(nu): bump to v0.0.5 chore: bump nu extension to v0.0.5 Feb 17, 2025
@notpeter notpeter changed the title chore: bump nu extension to v0.0.5 Update nu to v0.0.5 Feb 17, 2025
@andrewthauer
Copy link
Copy Markdown
Contributor Author

Not exactly sure why this is failing CI. My previous update to this extension was essentially the exact same. The only difference is this update has a few more changes to the extensions grammar, etc.

@maxdeviant
Copy link
Copy Markdown
Member

Not exactly sure why this is failing CI. My previous update to this extension was essentially the exact same. The only difference is this update has a few more changes to the extensions grammar, etc.

I suspect it is a problem with building the Tree-sitter grammar.

We have had some grammars that take too long to build on our GitHub Actions runners.

@maxdeviant
Copy link
Copy Markdown
Member

Not exactly sure why this is failing CI. My previous update to this extension was essentially the exact same. The only difference is this update has a few more changes to the extensions grammar, etc.

I suspect it is a problem with building the Tree-sitter grammar.

We have had some grammars that take too long to build on our GitHub Actions runners.

Even on my local machine (MacBook Pro M3 Max) it seems to get stuck compiling the Nu parser:

2025-02-20T16:17:47.650669-05:00 [INFO] compiling Rust extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:47.729619-05:00 [INFO] compiling Rust crate for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.862076-05:00 [INFO] compiled Rust crate for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.86706-05:00 [INFO] encoding wasm component for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.87095-05:00 [INFO] extension /Users/maxdeviant/projects/zed-nu written to /Users/maxdeviant/projects/zed-nu/extension.wasm
2025-02-20T16:17:56.871224-05:00 [INFO] compiled Rust extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.871285-05:00 [INFO] compiling grammar nu for extension /Users/maxdeviant/projects/zed-nu
2025-02-20T16:17:56.872012-05:00 [INFO] checking out nu parser
2025-02-20T16:17:57.555452-05:00 [INFO] compiling nu parser

I've been sitting here for several minutes and it still hasn't completed.

@andrewthauer
Copy link
Copy Markdown
Contributor Author

andrewthauer commented Feb 21, 2025

I've been sitting here for several minutes and it still hasn't completed.

Interesting, that explains why I was having trouble testing this. I looked at the logs and assumed it just didn't show when it completed compilng. I couldn't explain why sometimes the language syntax wasn't registered, then later on it showed as registered and finally worked. I guess it just took a long time to build. Is there any way to get more detailed diagnostics as to what is going on under the covers? I searched and searched but can't see any verbose logging options for dev extensions.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Feb 21, 2025

Not sure what's going on. On my mac I can easil build tree-sitter-nu in about 10 seconds. 7 seconds for tree-sitter generate and 3 seconds for tree-sitter build.

Maybe @JosephTLyons knows?

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

I remember we had problems building the tree-sitter-nu parser in the past. We made some optimizations to both Tree-sitter's parser generation and the Nu grammar, and it resolved the issue.

It seems that some of the Tree-sitter-nu changes that have happened since then have slowed down the compile again.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Feb 21, 2025

Yup, I remember those changes fondly @maxbrunsfeld nushell/tree-sitter-nu#85. That was 100 PRs ago, wow time flies, lol. @blindFS has made quite a few changes recently, maybe they can help get the WASM build performance tolerable?

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

maxbrunsfeld commented Feb 21, 2025

@fdncred I'm looking at the Nu grammar, and I see it does something very odd where it loads this very big file of specific shell commands, and builds them directly in the grammar, as a token called cmd_identifier.

This results in the lexer function having thousands and thousands of states for matching all of these specific strings. I don't think this is good. IMO, this code should be removed. It's not efficient to have a lexer match all of this long specific strings. I don't know why things like rustup toolchain uninstall or gh repo fork should be built into the deepest level of the language syntax.

It looks like this code was introduced in nushell/tree-sitter-nu#161. That PR looks like at added about 1200 lex states. That's not the only reason the Nu lexer is huge, but is one big piece.

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

To investigate, I'd look at the logs since nushell/tree-sitter-nu#85, and see which PRs caused the greatest increase in the number of cases in the generated ts_lex(). It used to be in the 2000s (which is still way larger than most grammars), but now it's > 5700.

I think the fix is going to involve simplifying the tokens by making things more permissive, or by extracting common tokens instead of having many very-similar tokens.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Feb 21, 2025

Thanks for the investigation @maxbrunsfeld. The key to the text file was to get tree-sitter-nu to recognize the nushell commands with spaces in them. Each of those commands in the text file is a nushell command with spaces in them. We were having issues where "str substring" wouldn't be recognized properly and then also highlighting was broken where str was one color and substring was another.

We're all for simplifying things but it would be nice be able to recognize commands with spaces in them as one command. That's all we could really figure out.

There was also an attempt here nushell/tree-sitter-nu#160 and also an attempt here nushell/tree-sitter-nu#156 to get the highlighting working right.

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

I think you can probably recognize commands with spaces in them in a generic way, as opposed to hard-coding a large set of specific commands. I'm guessing that Nu's production parser doesn't hard code the commands.

@blindFS
Copy link
Copy Markdown

blindFS commented Feb 22, 2025

@fdncred I'm looking at the Nu grammar, and I see it does something very odd where it loads this very big file of specific shell commands, and builds them directly in the grammar, as a token called cmd_identifier.

This results in the lexer function having thousands and thousands of states for matching all of these specific strings. I don't think this is good. IMO, this code should be removed. It's not efficient to have a lexer match all of this long specific strings. I don't know why things like rustup toolchain uninstall or gh repo fork should be built into the deepest level of the language syntax.

It looks like this code was introduced in nushell/tree-sitter-nu#161. That PR looks like at added about 1200 lex states. That's not the only reason the Nu lexer is huge, but is one big piece.

That was a questionable PR, and I have tested the performance lost, the parsing speed dropped 7-8% on my machine. And we thought it was an acceptable trade off, so let that in.

As for the command recognition, AFAIK, it's not possible to tell whether foo bar is a single command or a foo command with bar as its unquoted string argument in lexing stage.

Anyway, I generally agree with you that generating a hard coded list of commands as rules is too much hack. Moreover the unquoted rules in the parser is also deeply involved, I once tried to migrate that to the external scanner, but that's a very heavy task and I gave up. You probably know some tricks to simplify that piece too.

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

I guess I don't understand the desired behavior. cargo build is a command and an argument, no different from my-cargo my-build. Why would you want to treat it specially?

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Feb 22, 2025

I guess I don't understand the desired behavior. cargo build is a command and an argument, no different from my-cargo my-build. Why would you want to treat it specially?

Because the command name is str substring. substring isn't a parameter/argument of str.

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

maxbrunsfeld commented Feb 22, 2025

This shouldn't be happening in the lexer/parser. There can be arbitrarily many commands, that take arbitrary subcommands. They should all just be parsed as commands with arguments.

In particular, str is just a command that happens to take subcommands normally. Here are the docs for str

https://www.nushell.sh/commands/docs/str.html

Notes

You must use one of the following subcommands. Using this command as-is will only produce this help message.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Feb 22, 2025

At the end of the day, I don't object to what we call it, cmd + args, cmd only or whatever. However, I do care about how it works.

  1. I want to be able to click on str or substring and be able to get help on the "str substring" command.
  2. If it's a custom command, I'd want to be able click on either part and be able to go to the definition.
  3. I also want the syntax highlighting to match nushell as much as possible, not the colors necessarily but the types, and color the command properly like it is in nushell.

As an example, you can see the highlighting in the repl. Below you see the command in cyan, the argument range in magenta and the flags in blue. Again, I don't care about colors because that can be changed but you'll notice that the command is all the same color which is differently from the other parameters.
image

In particular, str is just a command that happens to take subcommands normally. Here are the docs for str

That's kind of a facade to help users understand better. str is a helper command that lists all the commands under the str prefix, but the commands are named like str substring. There really is no such thing as a subcommand, even though we talk about it that way. You can see that here https://github.com/nushell/nushell/blob/main/crates/nu-command/src/strings/str_/substring.rs#L26

BTW - I'm not arguing about things, just trying to help understand how nushell works. I'm up for changing whatever we need to in order to make things work better.

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

maxbrunsfeld commented Feb 22, 2025

Yeah, all of that can be done at a higher level than the parser, and that would better match what nushell is doing. Their parser does not have a hard coded list of commands like this, from what I can tell reading the code.

I think we should look for ‘str replace’ and friends in the highlight query.

the go-to-def is gonna be implemented in the language server, so I don’t think it’s affected by this. If it were, I think we could add semantic hand-written logic that treats certain known built-in commands specially.

@blindFS
Copy link
Copy Markdown

blindFS commented Feb 22, 2025

To investigate, I'd look at the logs since nushell/tree-sitter-nu#85, and see which PRs caused the greatest increase in the number of cases in the generated ts_lex(). It used to be in the 2000s (which is still way larger than most grammars), but now it's > 5700.

I think the fix is going to involve simplifying the tokens by making things more permissive, or by extracting common tokens instead of having many very-similar tokens.

I just checked nushell/tree-sitter-nu#85 to learn how to optimize the rules. Seems to me the key is to reduce conflicts?

Then this PR: nushell/tree-sitter-nu#139 introduced many new conflicts regarding common newline patterns. It might have something to do with the slow down of WASM compilation.

It would be super helpful if you take some time to demonstrate how optimization could be done on current tree-sitter-nu (other than the hard-coded command list). And I'll try to clean the rest up if it's not too hard.

fdncred pushed a commit to nushell/tree-sitter-nu that referenced this pull request Feb 22, 2025
@maxbrunsfeld
Copy link
Copy Markdown
Contributor

@blindFS I think that the biggest issue for using Tree-sitter-nu in Zed is the time it takes to compile to WASM. For whatever reason, Clang's wasm backend takes much longer to compile the ts_lex function when there are a large number of states. So the most important thing is to reduce the state count (number of case statements) in that function. Unfortunately, the number is not reported right at the top of the file. It's actually a bit different from reducing the parse state count.

In order to reduce the lexer state count, the biggest opportunities I saw when I last looked were removing the many different copies of the unquoted rules and some of the multiple different identifier rules.

In particular:

  • can any of the different identifier rules share a common token that matches the actual word part, and then match the differing pieces (optional ? ! etc) as a separate token?
  • can we consolidate some of the unquoted rules (it looks like there are a bunch) into one?

@blindFS
Copy link
Copy Markdown

blindFS commented Feb 23, 2025

@blindFS I think that the biggest issue for using Tree-sitter-nu in Zed is the time it takes to compile to WASM. For whatever reason, Clang's wasm backend takes much longer to compile the ts_lex function when there are a large number of states. So the most important thing is to reduce the state count (number of case statements) in that function. Unfortunately, the number is not reported right at the top of the file. It's actually a bit different from reducing the parse state count.

Is there any way to check the actual state count that directly affects the WASM compilation time? Or is directly testing build -wasm easier?

In order to reduce the lexer state count, the biggest opportunities I saw when I last looked were removing the many different copies of the unquoted rules and some of the multiple different identifier rules.

In particular:

  • can any of the different identifier rules share a common token that matches the actual word part, and then match the differing pieces (optional ? ! etc) as a separate token?
  • can we consolidate some of the unquoted rules (it looks like there are a bunch) into one?

There are nuances of sets of allowed characters in them. For example, : is allowed in unquoted_in_list but not in unquoted_in_record. We can make them all fall back to the most restricted version, but that will (theoretically) diverge from the behavior of the official nu parser.

I'll try to test it on real-world nu scripts to see whether keeping those nuances is worth the cost.

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

I think it is possible to unify these rules without introducing any parse errors in valid code, but I think it would take some experimentation.

@blindFS
Copy link
Copy Markdown

blindFS commented Feb 24, 2025

I think it is possible to unify these rules without introducing any parse errors in valid code, but I think it would take some experimentation.

That is at the end of my wits. Would you please lend a hand?

@notpeter
Copy link
Copy Markdown
Contributor

notpeter commented Feb 27, 2025

I'm going to go ahead and close this PR as it sounds like there is some improvements to tree-sitter-nu which need to happen as a pre-requisite. Feel free to open a subsequent PR in the future when it's ready to go.

As always, thanks @maxbrunsfeld for your help with tree-sitter internals. Learning every day!

@notpeter notpeter closed this Feb 27, 2025
@maxbrunsfeld
Copy link
Copy Markdown
Contributor

maxbrunsfeld commented Feb 27, 2025

Sorry @blindFS , it might be a little while before I can hack on Tree-sitter-nu, just because there is Zed stuff that we're launching that I need to focus on.

For now, all I'd suggest is trying to make the grammar simpler by making it less restrictive. For example, maybe you don't need to explicitly forbid : in unquoted_in_records: you just need to make sure that when : is used as a key-value separator, you parse it correctly. It seems like all you would need is lexical precedence: put a low precedence in unquoted. But I don't know for sure.

Is there any way to check the actual state count that directly affects the WASM compilation time?

Yeah, it's the number of case statements in ts_lex. That's what we need to reduce. Right now, Tree-sitter-nu has over 5000, compared to 194 in Tree-sitter-rust. But calling build --wasm is the ultimate test. That's basically what Zed is running in our extension CI pipeline.

@blindFS
Copy link
Copy Markdown

blindFS commented Feb 28, 2025

Sorry @blindFS , it might be a little while before I can hack on Tree-sitter-nu, just because there is Zed stuff that we're launching that I need to focus on.

Understood, I'll see what I can do before you have the leisure.

For now, all I'd suggest is trying to make the grammar simpler by making it less restrictive. For example, maybe you don't need to explicitly forbid : in unquoted_in_records: you just need to make sure that when : is used as a key-value separator, you parse it correctly. It seems like all you would need is lexical precedence: put a low precedence in unquoted. But I don't know for sure.

Someone already put lowest lexical precedence to the unquoted rules. But when it comes to, e.g. immediate : following some ascii text, it still prefer unquoted over (identifier . ":"). I think this is a blurred area of mixed lexical precedence and parse precedence.

Is there any way to check the actual state count that directly affects the WASM compilation time?

Yeah, it's the number of case statements in ts_lex. That's what we need to reduce. Right now, Tree-sitter-nu has over 5000, compared to 194 in Tree-sitter-rust. But calling build --wasm is the ultimate test. That's basically what Zed is running in our extension CI pipeline.

There's no way for the nu parser to get any close to the size of tree-sitter-rust, considering how dynamic and ambiguous its syntax is. A more practical goal would be on-par compiling time of tree-sitter-bash (900+ cases), still a long way to go...

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

Yeah, an external scanner is an option too, although I wouldn't think it would be needed, just to match unquoted arguments.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Mar 1, 2025

The latest changes that @blindFS has made, wasm is building in 2min 7sec, down from 6min 49sec originally. Is 2min fast enough or still too slow to pass this CI?

@maxdeviant
Copy link
Copy Markdown
Member

The latest changes that @blindFS has made, wasm is building in 2min 7sec, down from 6min 49sec originally. Is 2min fast enough or still too slow to pass this CI?

Try it out in CI and we can see.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Mar 1, 2025

#2157

@maxdeviant
Copy link
Copy Markdown
Member

#2157

Looks like it compiled!

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Mar 1, 2025

yay! all thanks to @blindFS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants