Separate token output from internal nodes in ParseStream #19
Merged
This data rearrangement gives a cleaner separation between tokens (which keep track of bytes in the source text) and internal tree nodes (which keep track of which tokens they cover). As a result, it reduces the size of the intermediate data structures. As part of rewriting `build_tree` to use the new data structures, it has also become much faster, and building the green tree no longer dominates the parsing time (probably due to fixing some type stability issues).
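To make the split concrete, here is a minimal sketch. It assumes each token records only the byte just past its end (`next_byte`) and each internal node records the first and last token it covers. `Tok`, `NodeRange` and `byte_range` are hypothetical names chosen for illustration, not the JuliaSyntax definitions.

```julia
# Hypothetical, simplified stand-ins for illustration; not the JuliaSyntax types.
struct Tok
    kind::Symbol
    next_byte::Int      # index of the first byte after this token
end

struct NodeRange
    kind::Symbol
    first_token::Int    # index of the first covered token
    last_token::Int     # index of the last covered token
end

# Byte range covered by an internal node, derived purely from the tokens.
function byte_range(tokens::Vector{Tok}, r::NodeRange)
    first_byte = r.first_token == 1 ? 1 : tokens[r.first_token - 1].next_byte
    last_byte = tokens[r.last_token].next_byte - 1
    return first_byte:last_byte
end

# Source text "a+b*c": five one-byte tokens.
tokens = [Tok(:Identifier, 2), Tok(:+, 3), Tok(:Identifier, 4),
          Tok(:*, 5), Tok(:Identifier, 6)]
# Internal nodes in post-order: `b*c` first, then the whole expression.
ranges = [NodeRange(:call, 3, 5), NodeRange(:call, 1, 5)]
```

In this sketch internal nodes never store byte offsets themselves. `byte_range(tokens, ranges[1])` looks them up through the token list on demand.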
Keno added a commit that referenced this pull request on Jun 9, 2025
## Background

I've written about 5 parsers that use the general red tree/green tree pattern. Now that we're using JuliaSyntax in base, I'd like to replace some of them with a version based on JuliaSyntax, so that I can avoid having to maintain multiple copies of similar infrastructure. As a result, I'm taking a close look at some of the internals of JuliaSyntax.

## Current Design

One thing that I really like about JuliaSyntax is that the parser basically produces a flat output buffer (well, two in the current design, after #19). In essence, the output is a post-order depth-first traversal of the parse tree, with each node annotated with the range it covers. From there, it is possible to recover the parse tree without re-parsing by partitioning the token list according to the ranges of the non-terminals. One particular application of this is to rebuild a pointer-y green tree structure that stores relative byte ranges and serves the same incremental parsing purpose as green tree representations in other systems.

The single-output-buffer design is a great innovation over the pointer-y system. It's much easier to handle, and it also enforces important invariants by construction (or at least makes them easy to check). However, I think the whole post-parse tree construction logic significantly reduces its value. In particular, green trees are supposed to be able to serve as compact, persistent representations of the parse tree. Here, however, the compact, persistent representation (the output memory buffer) is not usable as a green tree. We do have the pointer-y `GreenNode` tree, but it has all the same downsides that the single-buffer system was supposed to avoid. It uses explicit vectors in every node, and even constructing it from the parser output allocates a nontrivial amount of memory to recover the tree structure.
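The tree recovery described under Current Design can be sketched roughly as follows. This is a simplified illustration, not the actual `build_tree`: `NodeRange`, `TreeNode` and `rebuild_tree` are hypothetical names, tokens are reduced to their kinds, and the sketch assumes that the last range covers every token and that no range is empty. The point is that a temporary stack and per-node child vectors are needed to get the tree back.

```julia
# Hypothetical, simplified stand-ins for illustration; not the JuliaSyntax types.
struct NodeRange
    kind::Symbol
    first_token::Int
    last_token::Int
end

struct TreeNode
    kind::Symbol
    first_token::Int
    children::Vector{TreeNode}
end

# `ranges` must be in post-order (children before parents).
function rebuild_tree(token_kinds::Vector{Symbol}, ranges::Vector{NodeRange})
    stack = TreeNode[]          # temporary stack, allocated just for recovery
    next_tok = 1
    for r in ranges
        # Push leaves for every token up to the end of this range.
        while next_tok <= r.last_token
            push!(stack, TreeNode(token_kinds[next_tok], next_tok, TreeNode[]))
            next_tok += 1
        end
        # Trailing stack entries that start inside the range are its children.
        i = length(stack)
        while i >= 1 && stack[i].first_token >= r.first_token
            i -= 1
        end
        children = stack[i+1:end]   # copies the children into a new vector
        resize!(stack, i)
        push!(stack, TreeNode(r.kind, r.first_token, children))
    end
    return only(stack)
end

# Tokens of "a+b*c" and its two internal nodes in post-order.
token_kinds = [:Identifier, :+, :Identifier, :*, :Identifier]
ranges = [NodeRange(:call, 3, 5), NodeRange(:call, 1, 5)]
root = rebuild_tree(token_kinds, ranges)
```

Every internal node ends up owning a freshly allocated `children` vector, which is the kind of allocation traffic the flat buffer was meant to avoid.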
## Proposed design

This PR proposes to change the parser output to be directly usable as a green tree in situ by changing the post-order DFS traversal to instead produce (byte, node) spans. Note that this is the same data as in the current `GreenNode`, except that there the node span is implicit in the length of the vector, while here the children are implicit in their position in the output. This does essentially mean semantically reverting #19, but the representation proposed here is more compact than both `main` and the pre-#19 representation. In particular, the output is now a sequence of:

```
struct RawGreenNode
    head::SyntaxHead                  # Kind, flags
    byte_span::UInt32                 # Number of bytes covered by this range
    # If NON_TERMINAL_FLAG is set, this is the total number of child nodes
    # Otherwise this is a terminal node (i.e. a token) and this is orig_kind
    node_span_or_orig_kind::UInt32
end
```

The structure is used for both terminals and non-terminals, with the interpretation of the last field differing between them. This is marginally more compact than the token list representation on current `main`, because we do not store the `next_byte` pointer (which would instead have to be recovered from the green tree using the usual `O(log n)` algorithm). However, because we store `node_span`, this data structure provides linear-time traversal (in reverse order) over the children of the current node. In particular, this means that the tree structure is manifest and does not require the allocation of temporary stacks to recover it. As a result, the output buffer can now be used as an efficient, persistent green tree representation.

I think the primary weird thing about this design is that iteration over the children must happen in reverse order. The current `GreenNode` design has constant-time access to all children. Of course, a lookup table for this can be computed in linear time with less memory than the `GreenNode` design, but it's important to point out this limitation. That said, for transformation use cases (e.g. to Expr or Syntax node), constant-time access to the children is not really required (although the children are produced backwards, which looks a little funny). To avoid any disruption to downstream users, the `GreenNode` design itself is not changed to use this faster alternative. We can consider doing so in a later PR.

## Benchmark

The motivation for this change is not performance, but rather representational cleanliness. That said, it's of course imperative that this not degrade performance. Fortunately, the benchmarks show that this is in fact marginally faster for `Expr` construction, largely because we get to avoid the additional memory allocation traffic from having the tree structure explicitly represented. Parse time itself is essentially unchanged, which is unsurprising, since we're primarily changing what's being put into the output (although the parser does a few lookback-style operations in a few places).
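The reverse child traversal described under Proposed design can be illustrated with a small sketch. It is not the JuliaSyntax implementation: `RawNode` is a hypothetical simplification of `RawGreenNode` that replaces `SyntaxHead` with a `Symbol` plus a `Bool`, and it assumes that `node_span` counts all descendants of a non-terminal, so that a subtree occupies `node_span + 1` consecutive slots ending at the node itself.

```julia
# Hypothetical simplification of RawGreenNode, for illustration only.
struct RawNode
    kind::Symbol
    is_nonterminal::Bool
    byte_span::UInt32
    node_span::UInt32   # ASSUMPTION: number of descendants; 0 for terminals here
end

leaf(kind, nbytes) = RawNode(kind, false, nbytes, 0)
inner(kind, nbytes, ndesc) = RawNode(kind, true, nbytes, ndesc)

# Number of buffer slots taken by the subtree whose root sits at index i.
subtree_len(buf, i) = buf[i].is_nonterminal ? Int(buf[i].node_span) + 1 : 1

# Visit the direct children of `parent` from last to first, returning
# (buffer index, byte range) pairs. No stack is needed: each step skips
# backwards over one whole child subtree.
function reverse_children(buf::Vector{RawNode}, parent::Int, parent_first_byte::Int)
    out = Tuple{Int,UnitRange{Int}}[]
    stop = parent - Int(buf[parent].node_span)   # first slot of parent's subtree
    last_byte = parent_first_byte + Int(buf[parent].byte_span) - 1
    i = parent - 1
    while i >= stop
        first_byte = last_byte - Int(buf[i].byte_span) + 1
        push!(out, (i, first_byte:last_byte))
        last_byte = first_byte - 1
        i -= subtree_len(buf, i)
    end
    return out
end

# Post-order buffer for "a+b*c": five one-byte tokens, then `b*c`, then the root.
buf = [leaf(:Identifier, 1), leaf(:+, 1), leaf(:Identifier, 1), leaf(:*, 1),
       leaf(:Identifier, 1), inner(:call, 3, 3), inner(:call, 5, 6)]
```

Byte offsets fall out of the same walk, since the last child ends where its parent ends. Calling `reverse` on the result gives the children in source order, which is one simple way to build the linear-time lookup table mentioned above.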
Keno added a commit that referenced this pull request on Jun 12, 2025
…560)
Co-authored-by: Em Chu <61633163+mlechu@users.noreply.github.com>
c42f added a commit to JuliaLang/julia that referenced this pull request Oct 17, 2025
…uliaSyntax.jl#19)
c42f pushed a commit to JuliaLang/julia that referenced this pull request Oct 17, 2025
…uliaLang/JuliaSyntax.jl#560)
This data rearrangement gives a cleaner separation between tokens (which
keep track of bytes in the source text) vs internal tree nodes (which
keep track of which tokens they cover). As a result it reduces the size
of the intermediate data structures.
As part of rewriting build_tree to use the new data structures it's also
become much faster and building the green tree no longer dominates the
parsing time (probably due to fixing some type stability issues).
With this change we're around 14x the speed of the flisp parser in producing
the green tree, and about 6x faster in producing `Expr` data structures.