Description
I wanted to collect some thoughts on tree data structures here. We've been discussing this over at julia-vscode/JuliaWorkspaces.jl#7 but I'd like to have a link to that discussion here. And also a persistent central place to discuss the JuliaSyntax trees.
Trees we have
Green trees
Currently, we've got a GreenNode
inspired by Roslyn and rust-analyzer. This acts as a raw syntax tree with subtrees which are "reusable" in the sense that they can be spliced into another parent green tree. The main benefit of this reusability would be if we implemented incremental parsing.
As discussed in julia-vscode/JuliaWorkspaces.jl#7 (comment), GreenNode
being immutable means that the GC pressure of materializing a full green tree is quite low because the leaves are stored inline in the child array of their parent nodes.
Some numbers for base/abstractarray.jl which is a moderately large (~100 kb / 3000 line) Julia file:
- There are 26534 tokens and 32731 nodes but only 6299 allocations in parsing this file to a green tree. In some sense, we are already winning by a factor of 5x here in terms of GC pressure by having
GreenNode
be immutable. - 81% are leaf nodes which are never separately allocated
- 8% of nodes have only leaves as children. 73 % of these can be reused leading to only 4466 allocations. (But the lookup to do the reuse is slower than just allocating in this case and the win here doesn't seem amazing.)
- For perspective, parsing without tree construction via
JuliaSyntax.parse!()
costs only 90 allocations.
using JuliaSyntax, BenchmarkTools
f = read(joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base", "abstractarray.jl"), String)
# Parse to GreenNode
@benchmark JuliaSyntax.parseall(JuliaSyntax.GreenNode, f)
# Raw parse to ParseStream array data structure
@benchmark JuliaSyntax.parse!(JuliaSyntax.ParseStream(f))
Convenience interface for syntax trivia
GreenNode
isn't very convenient to use. This is a big hole in the API right now, as there's no other way to access the syntax trivia. See also #2
We could
- Create an iterator interface which makes
GreenNode
more convenient to use - Or, maybe, enhance
SyntaxNode
with an easier way to access trivia children
Abstract Syntax - SyntaxNode
We've currently got SyntaxNode
for this - it's somewhat like Expr
in the sense that
- It contains abstract syntax - syntax trivia aren't accessible in a simple way
- It contains eagerly materialized Julia values in the leaves of the tree.
- It is mutable
In contrast to Expr
, it has the same strict child source ordering and head
s as GreenNode
.
SyntaxNode
also has a parent
field, but I'm not sure this is worthwhile! A good iterator/cursor interface might be more efficient and convenient?
Typed Syntax - TypedSyntaxNode
The TypedSyntax
package https://github.com/JuliaDebug/Cthulhu.jl/tree/master/TypedSyntax contains an analog to SyntaxNode
but with a bunch of type annotations and other useful type-related data in the TreeData
. See
https://github.com/JuliaDebug/Cthulhu.jl/blob/master/TypedSyntax/src/node.jl
Generalizing tree attributes - trees we want
What's the right way for analysis passes to attach annotations to a tree? One way to do this is by putting them in the tree data structure itself - this is what the parameterized TreeNode{TreeData}
achieves (used by SyntaxNode
and TypedSyntaxNode
). But this makes it hard to compose different types of passes together without having TreeData
become fat and unwieldy, or having different passes know about each other.
A way around this is to store only an id per tree node and to store attributes in separate arrays or dictionaries indexed by this id. For sparse attributes an entity component system (ECS) setup Overseer.jl is excellent: attributes are the components, compiler passes are the systems, and tree nodes are the entities. @BenChung has been trying this approach as noted in julia-vscode/JuliaWorkspaces.jl#7 (comment)
For non-sparse attributes a set of parallel arrays is simpler though: think of a DataFrame
or TypedTable
where the rows are indexed by node ID and the columns are attributes. If we looked into the Julia graphs ecosystem I'd guess they store their attributes this way.
Tree API: traversal and child access
Currently, one traverses a tree with a "big pile of if statements" based on kind(node)
and children(node)
- essentially the Expr
model of working with trees.
However, this is pretty cumbersome
- Users need to know how Julia source maps to node
kind
s - Users need to know what the ordering of children means. User code contains things like
node[3]
- but what does that3
mean? It's different in each different situation.
This situation is a mess!
- Users need to remember a lot of fiddly detail. Expression manipulation code is hard to read with all the integer literals.
- There's no separation between representation and API: any change to the tree data layout is a breaking change.
I think the right way to fix this is pattern matching: have a @match
API which expands to the big pile of if
statements based on kind
which the user would otherwise have to write by hand. This the approach that rust-analyzer uses (admittedly Rust has language-native pattern matching but we can do it with a macro). It's also roughly the approach that Ben is taking in SemanticAST.jl, though that version still requires knowing the meaning of a given child ordering.