Skip to content

Tree data structures #230

Open
Open
@c42f

Description

@c42f

I wanted to collect some thoughts on tree data structures here. We've been discussing this over at julia-vscode/JuliaWorkspaces.jl#7 but I'd like to have a link to that discussion here. And also a persistent central place to discuss the JuliaSyntax trees.

Trees we have

Green trees

Currently, we've got a GreenNode inspired by Roslyn and rust-analyzer. This acts as a raw syntax tree with subtrees which are "reusable" in the sense that they can be spliced into another parent green tree. The main benefit of this reusability would be if we implemented incremental parsing.

As discussed in julia-vscode/JuliaWorkspaces.jl#7 (comment), GreenNode being immutable means that the GC pressure of materializing a full green tree is quite low because the leaves are stored inline in the child array of their parent nodes.

Some numbers for base/abstractarray.jl which is a moderately large (~100 kb / 3000 line) Julia file:

  • There are 26534 tokens and 32731 nodes but only 6299 allocations in parsing this file to a green tree. In some sense, we are already winning by a factor of 5x here in terms of GC pressure by having GreenNode be immutable.
  • 81% are leaf nodes which are never separately allocated
  • 8% of nodes have only leaves as children. 73 % of these can be reused leading to only 4466 allocations. (But the lookup to do the reuse is slower than just allocating in this case and the win here doesn't seem amazing.)
  • For perspective, parsing without tree construction via JuliaSyntax.parse!() costs only 90 allocations.
using JuliaSyntax, BenchmarkTools
f = read(joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base", "abstractarray.jl"), String)
# Parse to GreenNode
@benchmark JuliaSyntax.parseall(JuliaSyntax.GreenNode, f)
# Raw parse to ParseStream array data structure
@benchmark JuliaSyntax.parse!(JuliaSyntax.ParseStream(f))

Convenience interface for syntax trivia

GreenNode isn't very convenient to use. This is a big hole in the API right now, as there's no other way to access the syntax trivia. See also #2

We could

  • Create an iterator interface which makes GreenNode more convenient to use
  • Or, maybe, enhance SyntaxNode with an easier way to access trivia children

Abstract Syntax - SyntaxNode

We've currently got SyntaxNode for this - it's somewhat like Expr in the sense that

  • It contains abstract syntax - syntax trivia aren't accessible in a simple way
  • It contains eagerly materialized Julia values in the leaves of the tree.
  • It is mutable

In contrast to Expr, it has the same strict child source ordering and heads as GreenNode.

SyntaxNode also has a parent field, but I'm not sure this is worthwhile! A good iterator/cursor interface might be more efficient and convenient?

Typed Syntax - TypedSyntaxNode

The TypedSyntax package https://github.com/JuliaDebug/Cthulhu.jl/tree/master/TypedSyntax contains an analog to SyntaxNode but with a bunch of type annotations and other useful type-related data in the TreeData. See

https://github.com/JuliaDebug/Cthulhu.jl/blob/master/TypedSyntax/src/node.jl

Generalizing tree attributes - trees we want

What's the right way for analysis passes to attach annotations to a tree? One way to do this is by putting them in the tree data structure itself - this is what the parameterized TreeNode{TreeData} achieves (used by SyntaxNode and TypedSyntaxNode). But this makes it hard to compose different types of passes together without having TreeData become fat and unwieldy, or having different passes know about each other.

A way around this is to store only an id per tree node and to store attributes in separate arrays or dictionaries indexed by this id. For sparse attributes an entity component system (ECS) setup Overseer.jl is excellent: attributes are the components, compiler passes are the systems, and tree nodes are the entities. @BenChung has been trying this approach as noted in julia-vscode/JuliaWorkspaces.jl#7 (comment)

For non-sparse attributes a set of parallel arrays is simpler though: think of a DataFrame or TypedTable where the rows are indexed by node ID and the columns are attributes. If we looked into the Julia graphs ecosystem I'd guess they store their attributes this way.

Tree API: traversal and child access

Currently, one traverses a tree with a "big pile of if statements" based on kind(node) and children(node) - essentially the Expr model of working with trees.

However, this is pretty cumbersome

  • Users need to know how Julia source maps to node kinds
  • Users need to know what the ordering of children means. User code contains things like node[3] - but what does that 3 mean? It's different in each different situation.

This situation is a mess!

  • Users need to remember a lot of fiddly detail. Expression manipulation code is hard to read with all the integer literals.
  • There's no separation between representation and API: any change to the tree data layout is a breaking change.

I think the right way to fix this is pattern matching: have a @match API which expands to the big pile of if statements based on kind which the user would otherwise have to write by hand. This the approach that rust-analyzer uses (admittedly Rust has language-native pattern matching but we can do it with a macro). It's also roughly the approach that Ben is taking in SemanticAST.jl, though that version still requires knowing the meaning of a given child ordering.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions