Tree data structures

I wanted to collect some thoughts on tree data structures here. We've been discussing this over at https://github.com/julia-vscode/JuliaWorkspaces.jl/issues/7 but I'd like to have a link to that discussion here.  And also a persistent central place to discuss the JuliaSyntax trees.

## Trees we have

### Green trees

Currently, we've got a `GreenNode` inspired by Roslyn and rust-analyzer. This acts as a raw syntax tree with subtrees which are "reusable" in the sense that they can be spliced into another parent green tree. The main benefit of this reusability would be if we implemented incremental parsing.

As discussed in https://github.com/julia-vscode/JuliaWorkspaces.jl/issues/7#issuecomment-1483743325, `GreenNode` being immutable means that the GC pressure of materializing a full green tree is quite low because the leaves are stored inline in the child array of their parent nodes.

Some numbers for base/abstractarray.jl which is a moderately large (~100 kb / 3000 line) Julia file:
* There are 26534 tokens and 32731 nodes but only 6299 allocations in parsing this file to a green tree. In some sense, we are already winning by a factor of 5x here in terms of GC pressure by having `GreenNode` be immutable.
* 81% are leaf nodes which are never separately allocated
* 8% of nodes have only leaves as children. 73 % of these can be reused leading to only 4466 allocations. (But the lookup to do the reuse is slower than just allocating in this case and the win here doesn't seem amazing.)
* For perspective, parsing *without* tree construction via `JuliaSyntax.parse!()` costs only 90 allocations.

```julia
using JuliaSyntax, BenchmarkTools
f = read(joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base", "abstractarray.jl"), String)
# Parse to GreenNode
@benchmark JuliaSyntax.parseall(JuliaSyntax.GreenNode, f)
# Raw parse to ParseStream array data structure
@benchmark JuliaSyntax.parse!(JuliaSyntax.ParseStream(f))
```

### Convenience interface for syntax trivia

`GreenNode` isn't very convenient to use. This is a big hole in the API right now, as there's no other way to access the syntax trivia. See also https://github.com/JuliaLang/JuliaSyntax.jl/issues/2

We could
* Create an iterator interface which makes `GreenNode` more convenient to use
* Or, maybe, enhance `SyntaxNode` with an easier way to access trivia children

### Abstract Syntax - `SyntaxNode`

We've currently got `SyntaxNode` for this - it's somewhat like `Expr` in the sense that
* It contains abstract syntax - syntax trivia aren't accessible in a simple way
* It contains eagerly materialized Julia *values* in the leaves of the tree.
* It is mutable

In contrast to `Expr`, it has the same strict child source ordering and `head`s as `GreenNode`.

`SyntaxNode` also has a `parent` field, but I'm not sure this is worthwhile! A good iterator/cursor interface might be more efficient and convenient?

### Typed Syntax - `TypedSyntaxNode`

The `TypedSyntax` package https://github.com/JuliaDebug/Cthulhu.jl/tree/master/TypedSyntax contains an analog to `SyntaxNode` but with a bunch of type annotations and other useful type-related data in the `TreeData`. See

https://github.com/JuliaDebug/Cthulhu.jl/blob/master/TypedSyntax/src/node.jl

## Generalizing tree attributes - trees we want

What's the right way for analysis passes to attach annotations to a tree? One way to do this is by putting them in the tree data structure itself - this is what the parameterized `TreeNode{TreeData}` achieves (used by `SyntaxNode` and `TypedSyntaxNode`). But this makes it hard to compose different types of passes together without having `TreeData` become fat and unwieldy, or having different passes know about each other.

A way around this is to store only an id per tree node and to store attributes in separate arrays or dictionaries indexed by this id. For sparse attributes an entity component system (ECS) setup Overseer.jl is excellent: attributes are the components, compiler passes are the systems, and tree nodes are the entities. @benchung has been trying this approach as noted in https://github.com/julia-vscode/JuliaWorkspaces.jl/issues/7#issuecomment-1483719069

For non-sparse attributes a set of parallel arrays is simpler though: think of a `DataFrame` or `TypedTable` where the rows are indexed by node ID and the columns are attributes. If we looked into the Julia graphs ecosystem I'd guess they store their attributes this way.

## Tree API: traversal and child access

Currently, one traverses a tree with a "big pile of if statements" based on `kind(node)` and `children(node)` - essentially the `Expr` model of working with trees.

However, this is pretty cumbersome
* Users need to know how Julia source maps to node `kind`s 
* Users need to know what the ordering of children means. User code contains things like `node[3]` - but what does that `3` mean? It's different in each different situation.

This situation is a mess!
* Users need to remember a lot of fiddly detail. Expression manipulation code is hard to read with all the integer literals.
* There's no separation between representation and API: any change to the tree data layout is a breaking change.

I think the right way to fix this is pattern matching: have a `@match` API which expands to the big pile of `if` statements based on `kind` which the user would otherwise have to write by hand. This the approach that rust-analyzer uses (admittedly Rust has language-native pattern matching but we can do it with a macro).  It's also roughly the approach that Ben is taking in [SemanticAST.jl](https://github.com/BenChung/SemanticAST.jl/blob/master/src/analyzer.jl), though that version still requires knowing the meaning of a given child ordering.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Tree data structures #230

Trees we have

Green trees

Convenience interface for syntax trivia

Abstract Syntax - `SyntaxNode`

Typed Syntax - `TypedSyntaxNode`

Generalizing tree attributes - trees we want

Tree API: traversal and child access

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Tree data structures #230

Description

Trees we have

Green trees

Convenience interface for syntax trivia

Abstract Syntax - SyntaxNode

Typed Syntax - TypedSyntaxNode

Generalizing tree attributes - trees we want

Tree API: traversal and child access

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Abstract Syntax - `SyntaxNode`

Typed Syntax - `TypedSyntaxNode`