Thanks for your interest in contributing to this project! Below you'll find a general explanation of the project and how to run it locally.
To get more familiar with tree-sitter itself and writing tree-sitter grammars, you may want to read https://tree-sitter.github.io/tree-sitter/creating-parsers.
Most tree-sitter grammars are written using a single `grammar.js` file with a declarative syntax.
But reStructuredText isn't a programming language with a well-defined specification: it has a lot of edge cases, and the same text can have a different meaning depending on the context it appears in or its indentation level.
Tree-sitter is flexible enough to let us write some rules in C (an external scanner), and for the reasons above our grammar makes heavy use of this feature.
Tree-sitter is an LR(k) parser, so we can't backtrack; our external scanner must share some logic while recognizing some nodes. For example, if we find a `*` character, we first check whether it's a list element, then an emphasis node, then a strong node, and so on. Most of the time, when something isn't a recognizable node it is interpreted as plain text.

The external scanner also allows us to keep some state between each parse of a node; this is currently used to keep track of the indentation levels.
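To make that dispatch style concrete, here is a minimal, hypothetical sketch using tree-sitter's external scanner API. It is not the project's actual code: the token names and the checks are made up for the example, and the real logic in `src/tree_sitter_rst/` does far more work.

```c
#include <tree_sitter/parser.h>
#include <stdbool.h>

// Hypothetical token names, for illustration only.
enum TokenType { BULLET, EMPHASIS };

bool tree_sitter_rst_external_scanner_scan(void *payload, TSLexer *lexer,
                                           const bool *valid_symbols) {
  (void)payload;
  if (lexer->lookahead == '*') {
    // Consume the '*' and inspect what follows.
    lexer->advance(lexer, false);

    // A '*' followed by a space could be a list bullet.
    if (valid_symbols[BULLET] && lexer->lookahead == ' ') {
      lexer->result_symbol = BULLET;
      return true;
    }

    // Otherwise try the inline-markup interpretations (emphasis, strong, ...).
    // The real scanner does far more checking than this.
    if (valid_symbols[EMPHASIS] && lexer->lookahead != ' ') {
      lexer->result_symbol = EMPHASIS;
      return true;
    }
  }
  // Returning false means "no external token here"; the text is then
  // handled by the grammar's regular rules (usually as plain text).
  return false;
}
```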
Most of the files in the repository are auto-generated by tree-sitter; they are needed for the grammar to be compiled easily on the user's computer, so they are committed to the repository.
Some of the files that aren't auto-generated are:
`grammar.js`
: it defines all the nodes that our grammar has and their structure.

`src/scanner.c`
: the entry point to our custom scanner. To make it easier to maintain, the code that isn't auto-generated is inside the `src/tree_sitter_rst/` directory.

`src/tree_sitter_rst/scanner.c`
: it contains the functions used to create/serialize/de-serialize our custom scanner, and it also has the main entry point to our custom scanner: `rst_scanner_scan` (AKA, the big collection of `if`s).

`src/tree_sitter_rst/tokens.h`
: it defines all the tokens that our external scanner recognizes; they are the same ones declared in the `externals` attribute of our `grammar.js` file (see the sketch after this list).

`src/tree_sitter_rst/chars.c`
: some utility functions to recognize characters, like numbers, bullets, letters, etc.

`src/tree_sitter_rst/parser.c`
: here are all the functions that match the current text being parsed to a valid `token`.

`test/corpus/`
: tests for our grammar, so we are sure nothing breaks when changing stuff. You can read about the syntax at https://tree-sitter.github.io/tree-sitter/creating-parsers#command-test.

`test/examples/`
: these are the files that docutils uses to run their tests. We parse them without checking the resulting CST; we only care whether our parser errors in the process.

`docs/`
: this directory is deployed to GitHub Pages at https://stsewd.dev/tree-sitter-rst/.
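One detail worth keeping in mind when editing `tokens.h` and `grammar.js` together: tree-sitter identifies external tokens by their position, so the enum in `tokens.h` has to stay in the same order as the `externals` array in `grammar.js`. Here is a hypothetical sketch of the idea (the names below are made up; see the real file for the actual list):

```c
// Hypothetical sketch of the idea behind src/tree_sitter_rst/tokens.h.
// Tree-sitter identifies external tokens by index, so this enum must be
// kept in the same order as the `externals` array in grammar.js.
typedef enum {
  T_NEWLINE,    // externals[0] in grammar.js
  T_INDENT,     // externals[1]
  T_DEDENT,     // externals[2]
  T_BULLET,     // ...
  T_EMPHASIS,
} TokenType;
```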
Requirements:
- Node
- A C compiler (clang is preferred)
- Docker (only if you want to see your changes in the browser)
Install the project's dependencies with:

```
npm install
```
To build the grammar:
```
npm run build
```
To run the tests:
```
npm run test
```
Note: if you changed the grammar, you need to re-build it for tests to use the new grammar.
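For reference, the files in `test/corpus/` use tree-sitter's corpus test format: a test name framed by lines of `=`, the input text, a `---` separator, and the expected syntax tree as an S-expression. A minimal, made-up example (the node names are only illustrative; check the existing files for the real ones):

```
==================
A simple paragraph
==================

Hello world.

---

(document
  (paragraph))
```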
Test the grammar by parsing a file:
```
npm run parse -- test.rst
```
Test the grammar in your browser:

```
npm run web
```
Note: if you changed the grammar, you need to rebuild it and run `npm run wasm` (requires Docker).
Sometimes you may find it useful to compare against the output of docutils for a given RST document, since the reStructuredText specification doesn't cover or explain all edge cases.
```
pip install docutils
rst2html5.py test.rst out.html
xdg-open out.html
```