
Output of the lexer #42

@mattheww

If the plan is to document the lexer separately from the grammar, there are choices to make about how to describe its output.

One choice is whether the lexer's output for punctuation should be described as consisting of fine-grained (single-character) tokens (like the ones procedural macros use), compound tokens (like the tt fragment specifier in macros-by-example uses), or some mixture.

If the lexer doesn't use compound tokens, there's then a choice of how to allow the lexer's clients to distinguish (say) && from & & in cases where it's necessary.
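For instance, a grammar working over a compound-token lexer's output has to accept both of the following, even though the first line contains a single && token (a minimal runnable sketch):

    fn main() {
        // With a compound-token lexer the first line contains one `&&`
        // token; with a fine-grained lexer both lines look the same.
        let a: &&i32 = &&1;
        let b: & &i32 = & &1;
        assert_eq!(a, b);
    }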

There's also a choice of whether lifetimes use fine-grained or compound tokens.

Clients of the lexer

There are three parts of Rust which consume the lexer's output:

  • the main Rust parser
  • the Procedural Macros system
  • the "Macros by example" system

The proc-macro system works in terms of fine-grained tokens; macros-by-example (when using the tt fragment specifier) works in terms of compound tokens.
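For instance (a minimal sketch), a by-example macro that counts token trees sees a && b as three tts, where a proc macro would see four tokens:

    macro_rules! count_tts {
        () => { 0 };
        ($head:tt $($rest:tt)*) => { 1 + count_tts!($($rest)*) };
    }

    fn main() {
        // `&&` is a single compound token here, so `a && b` is three
        // token trees; a proc macro sees `a`, `&`, `&`, `b` instead.
        assert_eq!(count_tts!(a && b), 3);
    }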

The parser could be specified using either fine-grained or compound tokens, or a mixture of the two.

I think in practice this means that, whatever choice is made for the main description of the lexer, the "Lexing/tokenization" chapter should have additional text describing how to convert its output to the other form(s), which the chapters for macros could refer to.
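Such a conversion is essentially "gluing" by maximal munch. As a rough illustration (the token representation and the COMPOUND table here are illustrative, not rustc's):

    /// Spacing between a fine-grained token and the token after it.
    #[derive(Clone, Copy, PartialEq)]
    enum Spacing { Alone, Joint }

    /// The compound punctuation tokens to glue towards.
    const COMPOUND: &[&str] = &[
        "<<=", ">>=", "...", "..=", "<=", "==", "!=", ">=", "&&", "||",
        "..", "::", "->", "<-", "=>", "<<", ">>", "+=", "-=", "*=", "/=",
        "%=", "^=", "&=", "|=",
    ];

    /// Glue fine-grained punctuation tokens into compound ones.
    fn glue(tokens: &[(char, Spacing)]) -> Vec<String> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < tokens.len() {
            let mut len = 1; // fall back to a single-character token
            for cand in COMPOUND {
                let n = cand.len(); // all candidates are ASCII
                if n > len
                    && i + n <= tokens.len()
                    && cand.chars().zip(&tokens[i..i + n]).all(|(c, &(t, _))| c == t)
                    // every char but the last must be followed without whitespace
                    && tokens[i..i + n - 1].iter().all(|&(_, s)| s == Spacing::Joint)
                {
                    len = n;
                }
            }
            out.push(tokens[i..i + len].iter().map(|&(c, _)| c).collect());
            i += len;
        }
        out
    }

    fn main() {
        // `& &` stays two tokens; `&&` glues into one.
        assert_eq!(glue(&[('&', Spacing::Alone), ('&', Spacing::Alone)]),
                   vec!["&", "&"]);
        assert_eq!(glue(&[('&', Spacing::Joint), ('&', Spacing::Alone)]),
                   vec!["&&"]);
    }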

Input to the parser

(I'm assuming that the intent is to accurately describe rustc's current behaviour, for the "descriptive" view of the spec.)

Using compound tokens

If the grammar is defined in terms of compound tokens, there are rules which need to match both a single-character token and a compound token beginning with that character:

  • && in BorrowExpression, ReferencePattern, and ReferenceType

  • || in ClosureExpression

  • < in many places (I think everywhere it appears, except comparison expressions)

  • > in many places (I think everywhere it appears, except comparison expressions)
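For example, all of the following are accepted Rust, so those rules must match the compound tokens too (a sketch):

    fn main() {
        let x = 1;
        let r: &&i32 = &&x;   // `&&` where ReferenceType/BorrowExpression use `&`
        let f = || 0;         // `||` where ClosureExpression uses `|` ... `|`
        let v: Vec<Vec<u8>> = Vec::new(); // `>>` where two `>` close generics
        let _ = (r, f(), v);
    }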

Brian Leibig's grammar gives a reasonable idea of how this approach might end up.

With this approach, the description of how the parsed source is converted to an AST would also have to deal with this complication.

Using fine-grained tokens

If the grammar is defined in terms of fine-grained tokens, its rules will sometimes need to make a distinction depending on whether punctuation tokens were separated by whitespace.

For example:

  • a << b must be accepted as a left-shift expression, but a < < b must not
  • a && b must be taken as a LazyBooleanExpression, not treated as ambiguous with a & &b
    • but a &- b must be treated as equivalent to a & -b

I think that for <, >, and | it's sufficient to know whether there is spacing before the next token, but the second example above shows that for & what matters is whether it's immediately followed by another &.
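A sketch of these cases as they appear in source:

    fn main() {
        let (a, b) = (1u32, 2);
        let _shift = a << b;        // one `<<`: a left-shift expression
        // let _err = a < < b;      // rejected: spaced `<` `<` is not a shift

        let _lazy = true && false;  // `&&`: a LazyBooleanExpression
        let (c, d) = (1i32, 2i32);
        let _bitand = c &- d;       // accepted: equivalent to `c & -d`
    }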

One approach might be to describe the lexer's output as including explicit whitespace tokens (and presumably introduce some convenience notation in the grammar so that they don't usually have to be mentioned).

Another approach might be to say that there are two distinct tokens used to represent each relevant punctuation character, as suggested by this comment:

You distinguish "> followed by >" as one token and "> followed by something else" as another kind of token.

A third approach might be to use some form of attribute grammar, and say that the input tokens have attributes like "Spacing" and "Is immediately followed by another &".
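A hypothetical shape for that third approach (the names here are illustrative, not rustc's or proc_macro's):

    /// A fine-grained punctuation token carrying the attributes the
    /// grammar rules above would consult.
    struct PunctToken {
        ch: char,
        /// Whether the next token follows without intervening whitespace.
        joint: bool,
        /// Whether the next character is another `&` (what matters for
        /// telling `a && b` apart from `a & &b` and `a &- b`).
        followed_by_amp: bool,
    }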

With any of these approaches, I think the description of how the parsed source is converted to an AST would remain reasonably simple.

Rejected joined forms when using fine-grained tokens

There are some cases where a parse that might naturally be allowed must be rejected because punctuation characters are not separated by whitespace.

I know of the following:

  • SomeStruct { field:::std::i32::MAX } must be rejected, not treated as equivalent to SomeStruct { field: ::std::i32::MAX }

  • fn f(s:::std::string::String) must be rejected, not treated as equivalent to fn f(s: ::std::string::String)

I expect there are more of a similar sort.

This is perhaps an argument for using a compound token for :: even if fine-grained tokens are used in other cases.
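To illustrate why: with a compound :: token and maximal munch, the rejection falls out of the grammar (a sketch; std::i32::MAX is the deprecated module constant used in the examples above):

    struct SomeStruct { field: i32 }

    fn main() {
        // Accepted: whitespace separates `:` from `::`.
        let _ = SomeStruct { field: ::std::i32::MAX };
        // Rejected: `SomeStruct { field:::std::i32::MAX }` -- maximal
        // munch lexes `:::` as `::` `:`, and `field ::` cannot begin a
        // field initializer.
    }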

There's also this:

  • a <- b must be rejected, not taken as equivalent to a < -b

but perhaps that's really analogous to a && b (an ambiguity is being resolved in favour of an obsolete unstable feature).

Input to procedural macros

Procedural macros see fine-grained Punct tokens, which have a Spacing property which indicates whether there was whitespace before the following token.

Lifetime-or-label ('foo) is represented as a Punct token with joint spacing followed by an Ident token.
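A minimal sketch of observing this (it must live in a crate with proc-macro = true; show_spacing is a hypothetical name):

    use proc_macro::{TokenStream, TokenTree};

    #[proc_macro]
    pub fn show_spacing(input: TokenStream) -> TokenStream {
        for tree in input {
            if let TokenTree::Punct(p) = tree {
                // For input `a && b` this prints `& Joint` then `& Alone`;
                // for `a & & b` both `&`s are Alone. For `'foo`, the `'`
                // is Joint and is followed by an Ident.
                eprintln!("{} {:?}", p.as_char(), p.spacing());
            }
        }
        TokenStream::new()
    }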

The documentation as of Rust 1.76 has this to say about Punct's Spacing property:

Joint

[…] in token streams parsed from source code, the compiler will only set spacing to Joint in the following cases.

  • When a Punct is immediately followed by another Punct without a whitespace. E.g. + is Joint in += and ++.

  • When a single quote ' is immediately followed by an identifier without a whitespace. E.g. ' is Joint in 'lifetime.

This list may be extended in the future to enable more token combinations.

Alone

[…] In token streams parsed from source code, the compiler will set spacing to Alone in all cases not covered by the conditions for Joint above.
E.g. + is Alone in + =, +ident and +().
In particular, tokens not followed by anything will be marked as Alone.

Input to "Macros by example" macros

By-example macros using the tt fragment specifier see the following combinations of punctuation as compound tokens:

  <=
  ==
  !=
  >=
  &&
  ||
  ..
  ...
  ..=
  ::
  ->
  <-
  =>
  <<
  >>
  +=
  -=
  *=
  /=
  %=
  ^=
  &=
  |=
  <<=
  >>=

Lifetime-or-label ('foo) is represented as a single token.
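A minimal sketch of observing the compound tokens from inside a macro:

    macro_rules! one_tt {
        ($single:tt) => { true };
        ($($many:tt)*) => { false };
    }

    fn main() {
        assert!(one_tt!(..=));    // `..=` is a single compound token
        assert!(one_tt!('foo));   // lifetime-or-label is a single token
        assert!(!one_tt!(&&&));   // lexes as `&&` `&`: two token trees
    }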

Sources of information

The current rustc implementation

(As of rustc 1.76)

The lower-level lexer in rustc_lexer emits fine-grained tokens for punctuation (but compound tokens for lifetime-or-label). It emits explicit tokens representing whitespace.

The higher-level lexer in rustc_parse::lexer emits compound tokens for punctuation. It represents whitespace as 'Spacing' information describing the relationship to the following token.

The parser breaks those compound tokens up again where necessary.

(As I understand it, there is consensus that ideally rustc would switch to using fine-grained tokens internally.)

The breaking-up takes place in the following functions in rustc_parse, which could serve as an audit list for the places where a grammar based on compound tokens would need special handling.

  • expect_and()
  • expect_or()
  • eat_lt()
  • expect_lt()
  • expect_gt()
  • eat_plus()
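For instance, expect_gt() is what allows a single >> token to close two generic argument lists:

    fn main() {
        // The `>>` here is one compound token that the parser splits
        // into two `>` tokens.
        let v: Vec<Vec<u8>> = Vec::new();
        let _ = v;
    }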

In particular, rustc takes care to split += when the + appears in generic bounds, but I think that isn't currently making any difference (I don't think a following = can ever parse).

Currently maintained documentation

The Reference

The Lexical structure chapter of the Reference uses compound tokens for punctuation and lifetime-or-label.

The grammar used in the Reference is written in a mixed style.

In most places it's written as if it's working with fine-grained tokens; notation like && can be taken as representing two & tokens without intervening whitespace.

In three cases (BorrowExpression, ReferencePattern, and ClosureExpression) it lists both the fine-grained and the compound tokens explicitly, sometimes with an explanation in the text.

It treats LIFETIME_OR_LABEL as a single token.

The Ferrocene spec

The Ferrocene spec's lexical description and grammar are close to the Reference's.

The lexical part was forked from the Reference sometime in late 2021, and has no interesting additions.

Its grammar doesn't include the separately-listed compound tokens that the Reference has.

Older formalisations

Rustypop

Rustypop from 2016 used fine-grained tokens to deal with <<, >>, &&, and ||.

Otherwise it used compound tokens (in particular, for <=, <<=, <-, >=, >>=, and ::).

Its README describes its approach to dealing with multiple-punctuation symbols in the grammar.

Brian Leibig's grammar

Brian Leibig's grammar (last updated in 2017) uses compound tokens throughout.

It's the only grammar of that sort I've found that tries to deal with all cases where compound tokens need to be accepted as if they were multiple fine-grained ones.

wg-grammar

The wg-grammar grammar from 2019 appears to have used compound tokens for both punctuation and lifetimes.

I don't think that project got as far as dealing with the resulting complications. rust-lang/wg-grammar#47 has a little discussion.

rust-lang/wg-grammar#3 includes extensive discussion on how the lexer might be modelled.

One result of that discussion was https://github.com/CAD97/rust-lexical-spec, which would have output fine-grained punctuation tokens and explicit whitespace tokens.
