Description
If the plan is to document the lexer separately from the grammar, there are choices to make about how to describe its output.
One choice is whether the lexer's output for punctuation should be described as consisting of fine-grained (single-character) tokens (like the ones procedural macros use), compound tokens (like the ones the `tt` fragment specifier in macros-by-example uses), or some mixture.
If the lexer doesn't use compound tokens, there's then a choice of how to allow the lexer's clients to distinguish (say) `&&` from `& &` in cases where it's necessary.
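For example, both of the following lines (ordinary Rust, shown only to illustrate the distinction) lex to the same sequence of fine-grained `&` tokens, yet must parse differently:

```rust
fn main() {
    let (a, b) = (true, false);
    let x = a && b; // `&&`: lazy boolean AND (LazyBooleanExpression)
    let y = a & &b; // `&` then `&b`: bitwise AND with a reference to b
    assert_eq!(x, y);
}
```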
There's also a choice of whether lifetimes use fine-grained or compound tokens.
Clients of the lexer
There are three parts of Rust which consume the lexer's output:
- the main Rust parser
- the Procedural Macros system
- the "Macros by example" system
The proc-macro system works in terms of fine-grained tokens; macros-by-example (when using the `tt` fragment specifier) works in terms of compound tokens.
The parser could be specified using either fine-grained or compound tokens, or a mixture of the two.
I think in practice this means that, whatever choice is made for the main description of the lexer, the "Lexing/tokenization" chapter should have additional text describing how to convert its output to the other form(s), which the chapters for macros could refer to.
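As a rough illustration, here is a minimal sketch of one direction of such a conversion, gluing fine-grained punctuation tokens into compound ones. The names and types are hypothetical (nothing here comes from rustc or any spec draft), and a real description would glue only the recognized compound operators rather than arbitrary joint runs, as this toy version does:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Spacing {
    Alone, // whitespace (or a non-punctuation token) follows
    Joint, // another punctuation character follows immediately
}

/// Toy conversion: concatenate runs of Joint-spaced punctuation
/// characters into compound tokens.
fn glue(fine_grained: &[(char, Spacing)]) -> Vec<String> {
    let mut compound = Vec::new();
    let mut current = String::new();
    for &(ch, spacing) in fine_grained {
        current.push(ch);
        if spacing == Spacing::Alone {
            compound.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        compound.push(current);
    }
    compound
}

fn main() {
    // `&&`: the first `&` is Joint, so the pair glues into one token.
    assert_eq!(glue(&[('&', Spacing::Joint), ('&', Spacing::Alone)]), ["&&"]);
    // `& &`: both Alone, so they stay separate.
    assert_eq!(glue(&[('&', Spacing::Alone), ('&', Spacing::Alone)]), ["&", "&"]);
}
```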
Input to the parser
(I'm assuming that the intent is to accurately describe rustc's current behaviour, for the "descriptive" view of the spec.)
Using compound tokens
If the grammar is defined in terms of compound tokens, there are rules which need to match both a single-character token and a compound token beginning with that character:
- `&&` in BorrowExpression, ReferencePattern, and ReferenceType
- `||` in ClosureExpression
- `<` in many places (I think everywhere it appears, except comparison expressions)
- `>` in many places (I think everywhere it appears, except comparison expressions)
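For concreteness, source like the following (ordinary Rust) exercises each of these cases:

```rust
fn main() {
    let x = 1;
    let r: &&i32 = &&x; // compound `&&` accepted where ReferenceType and
                        // BorrowExpression each expect two `&` tokens
    let v: Vec<Vec<i32>> = Vec::new(); // compound `>>` accepted as two closing `>`
    let f = || 0;       // compound `||` accepted as an empty closure parameter list
    let _ = (r, v, f());
}
```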
Brian Leibig's grammar gives a reasonable idea of how this approach might end up.
With this approach, the description of how the parsed source is converted to an AST would also have to deal with this complication.
Using fine-grained tokens
If the grammar is defined in terms of fine-grained tokens, its rules will sometimes need to make a distinction depending on whether punctuation tokens were separated by whitespace.
For example:
- `a << b` must be accepted as a left-shift expression, but `a < < b` must not
- `a && b` must be taken as a LazyBooleanExpression, not treated as ambiguous with `a & &b`, but `a &- b` must be treated as equivalent to `a & -b`
I think that for `<`, `>`, and `|` it's sufficient to know whether there is spacing before the next token, but the second example above shows that for `&` what matters is whether it's immediately followed by another `&`.
One approach might be to describe the lexer's output as including explicit whitespace tokens (and presumably introduce some convenience notation in the grammar so that they don't usually have to be mentioned).
Another approach might be to say that there are two distinct tokens used to represent each relevant punctuation character, as suggested by this comment:
You distinguish "
>
followed by>
" as one token and ">
followed by something else" as another kind of token.
A third approach might be to use some form of attribute grammar, and say that the input tokens have attributes like "Spacing" and "Is immediately followed by another &
".
With any of these approaches, I think the description of how the parsed source is converted to an AST would remain reasonably simple.
Rejected joined forms when using fine-grained tokens
There are some cases where parses which might naturally be allowed must be rejected when punctuation characters are not separated by whitespace.
I know of the following:
- `SomeStruct { field:::std::i32::MAX }` must be rejected, not treated as equivalent to `SomeStruct { field: ::std::i32::MAX }`
- `fn f(s:::std::string::String)` must be rejected, not treated as equivalent to `fn f(s: ::std::string::String)`
I expect there are more of a similar sort.
This is perhaps an argument for using a compound token for `::` even if fine-grained tokens are used in other cases.
There's also this: `a <- b` must be rejected, not taken as equivalent to `a < -b`, but perhaps that's really analogous to `a && b` (an ambiguity is being resolved in favour of an obsolete unstable feature).
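Both kinds of rejection are easy to check (a sketch in ordinary Rust; the commented-out lines fail to compile if restored):

```rust
struct SomeStruct { field: i32 }

fn main() {
    // Accepted: `:` and the path's leading `::` separated by whitespace.
    let s = SomeStruct { field: ::std::i32::MAX };
    let _ = s;

    let (a, b) = (1, 2);
    let _ = a < -b;   // accepted: `<` then a negated operand
    // let _ = a <- b;                               // rejected: `<-`
    // let s = SomeStruct { field:::std::i32::MAX }; // rejected: `:::`
}
```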
Input to procedural macros
Procedural macros see fine-grained `Punct` tokens, which have a `Spacing` property which indicates whether there was whitespace before the following token.
Lifetime-or-label (`'foo`) is represented as a `Punct` token with joint spacing followed by an `Ident` token.
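For example, a diagnostic macro along these lines (a sketch; the name `show_spacing` is mine, and it assumes a crate with `proc-macro = true` in its manifest) can observe the property directly:

```rust
use proc_macro::{TokenStream, TokenTree};

/// Prints each punctuation token in the input together with its spacing.
#[proc_macro]
pub fn show_spacing(input: TokenStream) -> TokenStream {
    for tree in input {
        if let TokenTree::Punct(p) = tree {
            eprintln!("{:?} -> {:?}", p.as_char(), p.spacing());
        }
    }
    TokenStream::new()
}
```

Invoked as `show_spacing!(a && b)`, it reports the first `&` as `Joint`; with `a & & b`, as `Alone`.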
The documentation as of Rust 1.76 has this to say about `Punct`'s `Spacing` property:
> `Joint` […] in token streams parsed from source code, the compiler will only set spacing to `Joint` in the following cases.
>
> - When a `Punct` is immediately followed by another `Punct` without a whitespace. E.g. `+` is `Joint` in `+=` and `++`.
> - When a single quote `'` is immediately followed by an identifier without a whitespace. E.g. `'` is `Joint` in `'lifetime`.
>
> This list may be extended in the future to enable more token combinations.
>
> `Alone` […] In token streams parsed from source code, the compiler will set spacing to `Alone` in all cases not covered by the conditions for `Joint` above. E.g. `+` is `Alone` in `+ =`, `+ident` and `+()`. In particular, tokens not followed by anything will be marked as `Alone`.
Input to "Macros by example" macros
By-example macros using the `tt` fragment specifier see the following combinations of punctuation as compound tokens:
`<=` `==` `!=` `>=` `&&` `||` `..` `...` `..=` `::` `->` `<-` `=>` `<<` `>>` `+=` `-=` `*=` `/=` `%=` `^=` `&=` `|=` `<<=` `>>=`
Lifetime-or-label (`'foo`) is represented as a single token.
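A quick way to observe this (a sketch; `count_tts` is a made-up name for the usual token-tree-counting macro):

```rust
macro_rules! count_tts {
    () => { 0 };
    ($head:tt $($rest:tt)*) => { 1 + count_tts!($($rest)*) };
}

fn main() {
    assert_eq!(count_tts!(a && b), 3);  // `&&` matches a single `tt`
    assert_eq!(count_tts!(a & & b), 4); // two separate `&` tokens
    assert_eq!(count_tts!('foo), 1);    // lifetime-or-label is one token
}
```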
Sources of information
The current rustc implementation
(As of rustc 1.76)
The lower-level lexer in `rustc_lexer` emits fine-grained tokens for punctuation (but compound tokens for lifetime-or-label). It emits explicit tokens representing whitespace.
The higher-level lexer in `rustc_parse::lexer` emits compound tokens for punctuation. It represents whitespace as 'Spacing' information describing the relationship to the following token.
The parser breaks those compound tokens up again where necessary.
(As I understand it, there is consensus that ideally rustc would switch to using fine-grained tokens internally.)
The breaking-up takes place in the following functions in `rustc_parse`, which might be used to audit for cases where a grammar based on compound tokens would need special cases:

- `expect_and()`
- `expect_or()`
- `eat_lt()`
- `expect_lt()`
- `expect_gt()`
- `eat_plus()`
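In spirit (a hypothetical sketch, not rustc's actual code), each of these does something like:

```rust
#[derive(Debug, PartialEq)]
enum Token { And, AndAnd, Other }

/// Sketch of `expect_and()`-style splitting: when the parser wants a
/// single `&` but the current token is the compound `&&`, it consumes
/// one `&` and leaves a single `&` behind as the current token.
fn expect_and(current: &mut Token) -> Result<(), String> {
    match current {
        Token::And => {
            *current = Token::Other; // stand-in for advancing to the next token
            Ok(())
        }
        Token::AndAnd => {
            *current = Token::And; // split: the second `&` remains to be consumed
            Ok(())
        }
        _ => Err("expected `&`".to_string()),
    }
}

fn main() {
    let mut t = Token::AndAnd;
    assert_eq!(expect_and(&mut t), Ok(()));
    assert_eq!(t, Token::And); // one `&` left over after the split
}
```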
In particular, rustc takes care to split `+=` when the `+` appears in generic bounds, but I think that isn't currently making any difference (I don't think a following `=` can ever parse).
Currently maintained documentation
The Reference
The Lexical structure chapter of the Reference uses compound tokens for punctuation and lifetime-or-label.
The grammar used in the Reference is written in a mixed style.
In most places it's written as if it's working with fine-grained tokens; notation like `&&` can be taken as representing two `&` tokens without intervening whitespace.
In three cases (BorrowExpression, ReferencePattern, and ClosureExpression) it lists both the fine-grained and the compound tokens explicitly, sometimes with an explanation in the text.
It treats `LIFETIME_OR_LABEL` as a single token.
The Ferrocene spec
The Ferrocene spec's lexical description and grammar are close to the Reference's.
The lexical part was forked from the Reference sometime in late 2021, and has no interesting additions.
Its grammar doesn't include the separately-listed compound tokens that the Reference has.
Older formalisations
Rustypop
Rustypop from 2016 used fine-grained tokens to deal with `<<`, `>>`, `&&`, and `||`.
Otherwise it used compound tokens (in particular, for `<=`, `<<=`, `<-`, `>=`, `>>=`, and `::`).
Its README describes its approach to dealing with multiple-punctuation symbols in the grammar.
Brian Leibig's grammar
Brian Leibig's grammar (last updated in 2017) uses compound tokens throughout.
It's the only grammar of that sort I've found that tries to deal with all cases where compound tokens need to be accepted as if they were multiple fine-grained ones.
wg-grammar
The wg-grammar grammar from 2019 appears to have used compound tokens for both punctuation and lifetimes.
I don't think that project got as far as dealing with the resulting complications. rust-lang/wg-grammar#47 has a little discussion.
rust-lang/wg-grammar#3 includes extensive discussion on how the lexer might be modelled.
One result of that discussion was https://github.com/CAD97/rust-lexical-spec, which would have output fine-grained punctuation tokens and explicit whitespace tokens.