
Some minor bugs in the Antlr grammar. #50776

Closed
modulovalue opened this issue Dec 19, 2022 · 9 comments
Labels
area-language Dart language related items (some items might be better tracked at github.com/dart-lang/language). language-spec-parser

Comments

@modulovalue (Contributor)

I think these are some minor bugs in the Antlr grammar:


| '>' '>' '>' '='
| '>' '>' '='

  • '>' '>' '>' '=' should be a single literal, i.e. '>>>=', because
void main() {
  a > > > = 2;
}

is not valid Dart code.

  • '>' '>' '=' should be a single literal, i.e. '>>=', because
void main() {
  a > > = 2;
}

is not valid Dart code.


| '>' '>' '>'
| '>' '>'

  • '>' '>' '>' should be a single literal, i.e. '>>>', because
void main() {
  2 > > > 2;
}

is not valid Dart code.

  • '>' '>' should be a single literal, i.e. '>>', because
void main() {
  2 > > 2;
}

is not valid Dart code.


: '>' '='

  • '>' '=' should be a single literal, i.e. '>=', because
void main() {
  #> =;
}

is not valid Dart code.

| '[' ']'
| '[' ']' '='

  • '[' ']' should be a single literal, i.e. '[]', because
void main() {
  #[ ];
}

is not valid Dart code.

  • '[' ']' '=' should be a single literal, i.e. '[]=', because
void main() {
  #[ ] =;
}

is not valid Dart code.


: partHeader topLevelDefinition* EOF

The topLevelDefinition in the partDeclaration should allow leading metadata, i.e.

partDeclaration
    :    partHeader (metadata topLevelDefinition)* EOF
    ;

e.g. because

part of lib;
@override
int? i;

is valid Dart code.

@lrhn lrhn added area-language Dart language related items (some items might be better tracked at github.com/dart-lang/language). language-spec-parser labels Dec 19, 2022
copybara-service bot pushed a commit that referenced this issue Dec 20, 2022
Bug: #50776
Change-Id: I3c81a675ca883d39e617d7eb5c71cce9c7b03562
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/276640
Reviewed-by: William Hesse <whesse@google.com>
Commit-Queue: Erik Ernst <eernst@google.com>
@eernstg (Member)

eernstg commented Dec 20, 2022

@modulovalue, thanks for your input!

The topLevelDefinition in the partDeclaration should allow leading metadata

Indeed, good catch! Fixed in https://dart-review.googlesource.com/c/sdk/+/276640.

However, many of the things you noted are actually expressed in the given way for a reason. Here's why:

'>' '>' '>' '=' should be a single literal

If we do that then we can't parse some programs. The reason is that we can have things like List<List<List<int>>>== 2 (OK, this is not very useful, but it is possible), and it would not be correct for the lexer to decide that the occurrence of >>>= is a compound assignment operator, because it is intended to be three > brackets, each ending an actual type argument list, followed by ==. The lexer can't make this decision (it operates at the level of regular languages, and bracket structure requires a context free analysis), so we have to use a grammatical structure which is capable of parsing both an actual occurrence of >>>= as in x >>>= 1; and situations where the >s are angle brackets around type arguments.

You could say that we should report an error if there is any whitespace between the >s, or between > and =, but the ANTLR parser won't do that. A compiler can do this after parsing, which makes this a regular compile-time error rather than a syntax error (though compilers don't have to indicate which kind of error they report, so users won't see the difference).

In any case, the syntax needs to be specified in a way that enables parsing, and then we can have extra rules about whitespace (intended to prevent code that "looks funny").

There are a bunch of other cases which are similar, and the same explanation applies there.
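To make the problem concrete, here is a toy sketch (a hypothetical, simplified illustration in Python, not any actual Dart tool): a maximal-munch lexer that includes '>>>=' as a single token commits too early on input like the tail of List<List<List<int>>>== 2.

```python
# Toy maximal-munch operator lexer (hypothetical, for illustration only).
# Longest match first: if '>>>=' is a token, it always beats '>'.
GREEDY_OPS = ['>>>=', '>>>', '>>=', '>>', '>=', '==', '>', '=']

def lex_ops(src):
    """Lex src using longest-match-first over GREEDY_OPS."""
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        for op in GREEDY_OPS:
            if src.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            tokens.append(src[i])  # anything else: one char per token
            i += 1
    return tokens

# The tail of 'List<List<List<int>>>== 2' is mis-lexed: the lexer emits a
# compound assignment '>>>=' plus a stray '=', where the parser needed
# three '>' brackets followed by '=='.
print(lex_ops('>>>== 2'))   # ['>>>=', '=', '2']
print(lex_ops('x >>>= 1'))  # ['x', '>>>=', '1']
```

With only '>' and '=' as tokens, the same character sequence lexes to '>' '>' '>' '=' '=', and the parser can group the tokens either way, which is exactly the flexibility the multi-token grammar rules preserve.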

The occurrence of '[' ']' in the operator rule cannot be turned into '[]', and that's for a similar reason. If you have var x = []; then the source code [] is a list literal, not an operator, and the lexer can't make that decision. We certainly won't require whitespace in every list literal, but we would also not make it an error to have it (for example, we could encounter [ /* Elements will be added later */ ]). The situation is similar for []=.

So the main points we can make here are the following:

  • The Dart grammar needs to use these multi-token constructs in order to enable parsing of structurally different constructs where the same sequence of characters can occur.
  • We currently don't specify that whitespace cannot occur in certain locations where the grammar says it can occur (e.g., #[]= cannot be written as # [ ] = according to the tools).

Cf. dart-lang/language#2735.

@modulovalue (Contributor, author)

About: >>>=, >>=, >=

because it's intended to be three > brackets ending one actual type argument list each, followed by ==

Consider the following example:

void main() {
  List<List<List<int>>>== 2;
}

DartPad seems to recognize the >>>= in List<List<List<int>>>== 2 as a single token, and therefore the code does not compile. Are you saying that that code is expected to actually compile?

@eernstg (Member)

eernstg commented Dec 20, 2022

Yes, it should compile. Cf. #50794.

@modulovalue (Contributor, author)

About: '[]'

Thank you for pointing out my error. Another reason why my proposal about changing '[' ']' to '[]' is wrong: operator signatures need to allow whitespace between '[' and ']'. I missed that. Here is an adjusted version:


and the lexer can't make that decision

Wouldn't the principle of maximal munch, i.e. always consuming the largest input possible, as required by the Dart grammar, ensure that the [] in var x = []; would be lexed as an INDEX token? The parsing phase could then be adjusted to treat INDEX as an empty list without whitespace, e.g.:

change

listLiteral
    : CONST? typeArguments? '[' elements? ']'
    ;

to

listLiteral
    : CONST? typeArguments? ('[' elements? ']' | INDEX)
    ;

INDEX
    : '[]'
    ;

and operator signatures could be adjusted to accept either a '[' token followed by a ']' token or a single '[]' token.

e.g.

change

operatorSignature
    :    type? OPERATOR operator formalParameterList
    ;

operator
    :    '~'
    |    binaryOperator
    |    '[' ']'
    ...
    ;

to

operatorSignature
    :    type? OPERATOR (operator | '[' ']') formalParameterList
    ;


operator
    :    '~'
    |    binaryOperator
    |    INDEX
    ...
    ;

Do you think that this could work or am I still missing something?

@eernstg (Member)

eernstg commented Dec 21, 2022

Wouldn't the principle of maximal munch, i.e. always consuming the largest input possible, as required by the Dart grammar, ensure that the [] in var x = []; would be lexed as an INDEX token?

We'd need to consider the lexical as well as the context free level. At the lexical level it is (I think) universally true that a lexer is expected to consume the longest possible prefix of the remaining input for each token that it produces.

If we're using a parser generator then it's typically also a matter of ordering the lexical rules correctly: If we have a rule that IDENTIFIER is [a-zA-Z_$][a-zA-Z_$0-9]* and we later have a rule that WHILE is 'while' then we'll never recognize 'while' as a keyword, because the IDENTIFIER rule kicks in first. Nevertheless, even in the case where the WHILE rule is specified first, and it matches 'while' in the source code, we must still get an IDENTIFIER from the lexer if the source code continues with more letters/digits, e.g., if we're looking at 'whileNotDone'.
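The identifier/keyword interplay can be sketched as follows (a simplified illustration, not Dart's actual lexer): the longest identifier-shaped prefix is consumed first, and only then is the whole word checked against the keyword set.

```python
import re

# Longest-match rule at the lexical level: take the longest
# identifier-shaped prefix, then classify the *entire* word.
IDENT = re.compile(r'[a-zA-Z_$][a-zA-Z_$0-9]*')
KEYWORDS = {'while'}

def next_token(src):
    """Return (kind, text) for the first token of src (identifiers only)."""
    word = IDENT.match(src).group(0)
    kind = 'KEYWORD' if word in KEYWORDS else 'IDENTIFIER'
    return kind, word

print(next_token('while (x)'))     # ('KEYWORD', 'while')
print(next_token('whileNotDone'))  # ('IDENTIFIER', 'whileNotDone')
```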

I don't think we can make a similar statement at the context free level. For example:

<Start> ::= <N> 'a';
<N> ::= 'a' <N> | 'a';

If we're parsing 'aaa' according to this grammar, we'll derive <Start> to <N> 'a' and then <N> to 'a' <N>, then <N> to 'a', yielding 'a' 'a' 'a'. This means that at the point where we are deriving <N> to 'a' <N>, a parser would consume the 'a' on the input, leaving 'aa' as the rest of the input. However, at that point we will not derive <N> to 'a' <N> one more time to get 'a' 'a', because that would consume the entire input (and we're still waiting for one more 'a' according to the derivation from <Start>).

In other words, we certainly can't say that the derivations from any particular non-terminal during a parsing process will consume the longest possible part of the remaining input.
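The non-greedy behavior of <N> can be demonstrated with a tiny backtracking parser for this toy grammar (a sketch, not production code):

```python
# Grammar:  <Start> ::= <N> 'a'      <N> ::= 'a' <N> | 'a'
# On 'aaa', <N> must stop after two 'a's, even though it *could* match
# all three; a parser that always lets <N> consume maximally would fail.

def parse_n(s, i):
    """Yield every input position reachable after deriving <N> at s[i:]."""
    if i < len(s) and s[i] == 'a':
        yield from parse_n(s, i + 1)  # alternative: 'a' <N>
        yield i + 1                   # alternative: just 'a'

def parse_start(s):
    """Accept s iff it derives from <Start>, trying all <N> derivations."""
    return any(j + 1 == len(s) and s[j] == 'a' for j in parse_n(s, 0))

print(parse_start('aaa'))  # True, but only via the non-greedy choice for <N>
print(parse_start('a'))    # False: 'a' alone cannot satisfy <N> 'a'
```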

If we're using a table driven LR parser (like the one generated by a very old tool like YACC), the information which is used to decide whether we'd consume ('shift') the next token on the input or not is in the table and in the state of the parser (which is computed from the current parser stack using another part of the table), and that table is computed based on global properties of the grammar. In other words, we need to consider every single rule of the entire grammar in order to decide whether we'll be greedy or not in any given concrete situation.

I think we'll have to say that 'maximal munch' just doesn't make sense at the context free level.

The parsing phase could be adjusted to treat INDEX as an empty list without whitespace?

A certain amount of adjustments like that might be possible, but I'm pretty sure the complexity will explode. For instance, if '>>>=' is unconditionally turned into a COMPOUND_ASSIGNMENT_OPERATOR with the value >>>= then we'd need to remember to fix it up in the parser in situations like 'List<List<List<int>>>==2'. I'm afraid we might well forget some situations; also note that this is outside the realm of context free languages, which means that we'd have to use ad-hoc reasoning in order to make sure that we're making the correct adjustments.

Do you think that this could work

I think it might work for INDEX, but I don't think it's very likely to scale up as a general approach. In particular, the example around >>>= shows that the parser might need to restructure the COMPOUND_ASSIGNMENT_OPERATOR into 3 > tokens and then reorganize = followed by another = into a single == token.

I think it's much simpler and safer to use a grammar which will allow for the actual structure to be detected as soon as possible. E.g., if we are parsing the compound assignment operator '>>>=' as a context free construct containing the four tokens '>' '>' '>' '=' then we're immediately and automatically also able to consume those tokens as part of some other syntactic construct (e.g., List<List<int>>, >=, List<Never>).

It is not difficult to enforce additional rules about whitespace after parsing, as a regular compile-time error. For instance, we could require that the compound assignment operator occurs as '>>>=' in the source code, with no spaces between the punctuation characters. Another example: I'm pretty sure we'd want to say that symbols like #[]= must be written without spaces, especially because the implementations are already enforcing that rule (so it's not a breaking change).

@modulovalue (Contributor, author)

modulovalue commented Dec 21, 2022

We'd need to consider the lexical as well as the context free level.

So that we are on the same page, can you elaborate why you are referring separately to a lexical and a context free level?
Isn't the full lexical grammar of Dart already non-regular? Consider, e.g., the MULTI_LINE_COMMENT rule in the current ANTLR-based specification:

MULTI_LINE_COMMENT
    :    '/*' (MULTI_LINE_COMMENT | .)*? '*/'
         { skip(); }
    ;

Multi line comments nest, so the full lexical grammar can't be regular. To lex multi line comments (and strings with interpolations) we need a stack. Once we have a stack, we have a PDA and can therefore recognize context free languages. Therefore, any lexer for the full lexical grammar of Dart needs to be able to recognize context free languages? (This isn't entirely correct, I think a DPDA is enough, but that's probably not relevant for this comment)

@eernstg (Member)

eernstg commented Dec 21, 2022

referring separately to a lexical and a context free level

It is true that ANTLR allows for "lexical" rules to go beyond regular languages, but historically, and in many other parser generator tools, and, crucially, in the lexer that Dart tools use, lexer rules are regular. So we don't want to specify the language in a way that only works with ANTLR, and hence we should only use the regular part of the expressive power of ANTLR in Dart.g.

Next, even ANTLR makes a distinction between the lexical level and the context free level: The former will recognize sequences of characters (nothing ignored in each token, but ignoring whitespace between tokens). The context free level never gets to see the whitespace, so we can't make context free rules depend on whitespace (unless we use special ANTLR specific tricks).

So it does matter for the treatment of whitespace whether a given rule is lexical or context free.

the full lexical grammar of Dart [is] already not regular

Good catch! ;-)

That's true, the rule about MULTI_LINE_COMMENT does indeed use the context free power of ANTLR lexing.

This is a particularly simple case, however: It is sufficient for a lexer to count the number of /* and the number of */, and recognize the comment as complete when that count reaches 0. This goes outside regular languages, but it doesn't need anything like the full power of context free parsing. So it can be done by running a special multi-line comment recognizer at the point where the first /* is encountered.
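A minimal version of such a recognizer might look like this (a hypothetical sketch, not the actual tool code); it is handed the input just past an opening /* and only counts delimiters from there:

```python
def skip_multiline_comment(src, i):
    """i points just past an opening '/*'; return the index just past the
    matching '*/', counting nested comment delimiters along the way."""
    depth = 1
    while i < len(src):
        if src.startswith('/*', i):
            depth += 1
            i += 2
        elif src.startswith('*/', i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError('unterminated multi-line comment')

text = 'x /* a /* nested */ b */ y'
end = skip_multiline_comment(text, 4)  # index 4 is just past the first '/*'
print(text[end:])  # ' y'
```

The normal lexer resumes at the returned index, so occurrences of /* and */ elsewhere in the input never reach this counter.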

We have a similar issue at the next level, by the way: The rule about multiLineString is not context free, it needs more power than that. Luckily, it just needs a tiny amount of extra power (we just need a stack), such that we can remember for a multi-line string that started with """ that it must also end with """, and similarly for '''.

However, the fact that Dart lexical analysis isn't quite covered by regular languages doesn't change the fact that we need to consider >>>= to be >, >, >, = in order to avoid excessive complexity in the handling of these tokens by the parser.

@modulovalue (Contributor, author)

modulovalue commented Dec 22, 2022

This is a particularly simple case, however: It is sufficient for a lexer to count the number of /* and the number of */, and recognize the comment as complete when that count reaches 0.

Have you considered e.g.:

  const a = """ /* /* ${ /* } */ 0 } */ """;

If we just count the number of /* and */ then we will not be able to correctly recognize that there's just one comment. So we need to recognize strings to recognize comments. So I think the following statement:

This goes outside regular languages, but it doesn't need anything like the full power of context free parsing

can't be true because of:

The rule about multiLineString is not context free, it needs more power than that.

But that's a different issue and I agree that it's not related to this issue. (I have a custom lexer for Dart that seems to be correct. I handle nested comments by pushing an extra state onto the lexer's stack that I use to switch to a different DFA for parsing multi-line comment content.)
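A mode stack of that kind might be sketched like this (hypothetical, not the commenter's actual lexer): the lexer pushes a comment mode on /* and pops it on */, so nesting falls out of the stack depth.

```python
def lex(src):
    """Return the non-comment characters of src, tracking nested
    comments with a stack of lexer modes."""
    tokens, modes, i = [], ['code'], 0
    while i < len(src):
        if src.startswith('/*', i):          # enter (or nest) a comment
            modes.append('comment')
            i += 2
        elif modes[-1] == 'comment' and src.startswith('*/', i):
            modes.pop()                      # leave one nesting level
            i += 2
        elif modes[-1] == 'comment':
            i += 1                           # skip comment content
        else:
            tokens.append(src[i])            # ordinary code character
            i += 1
    return tokens

print(''.join(lex('a/* x /* y */ z */b')))  # 'ab'
```

In a real lexer each mode would select a different DFA rather than a hard-coded branch, but the stack discipline is the same.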


I agree with most of what you said. But there are some parts that I think I disagree with. However, I'm not absolutely sure that I'm right without code to support my opinion. I'm in the process of building an incremental autogenerated parser for Dart and I have some ideas for how to gracefully handle these issues. I will have to see how that works out.

I'm glad that this issue uncovered something. It would be fine with me if we closed it in its current state.

@eernstg (Member)

eernstg commented Dec 22, 2022

const a = """ /* /* ${ /* } */ 0 } */ """;

The counting of /* and */ would be done by the specialized comment recognizer, which means that the first two occurrences won't be counted: The comment recognizer receives the input starting from the third occurrence of /*, and it consumes /* } */; subsequent input is consumed by the normal lexer. For the comment recognizer, it is sufficient that it counts occurrences of /* and */, and it doesn't matter that there are extra occurrences anywhere else. So there is no need for the comment recognizer to be a lot smarter.

I handle nested comments by pushing an extra state onto the lexer's stack that I use to switch to a different DFA for parsing multi-line comment content.

Aha, it sounds like that could be equivalent to the recognizer that I mentioned.

I'm glad that this issue uncovered something. It would be fine with me If we closed it in its current state.

It certainly revealed that the treatment of whitespace in the tools isn't identical to the treatment in the specification documents and in Dart.g. It's very useful to discover that sort of thing, so thanks for the detailed and useful input!

I'll close the issue, and then we'll just create new issues if and when something else comes up. ;-)

@eernstg eernstg closed this as completed Dec 22, 2022
modulovalue added a commit to modulovalue/sdk that referenced this issue Dec 4, 2023
This change introduces actions that, until implemented, serve as documentation for all the locations where whitespace is considered significant. 

See dart-lang#50776 for context around significant operator whitespace.

See https://github.com/dart-lang/language/blob/main/accepted/3.0/records/feature-specification.md#ambiguity-with-metadata-annotations for context around significant metadata whitespace. 

It seems to me that there are two ambiguities related to metadata and records. One can be observed when whitespace is present in front of the arguments and a record follows that metadatum, and one when there's no whitespace. We need to disambiguate both.

We disambiguate the case where whitespace is present by requiring the arguments to be preceded by no whitespace.

We disambiguate the case where whitespace is not present by requiring the other metadata productions to have trailing whitespace.