Description
Out of curiosity I have implemented an alternative parser for cppfront / cpp2, which uses a PEG grammar as input for a parser generator. During that experiment, I noticed that the grammar rules embedded as //G comments are not always correct. I will list the errors that I noticed below.
One preliminary note: The cppfront compiler has a rather relaxed concept of keywords. In most cases it will accept a keyword where an identifier is expected; for example, it will happily compile if: () -> void = { }. I don't think that is a good idea, so my grammar explicitly distinguishes between keywords and identifiers (modulo the few context-specific soft keywords like in/out etc.). For some grammar rules that requires changes where the parser previously worked by accident (i.e., by not recognizing a certain keyword).
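To illustrate the distinction, here is a minimal sketch of a keyword check that keeps soft keywords usable as identifiers. The function names and the keyword subset are hypothetical and chosen for illustration only; they are not cppfront's actual keyword table.

```cpp
#include <string_view>
#include <unordered_set>

// Illustrative subset only; NOT cppfront's actual keyword list.
inline bool is_hard_keyword(std::string_view word) {
    static const std::unordered_set<std::string_view> keywords = {
        "if", "else", "for", "while", "return", "inspect",
        "nullptr", "true", "false", "new", "typeid", "int", "void",
    };
    return keywords.count(word) != 0;
}

// Context-specific soft keywords: only special in certain positions.
inline bool is_soft_keyword(std::string_view word) {
    return word == "in" || word == "out";
}

// Assumes `word` already matched the lexical shape of an identifier.
// Hard keywords are rejected; soft keywords are still accepted.
inline bool is_identifier(std::string_view word) {
    return !word.empty() && !is_hard_keyword(word);
}
```

With such a check, `if` is rejected as a declaration name, while `in` and `out` remain valid identifiers outside the contexts where they act as keywords.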
a) id_expression
//G id-expression:
//G     unqualified-id
//G     qualified-id
//G
Here the order is wrong; it should be
//G id-expression:
//G     qualified-id
//G     unqualified-id
//G
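One reason the order matters, assuming PEG-style ordered choice where the first matching alternative wins: if unqualified-id is tried first, it succeeds on the leading name and the qualified form is never attempted. The following is a minimal sketch with hypothetical helper names and a simplified qualified-id; each function returns the number of characters matched, or 0 on failure.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

inline bool is_ident_start(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
}
inline bool is_ident_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// unqualified-id, simplified to a plain identifier.
inline std::size_t match_unqualified_id(std::string_view s) {
    if (s.empty() || !is_ident_start(s[0])) return 0;
    std::size_t n = 1;
    while (n < s.size() && is_ident_char(s[n])) ++n;
    return n;
}

// qualified-id, simplified to: identifier ('::' identifier)+
inline std::size_t match_qualified_id(std::string_view s) {
    std::size_t n = match_unqualified_id(s);
    if (n == 0) return 0;
    std::size_t parts = 0;
    while (s.substr(n, 2) == "::") {
        std::size_t m = match_unqualified_id(s.substr(n + 2));
        if (m == 0) break;
        n += 2 + m;
        ++parts;
    }
    return parts > 0 ? n : 0;
}

// Ordered choice in the order given by the current //G comment.
inline std::size_t id_expression_unqualified_first(std::string_view s) {
    if (std::size_t n = match_unqualified_id(s)) return n;
    return match_qualified_id(s);
}

// Ordered choice in the suggested order.
inline std::size_t id_expression_qualified_first(std::string_view s) {
    if (std::size_t n = match_qualified_id(s)) return n;
    return match_unqualified_id(s);
}
```

For the input `std::vector`, the first variant stops after `std`, while the second consumes the whole name.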
b) primary_expression
//G primary-expression:
//G     literal
//G     ( expression-list )
//G     id-expression
//G     unnamed-declaration
//G     inspect-expression
//G
This does not correspond to the order in the source code. Furthermore, the expression-list is optional. And if we distinguish keywords from identifiers, we potentially need some extra rules to handle keywords that are currently silently consumed as identifiers. I would suggest
//G primary-expression:
//G     inspect-expression
//G     id-expression
//G     literal
//G     '(' expression-list? ')'
//G     unnamed-declaration
//G     'nullptr'
//G     'true'
//G     'false'
//G     'typeid' '(' expression ')'
//G     'new' '<' id-expression '>' '(' expression-list? ')'
c) nested-name-specifier
//G nested-name-specifier:
//G     ::
//G     unqualified-id ::
This has to support nested scopes. I would suggest
//G nested-name-specifier:
//G     :: (unqualified-id ::)*
//G     (unqualified-id ::)+
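As a rough illustration of the suggested rule, the sketch below matches a nested-name-specifier with any number of scopes. It is a hand-written recognizer with hypothetical names, and it simplifies unqualified-id to a plain identifier; it returns the number of characters matched, or 0 on failure.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// unqualified-id, simplified to a plain identifier.
inline std::size_t match_identifier(std::string_view s) {
    auto is_start = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
    };
    auto is_rest = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    if (s.empty() || !is_start(s[0])) return 0;
    std::size_t n = 1;
    while (n < s.size() && is_rest(s[n])) ++n;
    return n;
}

// nested-name-specifier:
//     '::' (unqualified-id '::')*
//     (unqualified-id '::')+
inline std::size_t match_nested_name_specifier(std::string_view s) {
    std::size_t n = 0;
    const bool leading = s.substr(0, 2) == "::";
    if (leading) n = 2;
    std::size_t scopes = 0;
    for (;;) {
        // Only consume a name if it is followed by '::'; the final
        // name belongs to the unqualified-id after the specifier.
        std::size_t m = match_identifier(s.substr(n));
        if (m == 0 || s.substr(n + m, 2) != "::") break;
        n += m + 2;
        ++scopes;
    }
    return (leading || scopes > 0) ? n : 0;
}
```

For `a::b::c` this matches `a::b::` and leaves `c` for the following unqualified-id, which the original two-alternative rule cannot express.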
d) template-argument
//G template-argument:
//G     expression
//G     id-expression
There should be a comment here that we disable '<'/'>'/'<<'/'>>' in the expressions until a new parenthesis is opened. In fact, that causes some of the expression rules to be cloned until we reach the level below these operators. (In my implementation these are the rules with the suffix _no_cmp.)
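The effect can be sketched with a tiny hand-written recognizer. Note that this is not the cloned-rule approach itself: a parser generator needs separate _no_cmp rules, whereas the sketch below substitutes a boolean flag, which plays the same role in recursive descent. The toy expression grammar (names, numbers, '+', '-', '<', '>') and all identifiers are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

struct Parser {
    std::string_view src;
    std::size_t pos = 0;

    void skip_ws() { while (pos < src.size() && src[pos] == ' ') ++pos; }
    bool peek(char c) { skip_ws(); return pos < src.size() && src[pos] == c; }
    bool eat(char c) { if (peek(c)) { ++pos; return true; } return false; }

    // primary: name-or-number | '(' expression ')'
    bool primary() {
        // Opening a parenthesis re-enables '<' and '>' as operators.
        if (eat('(')) return expression(true) && eat(')');
        skip_ws();
        std::size_t start = pos;
        while (pos < src.size() &&
               (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_'))
            ++pos;
        return pos > start;
    }

    // additive: primary (('+' | '-') primary)*
    bool additive() {
        if (!primary()) return false;
        while (peek('+') || peek('-')) { ++pos; if (!primary()) return false; }
        return true;
    }

    // expression: additive (('<' | '>') additive)*   only if allow_cmp
    // With allow_cmp == false this behaves like an "expression_no_cmp" rule.
    bool expression(bool allow_cmp) {
        if (!additive()) return false;
        while (allow_cmp && (peek('<') || peek('>'))) {
            ++pos;
            if (!additive()) return false;
        }
        return true;
    }

    // '<' template-argument (',' template-argument)* '>'
    bool template_argument_list() {
        if (!eat('<')) return false;
        do { if (!expression(false)) return false; } while (eat(','));
        return eat('>');
    }
};

// True if the whole text is one template argument list.
inline bool parses_as_template_args(std::string_view text) {
    Parser p{text};
    if (!p.template_argument_list()) return false;
    p.skip_ws();
    return p.pos == p.src.size();
}
```

Here `<a + 1, (b > c)>` is accepted as a complete argument list, while in `<a > b>` the first '>' closes the list and the remaining text is left unparsed.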
e) id-expression from fundamental types
We want to accept builtin types like int as type ids. Currently this works by accident because the parser does not even recognize these as keywords. When enforcing that keywords are not identifiers, we need rules for these, too. I have added a fundamental-type alternative at the end of id-expression, and have defined it as follows:
fundamental-type:
    'void'
    fundamental-type-modifier-list? 'char'
    'char8_t'
    'char16_t'
    'char32_t'
    'wchar_t'
    fundamental-type-modifier-list? 'int'
    'bool'
    'float'
    'double'
    'long' 'double'
    fundamental-type-modifier-list

fundamental-type-modifier-list:
    fundamental-type-modifier+

fundamental-type-modifier:
    'unsigned'
    'signed'
    'long'
    'short'
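To show how these rules behave, here is a minimal sketch of a matcher over an already tokenized input. The token representation and function names are hypothetical. Like the grammar above, it does not reject semantically invalid combinations such as `short long char`; that would be left to a later check.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

using Tokens = std::vector<std::string_view>;

// fundamental-type-modifier
inline bool is_modifier(std::string_view t) {
    return t == "unsigned" || t == "signed" || t == "long" || t == "short";
}

// fundamental-type-modifier-list: fundamental-type-modifier+
// Returns the number of modifier tokens starting at index i.
inline std::size_t match_modifier_list(const Tokens& t, std::size_t i) {
    std::size_t n = 0;
    while (i + n < t.size() && is_modifier(t[i + n])) ++n;
    return n;
}

// fundamental-type: returns the number of tokens consumed, 0 on failure.
inline std::size_t match_fundamental_type(const Tokens& t, std::size_t i = 0) {
    if (i >= t.size()) return 0;

    // Alternatives that consist of a single keyword.
    static constexpr std::string_view plain[] = {
        "void", "char8_t", "char16_t", "char32_t",
        "wchar_t", "bool", "float", "double",
    };
    for (std::string_view p : plain)
        if (t[i] == p) return 1;

    const std::size_t mods = match_modifier_list(t, i);
    if (i + mods < t.size()) {
        const std::string_view next = t[i + mods];
        // modifier-list? 'char'  and  modifier-list? 'int'
        if (next == "char" || next == "int") return mods + 1;
        // 'long' 'double'
        if (mods == 1 && t[i] == "long" && next == "double") return 2;
    }
    // A bare modifier list such as "unsigned" or "long long";
    // this is 0 when no modifier was present, i.e. no match.
    return mods;
}
```

For example, the token sequence `unsigned long int` consumes three tokens, `long double` and `long long` consume two each, and an ordinary identifier such as `foo` does not match.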