Antlr grammar for C/C++ Preprocessor.

I made a Lexer/Parser/Application of C-preprocessor with Antlr and C#. I like to share my work with anybody who may have interest in doing it by himself or herself.

Of course, there will be many or better ways to do the same thing than the way I did. Here is a summary of how I did.

Lexer A and Parser A take a C/C++ source text string and parse macro keywords such as #include, #define, #if, #ifdef, #ifndef and so on.
Application A stores #define keywords parsed from Parser A; there are two types of macro defines: one that takes arguments and another that doesn't take arguments.
Whenever #if or #elif are encountered, the macro keywords in the condition statement are resolved with the stored define macros. This process is done by Lexer A, Parser B; Lexer A is reused. This step is repeated until there is nothing to be replaced. Once all of macro keywords are resolved, arithmetic calculation is performed with Lexer B, Parser C and Application B.

Let me show you an example of #define and #if statements.

    line1: #define VariableType 4
    line2: #define FunctionType( a, b )     a - VariableType * b
    line3:
    line4: #if FunctionType( (2*3), 2 ) + 1 * 2
    line5: #error "This should not be processed"
    line6: #else
    line7: #undef VariableType
    line8: #endif

In the line1, a new macro keyword, "VariableType", will be stored by ApplicationA.

In the line2, a new macro keyword, "FunctionType", will be stored by input argument information.

In the line4, the macro, "FunctionType( (23), 2 )", is replaced by the definition and it will result in a new string, "(23) - VariableType * 2". The replacing step must be done after proper parser and that's why I needed a second Parser B for the macro resolving. For example, without Parser B, simple string replacement will cause errors in cases like "FunctionType( FunctionType( 1, 3 ), 3 )".

Since a macro definition can include another macro define keywords, one time macro resolving is not enough so the step had to be repeated until there is no macro keyword left.

Once all of macro keywords are resolved, the calculation must be performed based on operator precedence. That's where I needed 3rd Parser C and second Lexer B. For the example, the parser will parse a statement, "(2*3) - 4 * 2 + 1 * 2". This is a relatively simple calculator, which is Application B.

Based on the calculation result, one of statement blocks, which is line 5 or line 7, are ignored and another statement is processed. In the example, the calculation result is 0 and line7 will be processed.

I can share the Antlr grammars here but application source codes are too long to be posted; I wrote C# applications.

Parser A, which I call CPreprocessor, is here:


    grammar CPreprocessor;
    import CPreprocessorLexerRule;
    // parser rules must start with lower case.

    @parser::members
    {
        protected const int EOF = Eof;
    }

    program        :    statement*    EOF    ;

    statement    :    preprocess
                |    ( COMMENT_BLOCK | COMMENT_LINE )
                |    STRING                        // ignore
                |    ( ppChars | NEWLINE )        // ignore
                ;

    // pp means preprocess
    preprocess    :    ppInclude
                |    ppDefineFunc    // this must be before ppDefineVar: eg. #define AAA  ( a ) 
                |    ppDefineVar
                |    ppUndef
                |    ppPragma
                |    ppError
                |    ppIfStatement
                ;

    ppChars        :    ( STRING | WS | PREPROCESS_BEGIN | DEFINED | ID | COMMA | PAREN_OPEN | PAREN_CLOSE | CHAR )+    ;

    ppInclude    :    PREPROCESS_BEGIN    WS*    PREPROCESS_INCLUDE    WS*    ppChars    WS*    ( NEWLINE | EOF )    ;
    ppPragma    :    PREPROCESS_BEGIN    WS*    PREPROCESS_PRAGMA    WS*    ppChars    WS*    ( NEWLINE | EOF )    ;
    ppError        :    PREPROCESS_BEGIN    WS*    PREPROCESS_ERROR    WS*    ppChars    WS*    ( NEWLINE | EOF )    ;

    ppUndef            :    PREPROCESS_BEGIN    WS*    PREPROCESS_UNDEF    WS+    ID                            WS*    ( NEWLINE | EOF )    ;
    ppDefineVar        :    PREPROCESS_BEGIN    WS*    PREPROCESS_DEFINE    WS+    ID        ( WS+    ppChars )?    WS*    ( NEWLINE | EOF )    ;
    ppDefineFunc    :    PREPROCESS_BEGIN    WS*    PREPROCESS_DEFINE    WS+    ppdfId    WS*    ppdfChars        WS*    ( NEWLINE | EOF )    ;

    // ppdfi means preprocess define function ID
    ppdfId            :    ID    PAREN_OPEN    WS*    ppdfiArguments    WS*    PAREN_CLOSE    ;    // WS between ID and '(' is not allowed.
    ppdfiArguments    :    ppdfiArgument    ( WS*    COMMA    WS*    ppdfiArgument    )*    ;    // at least 1 argument is required.
    ppdfiArgument    :    ID    ;

    // ppdfc means preprocess define function characters
    ppdfChars    :    ( ppdfcId | ppdfcNotId )+    ;
    ppdfcId        :    ID    ;
    ppdfcNotId    :    ( STRING | WS | PREPROCESS_BEGIN | DEFINED | COMMA | PAREN_OPEN | PAREN_CLOSE | CHAR )    ;


    ppIfStatement    :    ( ppisIF | ppisIfDef | ppisIfNdef )    ;

    ppisIF            :    PREPROCESS_BEGIN    WS*    PREPROCESS_IF        WS+    ppChars    WS*    NEWLINE            ppisStatement    ppisElifElseEndif    ;
    ppisElif        :    PREPROCESS_BEGIN    WS*    PREPROCESS_ELIF        WS+    ppChars    WS*    NEWLINE            ppisStatement    ppisElifElseEndif    ;
    ppisElse        :    PREPROCESS_BEGIN    WS*    PREPROCESS_ELSE                    WS*    NEWLINE            ppisStatement    ppisEndif    ;
    ppisEndif        :    PREPROCESS_BEGIN    WS*    PREPROCESS_ENDIF                WS*    (NEWLINE|EOF)    ;
    ppisIfDef        :    PREPROCESS_BEGIN    WS*    PREPROCESS_IFDEF    WS+    ID        WS*    NEWLINE            ppisStatement    ppisElseEndif    ;
    ppisIfNdef        :    PREPROCESS_BEGIN    WS*    PREPROCESS_IFNDEF    WS+    ID        WS*    NEWLINE            ppisStatement    ppisElseEndif    ;

    ppisStatement        :    statement*    ;
    ppisElifElseEndif    :    ppisElif
                        |    ppisElseEndif
                        ;
    ppisElseEndif        :    ppisElse
                        |    ppisEndif
                        ;

Lexer A, which I call "CPreprocessorLexerRule", is here:

    lexer grammar CPreprocessorLexerRule;
    // lexer rules must start with upper case.

    @lexer::members {
        public const int CHANNEL_COMMENT_BLOCK = 1;
        public const int CHANNEL_COMMENT_LINE = 2;
        protected const int EOF = Eof;
        protected const int HIDDEN = Hidden;
    }

    //=======================================================================
    COMMENT_BLOCK        :    '/*'    .*?        '*/'            -> channel(CHANNEL_COMMENT_BLOCK) ;
    COMMENT_LINE        :    '//'    NOT_NL*                    -> channel(CHANNEL_COMMENT_LINE) ;
    NL_ESC                :    '\\'    NL                        -> skip ;
    STRING                :    '"'    (STRING_ESC|.)*?    '"'        ;
    WS                    :    ' '    |    '\t'                    ;    // preprocessor cannot "skip" white space, because wb matters in several cases.
    NEWLINE                :    NL                                ;    // preprocessor cannot "skip" new line, because wb matters in several cases.
    PREPROCESS_BEGIN    :    '#'                                ;

    //=======================================================================
    // nmode doesn't seem to work for C#
    // mode preprocessMode;
    //=======================================================================
    PREPROCESS_INCLUDE    :    'include'    ;
    PREPROCESS_DEFINE    :    'define'    ;
    PREPROCESS_UNDEF    :    'undef'        ;
    PREPROCESS_IFDEF    :    'ifdef'        ;    // this must appear before 'if'
    PREPROCESS_IFNDEF    :    'ifndef'    ;    // this must appear before 'if'
    PREPROCESS_IF        :    'if'        ;
    PREPROCESS_ELSE        :    'else'        ;
    PREPROCESS_ELIF        :    'elif'        ;
    PREPROCESS_ENDIF    :    'endif'        ;
    PREPROCESS_PRAGMA    :    'pragma'    ;
    PREPROCESS_ERROR    :    'error'        ;

    //=======================================================================
    // These must be below reserved keywords
    //=======================================================================
    DEFINED        :    'defined'    ;
    ID            :    ID_LETTER    (ID_LETTER|DIGIT)*    ;
    COMMA        :    ','                                ;
    PAREN_OPEN    :    '('                                ;
    PAREN_CLOSE    :    ')'                                ;
    CHAR        :    '\''    .    '\''
                |    .                                // the rest of all
                ;

    //=======================================================================
    // fragment can be referenced only by lexer.
    // fragment itself is not lexer rule so that it cannot be used by parser.
    //=======================================================================
    fragment ID_LETTER    :    [a-zA-Z_]            ;
    fragment DIGIT        :    [0-9]                ;
    fragment STRING_ESC    :    '\\'    [btnr0"\\]    ; // \b, \t, \n etc...
    fragment NL            :    '\r'?    '\n'        ;
    fragment NOT_NL        :    ~[\r\n]                ;

Parser B, which I call "CPMacroResolve", is here:

    grammar CPMacroResolve;
    import CPreprocessorLexerRule;

    // Asttumptions for the input statement:
    // 1. No block/line comment exists.
    // 2. No '#' character exists.
    // 3. No preprocessor keywords exists except "defined"

    statement    :    sToken+    ;

    sToken        :    stDefined
                |    stFunctionCall
                |    stVariable
                |    stOther
                ;

    stDefined    :    DEFINED    WS+    stdId    ;
    stdId        :    PAREN_OPEN    WS*    stdId    WS*    PAREN_CLOSE
                |    stdiId
                ;
    stdiId        :    ID    ;

    stVariable    :    ID    ;

    stFunctionCall    :    ID    WS*    PAREN_OPEN    stfcArguments    PAREN_CLOSE    ;
    stfcArguments    :    stfcaArgument    ( COMMA    stfcaArgument )*    ;
    stfcaArgument    :    ( PAREN_OPEN    stfcaArgument    PAREN_CLOSE | STRING | WS | ID | CHAR )*    ;

    stOther        :    ( STRING | WS | COMMA | PAREN_OPEN | PAREN_CLOSE | CHAR )    ;

Parser C, which I call "CPCondition", is here:

    grammar CPCondition;
    import CPConditionLexerRule;

    // Assumptions:
    // 1. Lexer will remove white spaces.

    condition    : expression ;

    // Indirect left-recursion for binary/ternary expressions is solved by "direct" left-recursion
    expression    :    PAREN_OPEN    expression    PAREN_CLOSE                                                    # ceParen
                |    unaryExpression                                                                        # ceUnary
                |    expression    op=(AOP_MUL|AOP_DIV|AOP_MOD)        expression                            # ceMulDivMod
                |    expression    op=(AOP_ADD|AOP_SUB)                expression                            # ceAddSub
                |    expression    op=(BOP_SHL|BOP_SHR)                expression                            # ceShlShr
                |    expression    op=(CMP_LE|CMP_LT|CMP_GE|CMP_GT)    expression                            # ceCmpLeLtGeGt
                |    expression    op=(CMP_EQ|CMP_NE)                    expression                            # ceCmpEqNe
                |    expression    BOP_AND                                expression                            # ceBitAnd
                |    expression    BOP_XOR                                expression                            # ceBitXor
                |    expression    BOP_OR                                expression                            # ceBitOr
                |    expression    LOP_AND                                expression                            # ceLogicAnd
                |    expression    LOP_OR                                expression                            # ceLogicOr
                |    expression    TER_IF                                expression    TER_ELS    expression        # ceTerIf
                |    value                                                                                # ceValue
                ;

    unaryExpression    :    op=(AOP_ADD|AOP_SUB|BOP_NOT|LOP_NOT)    expression    ;

    value    :    val=FLOAT
            |    val=INT
            |    val=HEX
            |    val=OCT
            |    val=TRUE
            |    val=FALSE
            |    val=ID    // error
            ;

Lexer B, which I call "CPConditionLexerRule", is here:

    lexer grammar CPConditionLexerRule;

    // Logic operator
    LOP_AND    :    '&&'    ;
    LOP_OR    :    '||'    ;
    LOP_NOT    :    '!'    ;

    // Bit operator
    BOP_NOT    :    '~'    ;
    BOP_XOR    :    '^'    ;
    BOP_AND    :    '&'    ;
    BOP_OR    :    '|'    ;
    BOP_SHL    :    '<<'    ;
    BOP_SHR    :    '>>'    ;

    // Arithmatic operator
    AOP_ADD    :    '+'    ;
    AOP_SUB    :    '-'    ;
    AOP_MUL    :    '*'    ;
    AOP_DIV    :    '/'    ;
    AOP_MOD    :    '%'    ;

    // Comparison operator
    CMP_EQ    :    '=='    ;
    CMP_NE    :    '!='    ;
    CMP_LE    :    '<='    ;
    CMP_LT    :    '<'    ;
    CMP_GE    :    '>='    ;
    CMP_GT    :    '>'    ;

    // Ternary conditional
    TER_IF    :    '?'    ;
    TER_ELS    :    ':'    ;

    // Parentheses
    PAREN_OPEN    :    '('    ;
    PAREN_CLOSE    :    ')'    ;

    // Reserved keyword
    DEFINED    :    'defined'    ;
    TRUE    :    'true'    ;
    FALSE    :    'false'    ;

    // typed values
    FLOAT    :    DIGIT+    '.'    DIGIT*    'f'?            // match 1. 39. 3.14159 etc...
            |            '.'    DIGIT+    'f'?            // match .1 .14159
            ;
    INT        :    [1-9]    DIGIT*
            |    '0'
            ;
    HEX        :    '0x'    DIGIT+    ;    // 0x000
    OCT        :    '0'        DIGIT+    ;    // 0123
    ID        :    ID_LETTER    (ID_LETTER|DIGIT)*    ;

    // skip
    NL_ESC    :    '\\'    '\r'?    '\n'    -> skip ;
    WS        :    [ \t]+                    -> skip ;


    // fragments
    fragment ID_LETTER    :    [a-zA-Z_]    ;
    fragment DIGIT        :    [0-9]    ;

I am hoping that this information can be helpful for anybody who is planning to make their own version of C preprocessor or any kind of preprocessors.

I read one IBM guy suggested that all of C preprocessing should be done by Lexer only. I don't see if it is a better way or not because as I described, C preprocessor also need parsing steps several times.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jay-blog.md

jay-blog.md

Antlr grammar for C/C++ Preprocessor.

Files

jay-blog.md

Latest commit

History

jay-blog.md

File metadata and controls

Antlr grammar for C/C++ Preprocessor.