Lexing fails for string containing Unicode escape sequence #55

@wahajenius

The lexer does not correctly handle input strings containing a Unicode escape sequence such as 'Fran\u00E7ois'; it fails with a token recognition error. Wrapping the input stream in a CaseInsensitiveInputStream makes it work, though.

Here is a unit test that demonstrates the problem:

    @Test
    void testLexerUnicodeEscapes() {
        String s = "'Fran\\u00E7ois'";

        // Using a plain CodePointCharStream fails
        IllegalStateException exc = assertThrows(IllegalStateException.class, () -> {
            tryLexing(CharStreams.fromString(s));
        });
        assertEquals("Syntax error on line 1:0: token recognition error at: ''Fran\\u00E'.", exc.getMessage());

        // Wrapping it in a CaseInsensitiveInputStream makes it work. Why?
        CommonTokenStream tokens = tryLexing(new CaseInsensitiveInputStream(CharStreams.fromString(s)));
        assertEquals(2, tokens.size());
    }

    private CommonTokenStream tryLexing(CharStream stream) {
        ApexLexer lexer = new ApexLexer(stream);
        lexer.removeErrorListeners(); // Avoid distracting "token recognition error" stderr output
        lexer.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
                throw new IllegalStateException(String.format("Syntax error on line %d:%d: %s.",
                    line, charPositionInLine, msg));
            }
        });
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        return tokens;
    }

Is this by design or a bug? The Apex language is case-insensitive, but that shouldn't affect the contents of string literals.

Notes:

  • Upgrading ANTLR from 4.9.1 to 4.13.2 does not fix it, though upgrading is still good practice
  • Lexing with CommonTokenStream works correctly for literal non-ASCII Unicode characters like 'François'
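
One possible explanation, offered only as a guess: if the grammar's hex-digit fragment lists only lowercase `a`..`f` (an assumption, I have not checked the grammar), then the uppercase `E` in `\u00E7` would be rejected unless the stream folds case on lookahead. The self-contained sketch below does not use ANTLR or ApexLexer. `HexEscapeSketch`, `isLowerHexDigit` and `firstRejectedIndex` are hypothetical names that only simulate such a rule. The rejected index it reports (9, the `E`) lines up with where the text in the reported error message ends.

```java
public class HexEscapeSketch {
    // Hypothetical stand-in for a grammar fragment that accepts only lowercase hex digits.
    static boolean isLowerHexDigit(int c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
    }

    // Tries to match a backslash, 'u' and four hex digits beginning at `start`.
    // Returns the index of the first rejected character, or -1 if all six match.
    // When foldCase is true, each character is lowercased before matching,
    // which imitates a stream that lowercases its lookahead.
    static int firstRejectedIndex(String input, int start, boolean foldCase) {
        for (int i = 0; i < 6; i++) {
            int pos = start + i;
            if (pos >= input.length()) {
                return pos;
            }
            int c = input.charAt(pos);
            if (foldCase) {
                c = Character.toLowerCase(c);
            }
            boolean ok;
            if (i == 0) {
                ok = (c == '\\');
            } else if (i == 1) {
                ok = (c == 'u');
            } else {
                ok = isLowerHexDigit(c);
            }
            if (!ok) {
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "'Fran\\u00E7ois'";   // same input as the unit test above
        int start = s.indexOf('\\');     // index 5
        System.out.println(firstRejectedIndex(s, start, false)); // prints 9 (the 'E')
        System.out.println(firstRejectedIndex(s, start, true));  // prints -1 (whole escape matches)
    }
}
```

If this guess is right, a lowercase escape such as `'Fran\\u00e7ois'` should lex without the wrapper, which would be a quick way to confirm or rule it out.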
