Lexing fails for string containing Unicode escape sequence #55

The lexer fails on input strings that contain a Unicode escape sequence such as `'Fran\u00E7ois'`, reporting a `token recognition error`. Wrapping the input stream in a `CaseInsensitiveInputStream` makes lexing succeed, though.

Here is a unit test that demonstrates the problem:
~~~java
    @Test
    void testLexerUnicodeEscapes() {
        String s = "'Fran\\u00E7ois'";

        // Using a plain CodePointCharStream fails
        IllegalStateException exc = assertThrows(IllegalStateException.class, () -> {
            tryLexing(CharStreams.fromString(s));
        });
        assertEquals("Syntax error on line 1:0: token recognition error at: ''Fran\\u00E'.", exc.getMessage());

        // Wrapping it in a CaseInsensitiveInputStream makes it work. Why?
        CommonTokenStream tokens = tryLexing(new CaseInsensitiveInputStream(CharStreams.fromString(s)));
        assertEquals(2, tokens.size());
    }

    private CommonTokenStream tryLexing(CharStream stream) {
        ApexLexer lexer = new ApexLexer(stream);
        lexer.removeErrorListeners(); // Avoid distracting "token recognition error" stderr output
        lexer.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
                throw new IllegalStateException(String.format("Syntax error on line %d:%d: %s.",
                    line, charPositionInLine, msg));
            }
        });
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        return tokens;
    }
~~~

Is this by design or a bug? The Apex language is case-insensitive, but that shouldn't affect the contents of string literals.

Notes:
* Upgrading ANTLR from 4.9.1 to 4.13.2 does not fix the problem, though upgrading is still good practice
* Lexing with `CommonTokenStream` works correctly for literal non-ASCII Unicode characters like `'François'`
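
One untested hypothesis for the "Why?" in the test above: the error text stops right at the uppercase `E` of `\u00E7`, which would be consistent with a hex-digit rule in the grammar that lists only lowercase letters and relies on the case-folding stream to cover uppercase input. I have not confirmed this against the grammar. The sketch below does not use ANTLR or the real Apex grammar; it uses a plain `java.util.regex` pattern as a hypothetical stand-in for such a rule (the class name `HexCaseSketch` and both helper methods are made up for illustration) to show that a lowercase-only rule would produce exactly this behavior.

~~~java
import java.util.Locale;
import java.util.regex.Pattern;

public class HexCaseSketch {
    // Hypothetical stand-in for a string-literal rule whose Unicode escape
    // accepts only lowercase hex digits. This is an assumption for
    // illustration, not the actual Apex grammar rule.
    private static final Pattern LOWERCASE_ONLY_LITERAL =
        Pattern.compile("'(?:[^'\\\\]|\\\\u[0-9a-f]{4}|\\\\[btnfr\"'\\\\])*'");

    // Matches the input exactly as written, like a plain character stream.
    static boolean matchesRaw(String input) {
        return LOWERCASE_ONLY_LITERAL.matcher(input).matches();
    }

    // Lowercases the input before matching. This only approximates a
    // case-folding stream: here the whole text is lowercased, whereas a
    // stream wrapper would typically fold only what the lexer looks at.
    static boolean matchesCaseFolded(String input) {
        return LOWERCASE_ONLY_LITERAL.matcher(input.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        String upperHex = "'Fran\\u00E7ois'"; // same input as the unit test
        String lowerHex = "'Fran\\u00e7ois'"; // same escape, lowercase hex digit

        System.out.println(matchesRaw(upperHex));        // false
        System.out.println(matchesCaseFolded(upperHex)); // true
        System.out.println(matchesRaw(lowerHex));        // true
    }
}
~~~

If this hypothesis holds, lexing `'Fran\\u00e7ois'` (lowercase `e`) through a plain `CharStreams.fromString` stream should succeed, which would be a quick way to check it against the real lexer.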
