The lexer does not correctly handle input strings containing a Unicode escape sequence such as `'Fran\u00E7ois'`; it fails with a token recognition error. Wrapping the input stream in a `CaseInsensitiveInputStream` makes it work, though.

Here is a unit test demonstrating the issue (`ApexLexer` and `CaseInsensitiveInputStream` are this project's classes):
```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.junit.jupiter.api.Test;
// plus this project's ApexLexer and CaseInsensitiveInputStream

@Test
void testLexerUnicodeEscapes() {
    String s = "'Fran\\u00E7ois'";
    // Using a plain CodePointCharStream fails
    IllegalStateException exc = assertThrows(IllegalStateException.class, () -> {
        tryLexing(CharStreams.fromString(s));
    });
    assertEquals("Syntax error on line 1:0: token recognition error at: ''Fran\\u00E'.", exc.getMessage());
    // Wrapping it in a CaseInsensitiveInputStream makes it work. Why?
    CommonTokenStream tokens = tryLexing(new CaseInsensitiveInputStream(CharStreams.fromString(s)));
    assertEquals(2, tokens.size());
}

private CommonTokenStream tryLexing(CharStream stream) {
    ApexLexer lexer = new ApexLexer(stream);
    lexer.removeErrorListeners(); // Avoid distracting "token recognition error" stderr output
    lexer.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException(String.format("Syntax error on line %d:%d: %s.",
                    line, charPositionInLine, msg));
        }
    });
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    tokens.fill();
    return tokens;
}
```
Is this by design or a bug? The Apex language is case-insensitive, but that shouldn't affect string values like these.
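A plausible explanation for the difference (an assumption on my part; I haven't checked this project's `CaseInsensitiveInputStream`): case-insensitive streams in ANTLR usually follow the `CaseChangingCharStream` pattern from the ANTLR documentation, overriding `LA()` so that the lexer's decision logic only ever sees case-folded code points, while `getText()` still returns the original characters. A minimal sketch of that pattern:

```java
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.misc.Interval;

// Sketch of the common ANTLR case-folding stream pattern (cf. the
// CaseChangingCharStream example in the ANTLR docs); this is NOT
// necessarily what this project's CaseInsensitiveInputStream does.
class CaseFoldingCharStreamSketch implements CharStream {
    private final CharStream stream;

    CaseFoldingCharStreamSketch(CharStream stream) {
        this.stream = stream;
    }

    @Override
    public int LA(int i) {
        int c = stream.LA(i);
        // The lexer only ever sees the lower-cased code point;
        // the token text (via getText) keeps the original case.
        return c <= 0 ? c : Character.toLowerCase(c);
    }

    // All other CharStream methods delegate unchanged.
    @Override public String getText(Interval interval) { return stream.getText(interval); }
    @Override public void consume() { stream.consume(); }
    @Override public int mark() { return stream.mark(); }
    @Override public void release(int marker) { stream.release(marker); }
    @Override public int index() { return stream.index(); }
    @Override public void seek(int index) { stream.seek(index); }
    @Override public int size() { return stream.size(); }
    @Override public String getSourceName() { return stream.getSourceName(); }
}
```

If the wrapped stream does something like this, the lexer never sees the uppercase `E` in `\u00E7`, which would explain the failure on a plain stream if the grammar's Unicode-escape/hex-digit rules only match lowercase characters.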
Notes:
- Upgrading ANTLR from 4.9.1 to 4.13.2 does not solve it, but it's still good practice.
- Lexing with `CommonTokenStream` works correctly for literal non-ASCII Unicode characters like `'François'`.
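If the case-folding theory above is right, a quick way to confirm it (a hypothetical test I haven't run, reusing the `tryLexing` helper from the snippet above) would be to write the same escape with lowercase hex digits and lex it on a plain stream:

```java
@Test
void testLowercaseHexEscape() {
    // Hypothetical check: if the grammar's Unicode-escape rule only
    // matches lower-case hex digits, this variant should lex fine
    // even without CaseInsensitiveInputStream.
    CommonTokenStream tokens = tryLexing(CharStreams.fromString("'Fran\\u00e7ois'"));
    assertEquals(2, tokens.size()); // string literal + EOF, as in the passing case
}
```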