Is StandardTextWriter broken for some Unicode characters? #870

Closed
PaulKlint opened this issue Oct 12, 2015 · 9 comments

@PaulKlint
Member

Is it possible that StandardTextWriter does not handle \u0000 correctly or am I -- being rather Unicode ignorant -- overlooking something? Here is an example:

import org.eclipse.imp.pdb.facts.IString;
import org.eclipse.imp.pdb.facts.IValueFactory;
import org.rascalmpl.values.ValueFactoryFactory;

public class TEST {
    public static void main(String[] args) {
        IValueFactory vf = ValueFactoryFactory.getValueFactory();
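        IString s = vf.string("A\u0000B\u0000C");
        // toString() renders the value as a quoted, escaped string literal
        System.err.println(s.toString());
        // getValue() returns the raw Java string, including the NUL characters
        System.err.println(s.getValue());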
    }
}

prints:

"A\a00B\a00C"       <=== wrong
A�B�C

@DavyLandman
Member

Why is it wrong?

It's just printing it as \a00 rather than as \u0000, isn't it?

@PaulKlint
Member Author

The context: The \u0000 marker is hard-coded in generated rules that are automatically added by the parser generator to the Rascal grammar to delimit holes. When I get it back as \a00 it no longer parses. So the above behavior may be correct from a Unicode perspective but it causes a problem in this context. At least somewhere during the processing an inequality occurs that causes the parse to fail.

@DavyLandman
Member

But isn't it just an encoding error? In the string itself there is no difference. Could it be that the code in the parser generator should also support \a escapes?
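
To illustrate the point that only the textual representation differs, here is a minimal Java sketch. The `escape` and `unescape` helpers are hypothetical and are not the StandardTextWriter implementation; they only mimic the `\a00` style of output seen above, and they deliberately ignore escaping of the backslash itself.

```java
public class EscapeRoundTrip {
    // Hypothetical escaper (an assumption, not library code): renders ASCII
    // control characters as backslash, 'a', and two hex digits.
    public static String escape(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c < 0x20 || c == 0x7F) {
                sb.append(String.format("\\a%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hypothetical inverse: turns each backslash-a-hex-hex sequence back into one character.
    public static String unescape(String text) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            boolean isEscape = text.startsWith("\\a", i)
                    && i + 4 <= text.length()
                    && Character.digit(text.charAt(i + 2), 16) >= 0
                    && Character.digit(text.charAt(i + 3), 16) >= 0;
            if (isEscape) {
                sb.append((char) Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                sb.append(text.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "A\u0000B\u0000C";
        String shown = escape(raw);
        System.out.println(shown);                       // A\a00B\a00C
        System.out.println(unescape(shown).equals(raw)); // true
        System.out.println(shown.equals(raw));           // false
    }
}
```

The value survives a round trip as long as the consumer unescapes the text. Trouble only starts when the escaped text is compared with, or used as, the raw value.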

@DavyLandman
Member

It could also be printed as \U000000....

@PaulKlint
Member Author

The parser generator already adds one extra rule per non-terminal (using \u0000). Not so nice if all these alternative encodings have to be supported as well ...

@DavyLandman
Member

Could you give an example? I have trouble understanding why it doesn't parse.

rascal>"\u0000" == "\a00"
bool: true
rascal>lexical Test = "\u0000";
ok
rascal>[Test]"\a00"
Test: ``
Tree: appl(prod(lex("Test"),[lit("\a00")],{}),[appl(prod(lit("\a00"),[\char-class([range(0,0)])],{}),[char(0)])])[@loc=|prompt:///|(0,1,<1,0>,<1,1>)]
rascal>

@PaulKlint
Member Author

It is all internal processing inside the parser generator and not so easy to demonstrate on the command line. It boils down to the textual representation of a hole that is computed internally:

Interpreted: sort("A"):0
Compiled: \u0000sort("A"):0\u0000

So now I believe it is rather on the text generation side that something goes wrong: these null characters should not be escaped!

@DavyLandman
Member

ah yes, double escaping is always a scary thing. Shall we close the issue?
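
The mismatch described above can be sketched in plain Java. The `escapeNul` helper is hypothetical and only stands in for a text generation step that escapes once too often; it is not the compiler's actual code.

```java
public class DoubleEscapeDemo {
    // Hypothetical text generation step (an assumption): writes every NUL
    // character as the six characters backslash, 'u', '0', '0', '0', '0'.
    public static String escapeNul(String s) {
        return s.replace("\u0000", "\\u0000");
    }

    public static void main(String[] args) {
        // Hole marker as the parser expects it: delimited by real NUL characters.
        String hole = "\u0000sort(\"A\"):0\u0000";
        // The same marker after the extra escaping step.
        String generated = escapeNul(hole);

        System.out.println(hole.length());          // 13
        System.out.println(generated.length());     // 23
        System.out.println(hole.equals(generated)); // false
    }
}
```

Both strings look the same once printed as escaped literals, but the second contains literal backslash sequences where the first contains NUL characters, so an equality check between them fails.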

Btw, if you have \0 in the strings, always give an encoding to the ResolverRegistry, since a \0 messes up the encoding detection. But that would give way stranger errors than the ones you are seeing now.
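
As a general illustration of why NUL characters make encoding guesses unreliable, the sketch below uses only the Java standard library. It does not show how the ResolverRegistry detects encodings (that is not described here); the `decodeAs` helper is hypothetical and simply decodes the same bytes under two different assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NulEncodingDemo {
    // Hypothetical helper: encodes text as UTF-8, then decodes the bytes
    // with whatever charset a reader assumed or guessed.
    public static String decodeAs(String text, Charset guess) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, guess);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of this string are 41 00 42 00, which is also valid UTF-16LE.
        String original = "A\u0000B\u0000";

        String explicit = decodeAs(original, StandardCharsets.UTF_8);
        String misread = decodeAs(original, StandardCharsets.UTF_16LE);

        System.out.println(explicit.equals(original)); // true
        System.out.println(misread);                   // AB
    }
}
```

Because the same bytes are valid under both encodings, a reader that has to guess can silently pick the wrong one. Stating the encoding explicitly removes the ambiguity.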

@PaulKlint
Member Author

Management summary: the compiler did not properly handle some escapes in string templates. As a result, double escaping occurred, as in the example above. That has been fixed.
