Is StandardTextWriter broken for some Unicode characters? #870

PaulKlint · 2015-10-12T07:18:44Z

Is it possible that StandardTextWriter does not handle \u0000 correctly or am I -- being rather Unicode ignorant -- overlooking something? Here is an example:

import org.eclipse.imp.pdb.facts.IString;
import org.eclipse.imp.pdb.facts.IValueFactory;
import org.rascalmpl.values.ValueFactoryFactory;

public class TEST {
    public static void main(String[] args) {
        IValueFactory vf = ValueFactoryFactory.getValueFactory();
        IString s = vf.string("A\u0000B\u0000C");
        System.err.println(s.toString());
        System.err.println(s.getValue());
    }
}

prints:

"A\a00B\a00C"       <=== wrong
A�B�C


DavyLandman · 2015-10-12T07:20:44Z

Why is it wrong?

Isn't it just printing it as \a00 instead of \u0000?

PaulKlint · 2015-10-12T07:35:36Z

The context: The \u0000 marker is hard-coded in generated rules that are automatically added by the parser generator to the Rascal grammar to delimit holes. When I get it back as \a00 it no longer parses. So the above behavior may be correct from a Unicode perspective but it causes a problem in this context. At least somewhere during the processing an inequality occurs that causes the parse to fail.

DavyLandman · 2015-10-12T07:43:11Z

But isn't it just an encoding error? In the string itself there is no difference. Could it be that the code in the parser generator should also support \a escapes?
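To illustrate the point that only the printed form differs, here is a minimal sketch in plain Java. The `escape` method is a hypothetical stand-in that writes control characters as \aXX, mimicking the output shown above; it is not the actual StandardTextWriter code.

```java
class AsciiEscapeSketch {
    // Hypothetical escaper (an assumption for illustration): quotes the string
    // and writes characters below 0x20 as \aXX with two hex digits.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);   // escape quote and backslash
            } else if (c < 0x20) {
                sb.append(String.format("\\a%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        String s = "A\u0000B\u0000C";
        // The in-memory string still has 5 characters, two of them NUL.
        System.out.println(s.length());
        // Only the textual rendering uses the \a00 notation.
        System.out.println(escape(s));
    }
}
```

The string value is untouched; a difference only shows up if some later step compares or re-parses the escaped text instead of the value.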

DavyLandman · 2015-10-12T07:44:24Z

It could also be printed as \U000000....

PaulKlint · 2015-10-12T07:58:13Z

The parser generator already adds one extra rule per non-terminal (using \u0000). Not so nice if all these alternative encodings have to be supported as well ...

DavyLandman · 2015-10-12T08:12:57Z

Could you give an example? I have trouble understanding why it doesn't parse.

rascal>"\u0000" == "\a00"
bool: true
rascal>lexical Test = "\u0000";
ok
rascal>[Test]"\a00"
Test: ``
Tree: appl(prod(lex("Test"),[lit("\a00")],{}),[appl(prod(lit("\a00"),[\char-class([range(0,0)])],{}),[char(0)])])[@loc=|prompt:///|(0,1,<1,0>,<1,1>)]
rascal>

PaulKlint · 2015-10-12T08:30:59Z

It is all internal processing inside the parser generator and not so easy to demonstrate on the command line. It boils down to the textual representation of a hole that is computed internally:

Interpreted: sort("A"):0
Compiled: \u0000sort("A"):0\u0000

So now I believe it is rather on the text generation side that something goes wrong: these null characters should not be escaped!

DavyLandman · 2015-10-12T09:08:44Z

ah yes, double escaping is always a scary thing. Shall we close the issue?

Btw, if you have \0 in your strings, always pass an encoding to the ResolverRegistry, since a \0 messes up the encoding detection. But that would give way stranger errors than the ones you are seeing now.
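A rough sketch of why NUL bytes can confuse encoding guessing, using only the JDK charset classes (no ResolverRegistry calls are shown, since their exact signatures are not given in this thread). The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Encode as UTF-8 and decode with the same, explicitly named charset.
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8 but decode as UTF-16LE, which is what a guesser might
    // pick when it sees every second byte being zero.
    static String misreadAsUtf16le(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String s = "A\u0000B\u0000";   // UTF-8 bytes: 41 00 42 00
        System.out.println(roundTripUtf8(s).equals(s));   // explicit charset: unchanged
        System.out.println(misreadAsUtf16le(s));          // the NULs vanish into the code units
    }
}
```

With the charset stated explicitly the NUL characters survive; with a wrong guess the same bytes decode to a different, shorter string.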

PaulKlint · 2015-10-12T20:06:58Z

Management summary: the compiler did not properly handle some escapes in string templates. As a result, double escaping occurred, as in the above example. That has been fixed.
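For readers who want to see the effect in isolation, here is a small Java sketch of double escaping. Both `escape` and `unescape` are hypothetical helpers written for this illustration, assuming the \aXX notation from the output above; they are not the compiler's or the writer's actual code.

```java
class DoubleEscapeSketch {
    // Hypothetical: backslash becomes \\ and control characters become \aXX.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\a%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hypothetical inverse of escape: undoes exactly one level of escaping.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char n = s.charAt(i + 1);
                if (n == 'a' && i + 3 < s.length()) {
                    // \aXX: two hex digits give the character code
                    sb.append((char) Integer.parseInt(s.substring(i + 2, i + 4), 16));
                    i += 4;
                } else {
                    sb.append(n);   // escaped backslash or other escaped character
                    i += 2;
                }
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hole = "\u0000X\u0000";   // NUL-delimited marker, as in the hole text
        String once = escape(hole);      // \a00X\a00
        String twice = escape(once);     // \\a00X\\a00
        System.out.println(unescape(once).equals(hole));    // one level round-trips
        System.out.println(unescape(twice).equals(hole));   // double escaping does not
    }
}
```

After one round of unescaping, the doubly escaped text still contains the literal characters \a00 rather than NUL, which is the kind of mismatch that makes a later comparison or parse fail.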

PaulKlint added bug question labels Oct 12, 2015

PaulKlint closed this as completed Oct 12, 2015
