readstat not converting encoding of sas7bcat labels #152

@ofajardo

Sas7bcat labels with special characters are not correctly translated into UTF-8.

As an example coming from this R-Haven issue, when reading the file "formats.sas7bcat" coming from here, I get labels like "modalit\xe9 \xe01", which are not valid UTF-8 but are valid windows-1252 or latin1. The thing is that readstat correctly sets the file.encoding to windows-1252, so that string should already be valid UTF-8 by the time my function readstat_value_label_handler receives it. This happens in pyreadstat, in R-Haven, and when debugging readstat with gdb. A user found a similar issue in pyreadstat with another file of his.
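
To make the claim about the example label concrete, here is a minimal sketch of a UTF-8 well-formedness check. The function name `is_valid_utf8` is hypothetical and not part of readstat; it only illustrates why the byte 0xE9 followed by a space cannot be UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of readstat): returns true if the first
 * len bytes of s are well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t len) {
    /* Smallest code point allowed for each sequence length (rejects overlongs). */
    static const unsigned long min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;      /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */
        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return false;           /* stray continuation or invalid lead byte */
        if (extra > len - i - 1) return false;  /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;  /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min_cp[extra]) return false;                /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;      /* surrogate */
        if (cp > 0x10FFFF) return false;                     /* out of range */
        i += extra + 1;
    }
    return true;
}
```

With this check, the raw label bytes `"modalit\xe9 \xe0" "1"` are rejected: 0xE9 looks like the lead byte of a three-byte sequence, but the next byte is a plain space. The literal is split before the `1` because C would otherwise read `\xe01` as a single hex escape.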

Looking at readstat_sas7bcat_read.c, in the function sas7bcat_parse_value_labels, it seems to me that the variable label never gets converted. I inserted the following after line 91 and it cures the problem:

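        const char *label = &lbp2[10]; // this is line 91
        //added! 20181011
        // UTF-8 output can be longer than the input; assumed worst case is
        // 4 bytes per input byte, plus one byte for the terminator
        char label2[4*label_len+1];
        retval = readstat_convert(label2, sizeof(label2),
                    label, label_len, ctx->converter);
        if (retval != READSTAT_OK)
                goto cleanup;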

As my understanding of readstat and iconv is still limited (I hope to improve it!), I am not sure this is the proper solution, so I did not dare to send a PR, but I can do so after your suggestions.
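
For reference, the conversion step itself can be reproduced outside readstat with plain iconv. This is only a sketch under assumptions: `to_utf8` is a hypothetical helper, not readstat's `readstat_convert`, and it assumes the platform's iconv recognizes the name "WINDOWS-1252".

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not readstat's readstat_convert): converts src_len
 * bytes of src from from_code into NUL-terminated UTF-8 in dst.
 * Returns 0 on success, -1 on any failure (unknown encoding, invalid
 * input, or destination buffer too small). */
int to_utf8(char *dst, size_t dst_len,
            const char *src, size_t src_len, const char *from_code) {
    assert(dst != NULL && src != NULL && from_code != NULL);
    if (dst_len == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", from_code);
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src;          /* iconv() takes char **, input is not modified */
    char *out = dst;
    size_t in_left = src_len;
    size_t out_left = dst_len - 1;   /* keep one byte for the terminator */

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *out = '\0';
    return 0;
}
```

Called with the label bytes from the example and "WINDOWS-1252" as the source encoding, the helper should yield the two-byte sequences 0xC3 0xA9 and 0xC3 0xA0 in place of 0xE9 and 0xE0. Sizing the destination at 4 bytes per input byte plus one is a conservative choice, not a requirement taken from readstat.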

Another smaller but still confusing thing: if I set the encoding manually with readstat_set_file_character_encoding to, let's say, LATIN1, and later try to recover the file encoding with readstat_get_file_encoding, I still get WINDOWS-1252. I think the reason is that in readstat_sas7bcat_read.c, line 371:

.file_encoding = hinfo->encoding

should be:

.file_encoding = ctx->input_encoding

as it is in readstat_sas7bdat_read.c line 594, to reflect that the user set the encoding manually.
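
The intended behavior can be sketched as follows. The struct and function names here are hypothetical stand-ins, not readstat's actual types, and the sketch assumes that the reader falls back to the header's encoding only when the user has not set one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the header info and reader context. */
typedef struct { const char *encoding; } header_info_t;
typedef struct { const char *input_encoding; } reader_ctx_t;

/* Assumed start-up step: a user-supplied encoding, when present,
 * takes precedence over the encoding found in the file header. */
void apply_header_encoding(reader_ctx_t *ctx, const header_info_t *hinfo) {
    assert(ctx != NULL && hinfo != NULL);
    if (ctx->input_encoding == NULL)
        ctx->input_encoding = hinfo->encoding;
}

/* Reporting ctx->input_encoding (rather than hinfo->encoding) makes the
 * metadata reflect the encoding that is actually used for conversion. */
const char *reported_file_encoding(const reader_ctx_t *ctx) {
    assert(ctx != NULL);
    return ctx->input_encoding;
}
```

Under these assumptions, a context with a manual LATIN1 override reports LATIN1, while a context without an override reports whatever the header says.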
