Description
Sas7bcat labels with special characters are not correctly translated into UTF-8.
As an example taken from this R-Haven issue: when reading the file "formats.sas7bcat" from here, I get labels like "modalit\xe9 \xe01", which are not valid UTF-8 but are valid windows-1252 or latin1. Since readstat correctly sets file.encoding to windows-1252, the string should already be valid UTF-8 by the time my readstat_value_label_handler function receives it. This happens in pyreadstat, in R-Haven, and when debugging readstat with gdb. A user found a similar issue in pyreadstat with another file of his.
Looking at the function sas7bcat_parse_value_labels in readstat_sas7bcat_read.c, it seems to me that the variable label never gets converted. Inserting the following after line 91 cures the problem:
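To make the claim about the bytes concrete, here is a minimal sketch of a structural UTF-8 check. The function is_valid_utf8 is a hypothetical helper written for this illustration and is not part of readstat. It is deliberately simplified: it checks lead and continuation bytes but does not reject every overlong or surrogate encoding.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of readstat): returns true if the first
 * `len` bytes of `s` have the lead/continuation structure of UTF-8.
 * Simplified: overlong 3/4-byte forms and surrogates are not rejected. */
static bool is_valid_utf8(const unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra; /* number of continuation bytes expected after c */
        if (c < 0x80) {
            extra = 0;                      /* ASCII */
        } else if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;                      /* 2-byte sequence */
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;                      /* 3-byte sequence */
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;                      /* 4-byte sequence */
        } else {
            return false;                   /* stray continuation or invalid lead */
        }
        if (extra > len - i - 1) {
            return false;                   /* sequence truncated */
        }
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return false;               /* not a continuation byte */
            }
        }
        i += extra + 1;
    }
    return true;
}
```

With the label from the file, the byte 0xE9 announces a three-byte sequence but is followed by a space, so the check fails. The same text encoded as UTF-8 ("modalit\xc3\xa9 \xc3\xa0" followed by "1") passes. Note that in a C string literal the "1" has to be a separate literal, because "\xe01" would be parsed as a single out-of-range hex escape.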
const char *label = &lbp2[10]; // this is line 91
//added! 20181011
char label2[4 * label_len + 1]; // char buffer (not char *), with room for UTF-8 expansion
retval = readstat_convert(label2, sizeof(label2),
label, label_len, ctx->converter);
if (retval != READSTAT_OK)
goto cleanup;
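For reference, the conversion step above can be sketched outside readstat using plain iconv. This is only an illustration under my own assumptions: cp1252_to_utf8 is a hypothetical helper, not readstat's readstat_convert, and the 4 * len + 1 buffer size is a conservative bound I chose so that the output always fits.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of readstat): converts `src_len` bytes of
 * windows-1252 text to a newly allocated, NUL-terminated UTF-8 string.
 * Returns NULL on failure; the caller frees the result. */
static char *cp1252_to_utf8(const char *src, size_t src_len) {
    iconv_t cd = iconv_open("UTF-8", "WINDOWS-1252");
    if (cd == (iconv_t)-1) {
        return NULL;
    }

    /* Conservative bound: every input byte grows to at most 4 bytes. */
    size_t dst_cap = 4 * src_len + 1;
    char *dst = malloc(dst_cap);
    if (dst == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)src;        /* iconv takes a non-const pointer */
    size_t in_left = src_len;
    char *out = dst;
    size_t out_left = dst_cap - 1; /* keep one byte for the terminator */

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(dst);
        return NULL;
    }

    *out = '\0';
    return dst;
}
```

Calling it on the 11 bytes of "modalit\xe9 \xe0" "1" should give the UTF-8 bytes for "modalité à1", which is what I would expect the value label handler to receive.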
As my understanding of readstat and iconv is still limited (I hope to improve it!), I am not sure this is the proper solution, so I did not dare to send a PR, but I can do so after your suggestions.
Another smaller, but still confusing, issue: if I set the encoding manually with readstat_set_file_character_encoding to, say, LATIN1, and later try to recover the file encoding with readstat_get_file_encoding, I still get WINDOWS-1252. I think the reason is that in readstat_sas7bcat_read.c, line 371:
.file_encoding = hinfo->encoding
should be:
.file_encoding = ctx->input_encoding
as it is in readstat_sas7bdat_read.c line 594, to reflect that the user set the encoding manually.
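The behaviour I would expect can be sketched as follows. The struct and function names here (demo_ctx_t, demo_header_t, reported_file_encoding) are hypothetical and only mimic the two fields mentioned above. Falling back to the header encoding when nothing was set manually is my assumption about the intended semantics, not something I verified in the readstat source.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the reader context and parsed header. */
typedef struct {
    const char *input_encoding; /* set by the user, or NULL if not set */
} demo_ctx_t;

typedef struct {
    const char *encoding;       /* encoding detected from the file header */
} demo_header_t;

/* Returns the encoding that should be reported back to the user:
 * a manually set encoding takes precedence over the detected one. */
static const char *reported_file_encoding(const demo_ctx_t *ctx,
                                          const demo_header_t *hinfo) {
    if (ctx->input_encoding != NULL) {
        return ctx->input_encoding;
    }
    return hinfo->encoding;
}
```

With this precedence, setting LATIN1 manually and then asking for the file encoding would return LATIN1 instead of the WINDOWS-1252 value detected from the header.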