cannot read correctly variable name #268

Open
ofajardo opened this issue Mar 4, 2022 · 3 comments
ofajardo commented Mar 4, 2022

When reading the attached file, there should be a variable named "BRANDAA_SUN_1", but I get "BRANDAA" instead. PSPP reads the variable name correctly. I think the file was created using the IBM SPSS DLL files rather than the full application. If the file is opened in SPSS and saved again, it is read correctly. I have confirmed with a simple C program that the issue does come from ReadStat:

#include <stdio.h>
#include "readstat.h"

/* Store the row count reported by the file header in the user context. */
int handle_metadata(readstat_metadata_t *metadata, void *ctx) {
    int *my_count = (int *)ctx;

    *my_count = readstat_get_row_count(metadata);

    return READSTAT_HANDLER_OK;
}

/* Print the name of each variable as the parser reports it. */
int handle_variable(int index, readstat_variable_t *variable, const char *val_labels, void *ctx) {
    const char *var_name = readstat_variable_get_name(variable);

    printf("Variable: %s\n", var_name);

    return READSTAT_HANDLER_OK;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    int my_count = 0;
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);
    readstat_set_variable_handler(parser, &handle_variable);

    error = readstat_parse_sav(parser, argv[1], &my_count);

    readstat_parser_free(parser);

    if (error != READSTAT_OK) {
        printf("Error processing %s: %d\n", argv[1], error);
        return 1;
    }
    printf("Found %d records\n", my_count);
    return 0;
}
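
To flag this kind of truncation automatically, the name printed by the variable handler could be compared against the name another reader (such as PSPP) reports. The following is only a minimal sketch in plain C, independent of ReadStat; `is_truncated_name` is a hypothetical helper name, not part of the library:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of ReadStat): returns true when `actual`
 * is a non-empty, strictly shorter prefix of `expected`, i.e. the name
 * looks as if it was cut off. */
bool is_truncated_name(const char *expected, const char *actual) {
    size_t expected_len = strlen(expected);
    size_t actual_len = strlen(actual);

    if (actual_len == 0 || actual_len >= expected_len) {
        return false;
    }
    return strncmp(expected, actual, actual_len) == 0;
}
```

Called with the expected name "BRANDAA_SUN_1" and the reported name "BRANDAA", this helper returns true; it returns false when the two names are identical.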

test.SAV.zip

original report: Roche/pyreadstat#165

ofajardo commented Jan 10, 2024

Here is another file with a similar issue. This file was apparently created using SPSS itself (not the DLLs as in the previous example). The variable whose name has been truncated is XC0DAB1_1 (truncated to XC0DAB1); it is the variable in position 85 (counting from 1). Again, PSPP reads the variable correctly.

original report

CRO_MX.zip

zenelba commented Jan 12, 2024

Here is another file with a similar issue. This file was apparently created using SPSS itself (not the DLLs as in the previous example). The variable whose name has been truncated is XC0DAB1_1 (truncated to XC0DAB1); it is the variable in position 85 (counting from 1). Again, PSPP reads the variable correctly.

Yes - I opened it in SPSS and saved it again.

mtr commented Feb 16, 2024

I am experiencing an error that seems related to this.

I am sorry, but I cannot share the (customer's) data file, and I haven't had the time to generate a synthetic example file that triggers the bug. However, I have been able to narrow down the issue a little:

  1. I write a file using pyreadstat (which has 487 columns and 530348 rows), let's call it file_broken.sav. Some of the columns/variables have names with Norwegian letters, like “æ”, “ø”, and “å”. When writing that file, the input column to the pyreadstat.write_sav() function is named "forn_1" with lowercase letters (checked in-memory with debugger).
  2. When file_broken.sav has been written to disk, the column has (automatically) been renamed to "FORN_1" in uppercase (this is strange). I have checked this by reading the file with both readstat and pyreadstat.
  3. If I open file_broken.sav with the SPSS program, the variable is shown as "forn_1" in lowercase, and if I save the exact same file from SPSS as file_ok.sav, the variable on disk is no longer in uppercase.
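
The symptom in steps 1 to 3 is a rename that changes only letter case. A check along the following lines could flag such renames when comparing the names passed to the writer with the names read back. This is a sketch in plain C under the assumption that only ASCII letters are case-folded; `differs_only_by_ascii_case` is a hypothetical helper, not a ReadStat or pyreadstat function:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when the two names are different strings that
 * become equal once ASCII letters are case-folded. Non-ASCII bytes (for
 * example UTF-8 encoded Norwegian letters) are compared byte for byte. */
bool differs_only_by_ascii_case(const char *a, const char *b) {
    if (strcmp(a, b) == 0) {
        return false; /* identical names are not a case-only rename */
    }
    while (*a != '\0' && *b != '\0') {
        unsigned char ca = (unsigned char)*a;
        unsigned char cb = (unsigned char)*b;

        if (ca < 0x80 && cb < 0x80) {
            if (tolower(ca) != tolower(cb)) {
                return false;
            }
        } else if (ca != cb) {
            return false;
        }
        a++;
        b++;
    }
    /* Both strings must end at the same position. */
    return *a == '\0' && *b == '\0';
}
```

For the pair "forn_1" and "FORN_1" the helper returns true, which matches the rename described above.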

So, to see whether the error is caused by pyreadstat or readstat, I tried the following, using freshly compiled readstat and extract_metadata binaries (the C version):

  1. Extract the OK metadata:
     ./extract_metadata file_ok.sav file_ok-metadata.json
  2. Verify that only "forn_1" and not "FORN_1" is present in file_ok-metadata.json. Hence, reading the file and writing the metadata separately both seem to work.
  3. (First [failed] attempt) Create a CSV version of the data file:
     ./readstat file_ok.sav file_ok.csv
     Converted 489 variables and 88013 rows in 4.49 seconds
     Error processing file_ok.sav: Unable to convert string to the requested encoding (invalid byte sequence)
  4. (Second attempt) Create a CSV version of the data file by manually renaming "FORN_1" -> "forn_1":
     ./readstat file_broken.sav file_broken.csv
     sed 's/"FORN_1"/"forn_1"/g' file_broken.csv > file_ok.csv
  5. Isolate the writing of the file with readstat (the C version) by combining data and metadata into a new file:
     ./readstat file_ok.csv file_ok-metadata.json output.sav
  6. Extract the new metadata:
     ./extract_metadata output.sav output-metadata.json
  7. Opening the new output-metadata.json shows that the variable is now named FORN_1.

When I wrote this I was surprised by the error during my first attempt at converting data from .sav to .csv. I guess I will inspect the data file around row 88013.
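
One way to narrow down an "invalid byte sequence" error is to scan the string values near the failing row for bytes that do not form well-formed UTF-8. The sketch below assumes the values are expected to be UTF-8; if the file declares another encoding, a different check is needed. `first_invalid_utf8` is a hypothetical helper in plain C, not part of ReadStat:

```c
#include <stddef.h>

/* Hypothetical helper: returns the offset at which the first malformed
 * UTF-8 sequence starts, or -1 if the whole buffer is well-formed. */
long first_invalid_utf8(const unsigned char *s, size_t len) {
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte */

        if (c < 0x80) {                      /* plain ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
            if (c == 0xE0) lo = 0xA0;        /* reject overlong forms */
            if (c == 0xED) hi = 0x9F;        /* reject surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
            if (c == 0xF0) lo = 0x90;        /* reject overlong forms */
            if (c == 0xF4) hi = 0x8F;        /* stay below U+110000 */
        } else {
            return (long)i;                  /* invalid lead byte */
        }

        if (len - i <= extra) {
            return (long)i;                  /* sequence cut off at the end */
        }
        if (s[i + 1] < lo || s[i + 1] > hi) {
            return (long)i;
        }
        for (size_t k = 2; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return (long)i;
            }
        }
        i += extra + 1;
    }
    return -1;
}
```

A value holding a Latin-1 encoded "ø" (the single byte 0xF8), for instance, is reported as invalid at the offset of that byte, while the UTF-8 encoding of "æ" (0xC3 0xA6) passes.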

I am sorry that I cannot provide a reproducible error report, but thought that this might shed some light on where to look for the cause of this bug.
