
Fix: Slide IDs turned into floats in split CSV when names consist only of digits #228

Open
@ff98li ff98li commented Feb 26, 2024

Summary of the Issue

  • Slide IDs that consist solely of digits are inadvertently converted to floats in the split CSV files.
    • The train, val, and test splits have unequal lengths, so concatenating them into one dataframe in save_splits() pads the shorter columns with NaN.
    • pandas then upcasts the all-numeric columns that contain NaN to float, because its default integer dtype has no NaN representation.
      Screenshot 2024-02-26 at 1 27 58 PM
  • Loading the split CSV with the following line then raises the ValueError shown in the screenshot:
    all_splits = pd.read_csv(csv_path, dtype=self.slide_data['slide_id'].dtype)
    # Without "dtype=self.slide_data['slide_id'].dtype", read_csv() will convert all-number columns to a numerical type.
    # Even if we convert numerical columns back to objects later, we may lose zero-padding in the process;
    # the columns must be correctly read in from the get-go.
    # When we compare the individual train/val/test columns to self.slide_data['slide_id'] in the
    # get_split_from_df() method, we cannot compare objects (strings) to numbers or even to
    # incorrectly zero-padded objects/strings.
    # An example of this breaking is shown in
    # https://github.com/andrew-weisman/clam_analysis/tree/main/datatype_comparison_bug-2021-12-01.

    Screenshot 2024-02-26 at 2 07 43 PM
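
The conversion described above can be reproduced in isolation. The snippet below is a minimal sketch, not the project's code: the slide IDs are made up, and the concatenation only imitates what save_splits() is described as doing. It shows the shorter columns picking up NaN and being written out as floats, so that even a string-typed reload no longer matches the original IDs.

```python
import io
import pandas as pd

# Hypothetical all-digit slide IDs; the split sizes differ on purpose.
train = pd.Series([101, 102, 103])
val = pd.Series([104])
test = pd.Series([105, 106])

# Column-wise concatenation pads the shorter splits with NaN,
# which forces those integer columns to float64.
splits = pd.concat([train, val, test], ignore_index=True, axis=1)
splits.columns = ["train", "val", "test"]
print(splits.dtypes)

# The float columns are serialized with a trailing ".0".
csv_text = splits.to_csv(index=False)
print(csv_text)

# Reading everything back as strings keeps the ".0" suffix,
# so "104.0" can never equal the original ID "104".
reloaded = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(reloaded["val"].iloc[0])
```

Only the columns that are shorter than the longest split are affected, which is why the problem can go unnoticed when one split happens to be the longest.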

Proposed fix

  • Cast slide IDs to strings in save_splits before they are saved to CSV, to prevent the unintended type conversion.
    • Result:
      Screenshot 2024-02-26 at 2 40 21 PM
  • Continue to read the dataset CSV with dtype=object in Generic_WSI_Classification_Dataset.
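
A rough sketch of the idea, under stated assumptions: `save_splits_as_strings` is a hypothetical stand-in for save_splits that returns CSV text instead of writing a file, and the slide IDs are invented. The actual change in this PR may differ in detail; the point is only that casting to strings before concatenation keeps NaN padding from triggering a float upcast.

```python
import io
import pandas as pd

def save_splits_as_strings(split_ids, column_keys):
    """Hypothetical stand-in for save_splits: returns CSV text instead of writing a file."""
    # Cast every split to strings first, so NaN padding cannot turn the IDs into floats.
    as_str = [pd.Series(ids).astype(str).reset_index(drop=True) for ids in split_ids]
    df = pd.concat(as_str, ignore_index=True, axis=1)
    df.columns = column_keys
    return df.to_csv(index=False)

csv_text = save_splits_as_strings(
    [[101, 102, 103], [104], [105, 106]],  # made-up all-digit slide IDs
    ["train", "val", "test"],
)
print(csv_text)

# Reading with dtype=object keeps the IDs as strings, without a ".0" suffix.
reloaded = pd.read_csv(io.StringIO(csv_text), dtype=object)
print(reloaded["val"].dropna().tolist())
print(reloaded["test"].dropna().tolist())
```

Reading with dtype=object on the way back in matters as well: without it, read_csv would infer a numeric type for the all-digit columns again.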

This happened while I was working with the dataset CSV for my own task. I can provide the CSV file to reproduce this bug if need be.
