Merge pull request mahmoodlab#90 from andrew-weisman/datatype_comparison_bug-2021-12-01

Datatype comparison bug 2021-12-01
fedshyvana authored Dec 2, 2021
2 parents 21c5bb2 + 5475cab commit 88da7ca
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion datasets/dataset_generic.py
@@ -244,7 +244,7 @@ def return_splits(self, from_id=True, csv_path=None):

         else:
             assert csv_path
-            all_splits = pd.read_csv(csv_path)
+            all_splits = pd.read_csv(csv_path, dtype=self.slide_data['slide_id'].dtype) # Without "dtype=self.slide_data['slide_id'].dtype", read_csv() will convert all-number columns to a numerical type. Even if we convert numerical columns back to objects later, we may lose zero-padding in the process; the columns must be correctly read in from the get-go. When we compare the individual train/val/test columns to self.slide_data['slide_id'] in the get_split_from_df() method, we cannot compare objects (strings) to numbers or even to incorrectly zero-padded objects/strings. An example of this breaking is shown in https://github.com/andrew-weisman/clam_analysis/tree/main/datatype_comparison_bug-2021-12-01.
             train_split = self.get_split_from_df(all_splits, 'train')
             val_split = self.get_split_from_df(all_splits, 'val')
             test_split = self.get_split_from_df(all_splits, 'test')
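
The behavior described in the added comment can be reproduced in isolation. The following is a minimal sketch, not code from the repository: the CSV contents, the `slide_ids` series (a stand-in for `self.slide_data['slide_id']`), and the use of `isin` as the comparison are all assumptions made for illustration; the actual matching logic lives in `get_split_from_df()`.

```python
import io

import pandas as pd

# Hypothetical splits file: slide IDs look numeric and are zero-padded.
csv_text = "train,val,test\n001,003,004\n002,,\n"

# Stand-in for self.slide_data['slide_id']: IDs stored as strings (object dtype).
slide_ids = pd.Series(["001", "002", "003", "004"], dtype=object)

# Default behavior: read_csv infers a numeric dtype, so "001" becomes 1.
inferred = pd.read_csv(io.StringIO(csv_text))

# Passing the dtype of the reference column keeps the IDs as strings,
# so the zero-padding survives.
explicit = pd.read_csv(io.StringIO(csv_text), dtype=slide_ids.dtype)

# Comparing string IDs against the inferred (numeric) column matches nothing.
matches_inferred = slide_ids.isin(inferred["train"])

# Comparing against the explicitly typed column matches "001" and "002".
matches_explicit = slide_ids.isin(explicit["train"])

print(inferred["train"].dtype, int(matches_inferred.sum()))
print(explicit["train"].tolist(), int(matches_explicit.sum()))
```

Casting the inferred column back to strings afterwards would yield "1" and "2" rather than "001" and "002", which is why the dtype has to be set at read time.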