Merge pull request mahmoodlab#90 from andrew-weisman/datatype_comparison_bug-2021-12-01

Datatype comparison bug 2021-12-01
msk-mind · Dec 2, 2021 · 88da7ca
2 parents 21c5bb2 + 5475cab
commit 88da7ca
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/datasets/dataset_generic.py b/datasets/dataset_generic.py
@@ -244,7 +244,7 @@ def return_splits(self, from_id=True, csv_path=None):
 
 		else:
 			assert csv_path 
-			all_splits = pd.read_csv(csv_path)
+			all_splits = pd.read_csv(csv_path, dtype=self.slide_data['slide_id'].dtype)  # Without "dtype=self.slide_data['slide_id'].dtype", read_csv() will convert all-number columns to a numerical type. Even if we convert numerical columns back to objects later, we may lose zero-padding in the process; the columns must be correctly read in from the get-go. When we compare the individual train/val/test columns to self.slide_data['slide_id'] in the get_split_from_df() method, we cannot compare objects (strings) to numbers or even to incorrectly zero-padded objects/strings. An example of this breaking is shown in https://github.com/andrew-weisman/clam_analysis/tree/main/datatype_comparison_bug-2021-12-01.
 			train_split = self.get_split_from_df(all_splits, 'train')
 			val_split = self.get_split_from_df(all_splits, 'val')
 			test_split = self.get_split_from_df(all_splits, 'test')
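The behavior described in the added comment can be reproduced outside the repository. The following is a minimal sketch, not the project's code: the CSV contents, the `slide_ids` series, and the helpers `load_splits` and `count_matches` are all hypothetical, and `count_matches` only approximates the kind of membership comparison the comment attributes to get_split_from_df().

```python
import io

import pandas as pd

# Hypothetical splits file: slide IDs are zero-padded, all-digit strings.
CSV_TEXT = "train,val,test\n001,003,004\n002,,\n"

# Hypothetical stand-in for self.slide_data['slide_id'] (string IDs, object dtype).
slide_ids = pd.Series(["001", "002", "003", "004"], name="slide_id", dtype=object)


def load_splits(text, dtype=None):
    """Read the splits table, optionally forcing one dtype on every column."""
    return pd.read_csv(io.StringIO(text), dtype=dtype)


def count_matches(splits, column):
    """Count how many slide IDs appear in the given split column."""
    split = splits[column].dropna()
    return int(slide_ids.isin(split.tolist()).sum())


# Default type inference: all-digit columns become numeric, so "001" turns into 1.
inferred = load_splits(CSV_TEXT)

# Reusing the slide_id dtype (object here) keeps the values as the original strings.
as_text = load_splits(CSV_TEXT, dtype=slide_ids.dtype)

print(inferred["train"].tolist(), count_matches(inferred, "train"))
print(as_text["train"].tolist(), count_matches(as_text, "train"))
```

With inferred types, the train column holds integers and none of the string IDs match. With the dtype passed through, the zero-padded strings survive and both train IDs match. Note that passing a single dtype to read_csv() applies it to every column in the file, which is acceptable here only because every column holds slide IDs.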