Thank you for your contribution.
I have a question regarding the discretize function (_discretize_survival_months). It appears that the quantile binning step is applied to the entire dataset, including both the training and validation splits. This means that event times from the validation set also help determine the quantile boundaries. Could this lead to data leakage, since information from the validation set influences the labels used during training?
def _discretize_survival_months(self, eps, uncensored_df):
    r"""
    This is where we convert the regression survival problem into a classification problem. We bin all survival times into
    quartiles and assign labels to patient based on these bins.

    Args:
        - self
        - eps : Float
        - uncensored_df : pd.DataFrame

    Returns:
        - None

    """
    # cut the data into self.n_bins (4= quantiles)
    disc_labels, q_bins = pd.qcut(uncensored_df[self.label_col], q=self.n_bins, retbins=True, labels=False)
    q_bins[-1] = self.label_data[self.label_col].max() + eps
    q_bins[0] = self.label_data[self.label_col].min() - eps

    # assign patients to different bins according to their months' quantiles (on all data)
    # cut will choose bins so that the values of bins are evenly spaced. Each bin may have different frequncies
    disc_labels, q_bins = pd.cut(self.patients_df[self.label_col], bins=q_bins, retbins=True, labels=False, right=False, include_lowest=True)
    self.patients_df.insert(2, 'label', disc_labels.values.astype(int))
    self.bins = q_bins
SurvPath/datasets/dataset_survival.py
Lines 238 to 261 in 3f73ddd
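To make the concern concrete, here is a minimal sketch of what a leakage-free variant might look like: the quantile edges are fit on the uncensored training patients only and then reused unchanged for the validation split. This is not the repository's code. The helper names fit_bins and apply_bins, the column names survival_months and censorship, and the convention that censorship == 0 marks an observed event are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def fit_bins(train_df, label_col, censor_col, n_bins=4):
    """Compute quantile bin edges from uncensored *training* patients only.

    Hypothetical helper; assumes censor_col == 0 means the event was observed.
    """
    uncensored = train_df[train_df[censor_col] == 0]
    _, q_bins = pd.qcut(uncensored[label_col], q=n_bins, retbins=True, labels=False)
    # Open the outer edges so times outside the training range
    # (e.g. in the validation split) still fall into the first or last bin.
    q_bins[0] = -np.inf
    q_bins[-1] = np.inf
    return q_bins


def apply_bins(df, label_col, q_bins):
    """Assign each patient a bin index using previously fitted edges."""
    labels = pd.cut(df[label_col], bins=q_bins, labels=False, right=False)
    return labels.astype(int)


# Toy data: the edges depend only on `train`, never on `val`.
train = pd.DataFrame({
    "survival_months": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "censorship": [0] * 8,
})
val = pd.DataFrame({
    "survival_months": [0.5, 100.0],
    "censorship": [0, 1],
})

bins = fit_bins(train, "survival_months", "censorship")
train_labels = apply_bins(train, "survival_months", bins)
val_labels = apply_bins(val, "survival_months", bins)
```

With this split, changing the validation event times cannot move the bin boundaries, so the training labels stay the same across any choice of validation data. In a cross-validation setup the edges would need to be refit for each fold.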