
Label Discretize Function Question #22

@GYDDHPY


Thank you for your contribution.

I have a question regarding the discretize function. It appears that the quantile binning step inside the discretization is computed over the entire dataset, including both the training and validation splits. This means that event times from the validation set also help determine the quantile boundaries. Could this lead to data leakage, since information from the validation set influences the labels used during training?

def _discretize_survival_months(self, eps, uncensored_df):
    r"""
    This is where we convert the regression survival problem into a classification problem. We bin all survival times into
    quartiles and assign labels to patients based on these bins.
    Args:
        - self
        - eps : Float
        - uncensored_df : pd.DataFrame
    Returns:
        - None
    """
    # cut the uncensored data into self.n_bins quantile bins (4 = quartiles)
    disc_labels, q_bins = pd.qcut(uncensored_df[self.label_col], q=self.n_bins, retbins=True, labels=False)
    # widen the outer edges so every survival time in the full dataset falls inside a bin
    q_bins[-1] = self.label_data[self.label_col].max() + eps
    q_bins[0] = self.label_data[self.label_col].min() - eps
    # assign patients to bins according to the quantile edges computed above (on all data)
    # cut uses the given edges as-is, so each bin may have a different frequency
    disc_labels, q_bins = pd.cut(self.patients_df[self.label_col], bins=q_bins, retbins=True, labels=False, right=False, include_lowest=True)
    self.patients_df.insert(2, 'label', disc_labels.values.astype(int))
    self.bins = q_bins
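
For comparison, here is a minimal sketch of what I would expect a leakage-free variant to look like: the bin edges are fitted on the uncensored training times only and then reused unchanged for the validation split. The helper names `fit_bins` and `apply_bins` and the toy data are hypothetical and not part of this repository, and opening the outer edges to infinity is my own substitution for the `eps` padding above.

```python
import numpy as np
import pandas as pd


def fit_bins(train_uncensored_times: pd.Series, n_bins: int = 4) -> np.ndarray:
    """Compute quantile bin edges from uncensored TRAINING survival times only."""
    _, q_bins = pd.qcut(train_uncensored_times, q=n_bins, retbins=True, labels=False)
    q_bins = q_bins.astype(float)
    # Assumption: open the outer edges so unseen validation times always land
    # in a bin without looking at the validation min/max.
    q_bins[0] = -np.inf
    q_bins[-1] = np.inf
    return q_bins


def apply_bins(times: pd.Series, q_bins: np.ndarray) -> pd.Series:
    """Assign each survival time to a bin using previously fitted edges."""
    return pd.cut(times, bins=q_bins, labels=False, right=False).astype(int)


# Toy survival times in months (made up for illustration).
train_times = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
val_times = pd.Series([0.5, 3.0, 100.0])

bins = fit_bins(train_times, n_bins=4)    # edges depend on the training split only
train_labels = apply_bins(train_times, bins)
val_labels = apply_bins(val_times, bins)  # validation data never changes the edges
```

With this layout the edges would have to be refitted per fold, so the labels of a given patient can differ between folds. I am not sure whether that is compatible with how the splits are consumed elsewhere in the code.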
