Label Discretize Function Question

Thank you for your contribution. 

I have a question about the discretize function (_discretize_survival_months). The quantile binning step appears to be applied to the entire dataset, including both the training and validation splits, so event times from the validation set also help determine the bin boundaries. Could this lead to data leakage, since information from the validation set influences the discretization applied during training?

https://github.com/mahmoodlab/SurvPath/blob/3f73ddd6705ec67d643020c5bb04fb13f9f382cc/datasets/dataset_survival.py#L238-L261

	def _discretize_survival_months(self, eps, uncensored_df):
		r"""
		This is where we convert the regression survival problem into a classification problem. We bin all survival times into
		quartiles and assign labels to patient based on these bins.

		Args:
			- self
			- eps : Float
			- uncensored_df : pd.DataFrame

		Returns:
			- None

		"""
		# cut the data into self.n_bins (4= quantiles)
		disc_labels, q_bins = pd.qcut(uncensored_df[self.label_col], q=self.n_bins, retbins=True, labels=False)
		q_bins[-1] = self.label_data[self.label_col].max() + eps
		q_bins[0] = self.label_data[self.label_col].min() - eps

		# assign patients to different bins according to their months' quantiles (on all data)
		# cut will choose bins so that the values of bins are evenly spaced. Each bin may have different frequncies
		disc_labels, q_bins = pd.cut(self.patients_df[self.label_col], bins=q_bins, retbins=True, labels=False, right=False, include_lowest=True)
		self.patients_df.insert(2, 'label', disc_labels.values.astype(int))
		self.bins = q_bins
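
To make the concern concrete, here is a minimal sketch of what a leakage-free variant might look like: the bin edges are estimated from uncensored training patients only and then reused unchanged for the validation split. This is not the repository's code. The helper names (fit_bins_on_train, apply_bins), the column names (survival_months, censorship), and the convention that censorship == 0 marks an uncensored patient are all assumptions made for illustration. The outer edges are opened to infinity so that validation times outside the training range still fall into the first or last bin.

```python
import numpy as np
import pandas as pd


def fit_bins_on_train(train_df, label_col="survival_months",
                      censor_col="censorship", n_bins=4):
    """Estimate quantile bin edges from uncensored TRAINING patients only.

    Column names and the censorship == 0 convention are assumptions.
    """
    uncensored = train_df.loc[train_df[censor_col] == 0, label_col]
    _, edges = pd.qcut(uncensored, q=n_bins, retbins=True, labels=False)
    edges = np.asarray(edges, dtype=float).copy()
    # Open the outer edges so unseen (validation) times always land in a bin.
    edges[0], edges[-1] = -np.inf, np.inf
    return edges


def apply_bins(df, edges, label_col="survival_months"):
    """Assign each patient to a bin using precomputed edges (left-closed)."""
    labels = pd.cut(df[label_col], bins=edges, labels=False, right=False)
    return labels.astype(int)


# Toy data: the validation split never influences the edges.
train_df = pd.DataFrame({
    "survival_months": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "censorship": [0, 0, 0, 0, 0, 0, 0, 0],
})
val_df = pd.DataFrame({"survival_months": [0.5, 3.0, 5.0, 100.0]})

edges = fit_bins_on_train(train_df)
train_labels = apply_bins(train_df, edges)
val_labels = apply_bins(val_df, edges)
```

With this split, the interior edges depend only on train_df, so changing the validation event times cannot move the boundaries used to label the training patients.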

Label Discretize Function Question #22
