Properly normalize column names in Utils.GetSampleData() for duplicate cases #5280
Fix #5267
This PR fixes a bug where column names generated from inline data were normalized directly through Utils.Normalize(), which only fixes the naming of a single column name and does not take into account duplicate column names that may exist in a dataset. PR #5177 introduced a fix for these duplicate column names: it appends the differentiator suffix '_col_x', where 'x' represents the dataset load order of the given column.

In this PR I have moved the generation of distinct, unique column names out of Utils.GenerateClassLabels() and into its own function, Utils.GenerateColumnNames(). This lets the same logic be used in Utils.GenerateSampleData, which previously resulted in exceptions. Now, column names from inline data are properly normalized, and duplicate column names are handled.

This PR also adds a unit test covering duplicate column names with Utils.GenerateSampleData.
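
For reviewers who want a quick picture of the intended behavior, here is a minimal sketch in Python, not the project's actual C# code. The names `normalize` and `generate_column_names` are hypothetical stand-ins for Utils.Normalize() and Utils.GenerateColumnNames(). The sketch assumes that the first occurrence of a name is left unchanged, that 'x' is the zero-based position of the column, and that normalization simply replaces non-identifier characters with underscores. None of these details are confirmed by the description above.

```python
import re


def normalize(name: str) -> str:
    """Hypothetical stand-in for Utils.Normalize(): make one name identifier-safe."""
    # Replace every character that is not a letter, digit, or underscore.
    cleaned = re.sub(r"\W", "_", name)
    # Identifiers cannot be empty or start with a digit, so prefix an underscore.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def generate_column_names(raw_names):
    """Hypothetical stand-in for Utils.GenerateColumnNames().

    Normalizes each name, then appends '_col_<index>' to any name that
    has already been used, so the returned names are all distinct.
    """
    seen = set()
    result = []
    for index, raw in enumerate(raw_names):
        name = normalize(raw)
        # Assumption: only later duplicates are suffixed; loop in case the
        # suffixed name itself collides with an earlier column.
        while name in seen:
            name = f"{name}_col_{index}"
        seen.add(name)
        result.append(name)
    return result


if __name__ == "__main__":
    print(generate_column_names(["age", "age", "first name", "first_name"]))
```

Normalizing first and de-duplicating second matters here: two raw names that differ only in punctuation can collapse to the same normalized name, which normalizing each column on its own would not catch.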