Implemented decoding for numerical stats mixin and integer profiles #844
| profile = super().load_from_dict(data) | ||
| profile._load_hist_helper(data) | ||
| quantiles = data.pop("quantiles") | ||
| quantiles_dict = {int(key): quantiles[key] for key in quantiles.keys()} |
For quantiles, I think we will have to cast the keys to whatever type the specific profiler's load_from_dict is using, i.e.
the float profiler's keys will have to be cast to floats. This is because the keys are decoded as strings by default when loaded in.
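To illustrate the point about keys, here is a minimal sketch using only the standard `json` module. The `quantiles` values and the `cast_keys` helper are hypothetical and are not part of the PR; they only show that JSON object keys come back as strings and need a per-profiler cast.

```python
import json

# Hypothetical quantiles mapping with integer keys.
quantiles = {0: 1.0, 1: 2.5, 2: 4.0}

# JSON object keys are always strings, so integer keys do not survive a round trip.
decoded = json.loads(json.dumps(quantiles))


def cast_keys(mapping, key_type):
    """Return a copy of mapping with every key converted by key_type (hypothetical helper)."""
    return {key_type(key): value for key, value in mapping.items()}


# An integer profile would cast with int; a float profile would pass float instead.
restored = cast_keys(decoded, int)
```

Passing the key type as an argument is one way to keep the cast out of the shared helper, which matches the concern that the keys will not always be ints.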
| profile._load_hist_helper(data) | ||
| quantiles = data.pop("quantiles") | ||
| quantiles_dict = {int(key): quantiles[key] for key in quantiles.keys()} | ||
| profile.quantiles = quantiles_dict |
wonder if this can be moved into a numeric_stats_mixin method named load_from_dict if it isn't just unique to int profiles.
Yeah, this theoretically could move to the numeric column stats function _load_hist_helper, but I don't think the casting is only ever going to be ints
| if key == "histogram": | ||
| value = { | ||
| x: np.array(hist[key][x]) if hist[key][x] is not None else None | ||
| for x in hist[key].keys() | ||
| } | ||
| self._stored_histogram[key] = value |
wonder if this can be moved into a numeric_stats_mixin method named load_from_dict if it isn't just unique to int profiles.
... wait this is a numeric stats mixin function?
I think we don't want to have a load_from_dict function here, considering this is mostly an abstract class
…te test for encode/decode. fixed profile comparison to include loose type comparison between floats and float64s
| f"Object {type(profile)} has no attribute {function}." | ||
| ) | ||
| value[metric] = getattr(profile, function) | ||
| value[metric] = getattr(profile, function).__func__ |
Needed because the function must be set as a function and not a bound method
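A minimal sketch of the difference, assuming a hypothetical `Profile` class with a `_get_min` method (neither is taken from the PR): `getattr` on an instance returns a bound method, while `.__func__` gives the plain function underneath it.

```python
import types


class Profile:
    """Hypothetical stand-in for a column profiler."""

    def _get_min(self):
        return 0


profile = Profile()

# getattr on an instance yields a bound method that carries the instance with it.
bound = getattr(profile, "_get_min")

# __func__ unwraps the bound method to the plain function defined on the class.
plain = getattr(profile, "_get_min").__func__

# The plain function is not tied to any instance, so the instance is passed explicitly.
result = plain(profile)
```

Storing the plain function keeps the restored profile from holding a reference to one particular instance inside its own lookup table.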
| :return: None | ||
| """ | ||
| self.match_count += profile.pop("match_count") | ||
| self.match_count += int(profile.pop("match_count")) |
Cast added to adhere to type specification in numeric stats mixin attribute initialization (avoids setting to np.int64)
| """ | ||
| BaseColumnProfiler._add_helper(self, other1, other2) | ||
| self.match_count = other1.match_count + other2.match_count | ||
| self.match_count = int(other1.match_count + other2.match_count) |
Cast added to adhere to type specification in numeric stats mixin attribute initialization (avoids setting to np.int64)
| self.quantiles: list[float] | dict = { | ||
| bin_num: None for bin_num in range(num_quantiles - 1) | ||
| } | ||
| self.quantiles: list[float] = [bin_num for bin_num in range(num_quantiles - 1)] |
Modified quantiles to be set to a list (previously it could be a list or a dictionary, which is ambiguous)
I think this is giving false information.
| ) -> None: | ||
| min_value = df_series.min() | ||
| self.min = min_value if not self.min else min(self.min, min_value) | ||
| self.min = float(min_value) if not self.min else float(min(self.min, min_value)) |
Cast added to adhere to type specification in numeric stats mixin attribute initialization (avoids setting to np.float64)
| ) -> None: | ||
| max_value = df_series.max() | ||
| self.max = max_value if not self.max else max(self.max, max_value) | ||
| self.max = float(max_value) if not self.max else float(max(self.max, max_value)) |
Cast added to adhere to type specification in numeric stats mixin attribute initialization (avoids setting to np.float64)
| subset_properties["sum"] = sum_value | ||
| self.sum = self.sum + sum_value | ||
| self.sum = float(self.sum + sum_value) |
Cast added to adhere to type specification in numeric stats mixin attribute initialization (avoids setting to np.float64)
| batch_count, | ||
| batch_biased_variance, | ||
| batch_mean, | ||
| self._biased_variance = float( |
Cast added to adhere to type specification in numeric stats mixin attribute initialization (avoids setting to np.float64)
| num_zeros_value = (df_series == 0).sum() | ||
| subset_properties["num_zeros"] = num_zeros_value | ||
| self.num_zeros = self.num_zeros + num_zeros_value | ||
| self.num_zeros = int(self.num_zeros + num_zeros_value) |
Cast added to adhere to type specification in numeric stats mixin attribute initialization (avoids setting to np.int64)
| ): | ||
| assert type(actual_value) == type(expected_value) | ||
| # Condition to test whether the types are equal when a value can be float or float64 | ||
| if type(actual_value) is np.float64 or type(expected_value) is np.float64: |
Loose type comparison for float and float64 for variables that allow for both types
| """ | ||
| actual_dict = actual.__dict__ | ||
| expected_dict = expected.__dict__ | ||
| actual_dict = actual.__dict__ if not isinstance(actual, dict) else actual |
Added for the recursive call when an object is a nested dictionary
| if isinstance(actual_value, (BaseProfiler, BaseColumnProfiler)): | ||
| assert_profiles_equal(actual_value, expected_value) | ||
| elif isinstance(actual_value, dict): |
Added for nested dictionary comparisons
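The recursion described in these two comments can be sketched roughly as follows. This is a simplified illustration, not the project's `assert_profiles_equal`: the `Stats` class and `assert_nested_equal` are hypothetical names, and the real helper also handles NumPy values and profiler types.

```python
class Stats:
    """Hypothetical stand-in for a profiler object that stores state in __dict__."""

    def __init__(self, **attributes):
        self.__dict__.update(attributes)


def assert_nested_equal(actual, expected):
    """Recursively compare two objects or dicts, descending into nested dicts and Stats."""
    # Accept either an object (compare its attributes) or a plain dict.
    actual_dict = actual if isinstance(actual, dict) else actual.__dict__
    expected_dict = expected if isinstance(expected, dict) else expected.__dict__

    assert actual_dict.keys() == expected_dict.keys()
    for key, actual_value in actual_dict.items():
        expected_value = expected_dict[key]
        if isinstance(actual_value, (dict, Stats)):
            # Nested containers are compared by recursing rather than by ==.
            assert_nested_equal(actual_value, expected_value)
        else:
            assert actual_value == expected_value, key


first = Stats(min=1.0, hist={"bins": {"count": 3}}, child=Stats(total=2))
second = Stats(min=1.0, hist={"bins": {"count": 3}}, child=Stats(total=2))
assert_nested_equal(first, second)
```

Recursing into dictionaries matters here because plain `==` on a dict that contains objects without `__eq__` falls back to identity, which would report two equal profiles as different.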
8e2d15f to 03a1faa
taylorfturner left a comment
blocking for 0.9.0 release
03a1faa to ec7c6c9
dismissing this because I missed that it is going to a feature branch
| self.assertIsNone(profile_column["statistics"]["min"]) | ||
| self.assertIsNone(profile_column["statistics"]["max"]) | ||
| self.assertTrue(np.isnan(profile_column["statistics"]["variance"])) | ||
| self.assertIsNone(profile_column["statistics"]["quantiles"][0]) |
* initial changes to categoricalColumn decoder (#818) * Implemented decoding for numerical stats mixin and integer profiles (#844) * hot fixes for encode and decode of numeric stats mixin and intcol profiler (#852) * Float column profiler encode decode (#854) * hot fixes for encode and decode of numeric stats mixin and intcol profiler * cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes * Added docstring to the _load_stats_helper function * Update dataprofiler/profilers/numerical_column_stats.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/numerical_column_stats.py * fix for nan values issue in pytesting * Implementation of float profiler encode and decode process --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Json decode date time column (#861) * more verbose error log with types for easy debug * add load_from_dict to handle tiimestamps * add json decode tests * include DateTimeColumn class * Added decoding for encoding of ordered column profiles (#864) * Added ordered col test to ensure correct response to update when different ordering of values is introduced (#868) * added decode text_column_profiler functionality and tests (#870) * Created encoder for the datalabelercolumn (#869) * feat: add test and compiler serialization (#884) * [WIP] Adds tests validating serialization with Primitive type for compiler (#885) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (#886) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * feat: add tests and allow primitive compiler to deserialize * fix: bug in numeric stats 
deserial * fix: missing `)` after conflict resolution * Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (#887) * fix: organize categorical and add get function * refactor: reorganize tests and add stats test * feat: order typing * feat: add serial and deserial for stats compiler * fix: bug when sample_size == 0 * ready datalabeler for deserialization and improvement on serialization for datalabeler (#879) * Deserialization of datalabeler (#891) * Added initial profiler decoding for datalabeler column (WIP) * Intialial implementation for deserialization of datalabelercolumn * Fix LSP violations (#840) * Make profiler superclasses generic Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and BaseCompiler generic, to avoid casting in subclass diff() methods and violating LSP in principle. * Add needed cast import --------- Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> * Encode Options (#875) * encode testing * encode dataLabeler testing * encode structuredOptions testing * cleaned up datalabeler test * added text options * [WIP] ColumnDataLabelerCompiler: serialize / deserialize (#888) * formatting * update formatting * setting up full test suite for DataLabelerCompiler * update isort * updates to test -- still failing * update * Quick Test update (#893) * update * string in list * formatting * Decode options (#894) * refactored options encode testing * updated test name * updated class names * fixing test * initial base option decode * inital tests * refactor: allow options to go through all (#902) * refactor: allow options to go through all * fix: bug * StructuredColProfiler Encode / Decode (#901) * refactor: allow options to go through all * fix: bug * update * update * update * updates * update * Fixes for taylors StructuredCol Issue * update * update * remove try/except --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: ksneab7 
<ksneab7@gmail.com> * fix: bug and add tests for structuredcolprofiler (#904) * fix: bug and add tests * fix: limit scipy requirements till problem understood and fixed * Stuctured profiler encode decode (#903) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> * [WIP] Added NoImplementationError for UnstructuredProfiler (#907) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile * test fix * mypy fixes for typing issues * fix for none case of the datalabler in options * Added mock of datalabeler to structured profile test * Added tests for encoding of the Structured profiler * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Pr fixes * Fixed typo in test * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/tests/profilers/utils.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Fixes for unneeeded callout for _profile check * 
small change --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> * Added testing for values for test_json_decode_after_update (#915) * Reuse passed labeler (#924) * refactor: loading labeler for reuse and abstract loading * refactor: use for DataLabelerColumn as well * fix: don't error if doesn't exist * refactor: allow for config dict to be passed entire way * fix: compiler tests * fix: structCol tests * fix: test * BaseProfiler save() for json (#923) * added save for top level and tests * small refactor * small fix * refactor: use seed for sample for consistency (#927) * refactor: use seed for sample for consistency * fix: formatting and variables * WIP top level load (#925) * quick hot fix for input validation on save() save_metho (#931) * BaseProfiler: `load_method` hotfix (#932) * added load_method * updated tests * fix: null_rep mat should calculate even if datetime (#933) * Notebook Example save/load Profile (#930) * update example data profiler demo save/load * update notebook cells * Update examples/data_profiler_demo.ipynb * Update examples/data_profiler_demo.ipynb * fix: order bug (#939) * fix: typo on rebase * fix: typing and bugs from rebase * fix: options tests due to merge and loading new options --------- Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com>
* feat: add dev to workfow for testing (#897) * Reservoir sampling (#826) * add code for reservoir sampling and insert sample_nrows options * pre commit fix * add tests for reservoir sampling * fixed mypy issues * fix import to relative path --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Richard Bann <richard@bann.com> * [WIP] staging/dev/options (#909) * New preset implementation and test (#867) * memory optimization preset ttrying again ttrying again 3 ttrying again 4 accidentally pushed my updated makefile * Wrote catch for invalid presets, wrote test for catch for invalid presets, debugged new optimization preset * Forgot to run pre-commit, fixed those issues * black doing weird things * made preset validation more maintainable by moving it to the constructor and getting rid of preset list * RowStatisticsOptions: Add option (#865) * RowStatisticsOptions: Add null row count Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics. * Unit test for RowStatisticOptions: * Black formatting * RowStatisticsOptions: Add null row count Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics. * Unit test for RowStatisticOptions: * Black formatting * added a unit test for RowStatisticsOptions * Deleted test cases that were written in the wrong file * updated testing for null_count toggle in _update_row_statistics * removed the RowStatisticsOptions from test_profiler_options imports * add line * Created toggle option for null_count * RowStatisticsOptions: Add implementation * Revert "RowStatisticsOptions: Add implementation" This reverts commit 2da6a93. 
* RowStatsticsOptions: Create option * fixed pre-commit error * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * fixed documentation --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Preset test updated w new names and different toggles (#880) * memory optimization preset ttrying again ttrying again 3 ttrying again 4 accidentally pushed my updated makefile * trying * trying * black doing weird things * trying * made preset validation more maintainable by moving it to the constructor and getting rid of preset list * Update to open-source in prep for wrapper changes for mem op preset * updated preset toggles and preset name (mem op -> large data) * updated tests to match * continued name and test and toggle updates * fix comments * RowStatisticsOptions: Implementing option (#871) * Implementing option * Implementing option * took out redundant if statement. added test case for when null_count is disabled. * attempt to check for conflicts between profile merges * added test to check if two profilers have null_count enabled before merging them together * fixed typo and added a trycatch to prevent failing test * No mocks needed. Fixed assertRaisesRegex error * Changed variables names and added a new test to check for check the null_count when null_count is disabled. * Changed name of test, moved tests to TestStructuredProfilerRowStatistics. Fixed position of if statement to prevent unnecessary code from running. 
* added null_count test cases * fixed indentation mistake * fixed typo * removed a useless commented a line * Updated test name * update --------- Co-authored-by: Liz Smith <liz.smith@richmond.edu> Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com> * Cms for categorical (#892) * WIP cms implementation * add heavy hitters implementation * add heavy hitters implementation * WIP: mypy issue * WIP: mypy issue * add cms bool and refactor options handler * WIP: testing for CMS * WIP: testing for CMS * use new heavy_hitters_threshold, add test for it * Reservoir sampling refactor (#910) * refactored all but tests * removed some superfluous tests * moved variables around * Staging/dev/profile serialization (#940) * initial changes to categoricalColumn decoder (#818) * Implemented decoding for numerical stats mixin and integer profiles (#844) * hot fixes for encode and decode of numeric stats mixin and intcol profiler (#852) * Float column profiler encode decode (#854) * hot fixes for encode and decode of numeric stats mixin and intcol profiler * cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes * Added docstring to the _load_stats_helper function * Update dataprofiler/profilers/numerical_column_stats.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/numerical_column_stats.py * fix for nan values issue in pytesting * Implementation of float profiler encode and decode process --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Json decode date time column (#861) * more verbose error log with types for easy debug * add load_from_dict to handle tiimestamps * add json decode tests * include DateTimeColumn class * Added decoding for encoding of ordered column profiles (#864) * Added ordered col test to ensure correct response to update when different ordering of values is introduced (#868) * added decode text_column_profiler functionality 
and tests (#870) * Created encoder for the datalabelercolumn (#869) * feat: add test and compiler serialization (#884) * [WIP] Adds tests validating serialization with Primitive type for compiler (#885) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (#886) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * feat: add tests and allow primitive compiler to deserialize * fix: bug in numeric stats deserial * fix: missing `)` after conflict resolution * Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (#887) * fix: organize categorical and add get function * refactor: reorganize tests and add stats test * feat: order typing * feat: add serial and deserial for stats compiler * fix: bug when sample_size == 0 * ready datalabeler for deserialization and improvement on serialization for datalabeler (#879) * Deserialization of datalabeler (#891) * Added initial profiler decoding for datalabeler column (WIP) * Intialial implementation for deserialization of datalabelercolumn * Fix LSP violations (#840) * Make profiler superclasses generic Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and BaseCompiler generic, to avoid casting in subclass diff() methods and violating LSP in principle. 
* Add needed cast import --------- Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> * Encode Options (#875) * encode testing * encode dataLabeler testing * encode structuredOptions testing * cleaned up datalabeler test * added text options * [WIP] ColumnDataLabelerCompiler: serialize / deserialize (#888) * formatting * update formatting * setting up full test suite for DataLabelerCompiler * update isort * updates to test -- still failing * update * Quick Test update (#893) * update * string in list * formatting * Decode options (#894) * refactored options encode testing * updated test name * updated class names * fixing test * initial base option decode * inital tests * refactor: allow options to go through all (#902) * refactor: allow options to go through all * fix: bug * StructuredColProfiler Encode / Decode (#901) * refactor: allow options to go through all * fix: bug * update * update * update * updates * update * Fixes for taylors StructuredCol Issue * update * update * remove try/except --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * fix: bug and add tests for structuredcolprofiler (#904) * fix: bug and add tests * fix: limit scipy requirements till problem understood and fixed * Stuctured profiler encode decode (#903) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> * [WIP] Added NoImplementationError for UnstructuredProfiler (#907) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured 
profile * test fix * mypy fixes for typing issues * fix for none case of the datalabler in options * Added mock of datalabeler to structured profile test * Added tests for encoding of the Structured profiler * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Pr fixes * Fixed typo in test * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/tests/profilers/utils.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Fixes for unneeeded callout for _profile check * small change --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> * Added testing for values for test_json_decode_after_update (#915) * Reuse passed labeler (#924) * refactor: loading labeler for reuse and abstract loading * refactor: use for DataLabelerColumn as well * fix: don't error if doesn't exist * refactor: allow for config dict to be passed entire way * fix: compiler tests * fix: structCol tests * fix: test * BaseProfiler save() for json (#923) * added save for top level and tests * small refactor * small fix * refactor: use seed for sample for consistency (#927) * refactor: use seed for sample for consistency * fix: formatting and variables * WIP top level load (#925) * 
quick hot fix for input validation on save() save_metho (#931) * BaseProfiler: `load_method` hotfix (#932) * added load_method * updated tests * fix: null_rep mat should calculate even if datetime (#933) * Notebook Example save/load Profile (#930) * update example data profiler demo save/load * update notebook cells * Update examples/data_profiler_demo.ipynb * Update examples/data_profiler_demo.ipynb * fix: order bug (#939) * fix: typo on rebase * fix: typing and bugs from rebase * fix: options tests due to merge and loading new options --------- Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * Hotfix: fix post feature serialization merge (#942) * fix: to use config instead of options * fix: comment * fix: maxdiff * version bump (#944) --------- Co-authored-by: JGSweets <JGSweets@users.noreply.github.com> Co-authored-by: Rushabh Vinchhi <rushabhuvinchhi@gmail.com> Co-authored-by: Richard Bann <richard@bann.com> Co-authored-by: Liz Smith <liz.smith@richmond.edu> Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com>