Staging/dev/profile serialization #940
* hot fixes for encode and decode of numeric stats mixin and intcol profiler
* cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes
* Added docstring to the _load_stats_helper function
* Update dataprofiler/profilers/numerical_column_stats.py
  Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
* Update dataprofiler/profilers/numerical_column_stats.py
* fix for nan values issue in pytesting
* Implementation of float profiler encode and decode process
---------
Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
* more verbose error log with types for easy debug
* add load_from_dict to handle timestamps
* add json decode tests
* include DateTimeColumn class
…erent ordering of values is introduced (capitalone#868)
…piler (capitalone#885)
* feat: add test and compiler serialization
* fix: move primitive tests to own class
* feat: add primitive col compiler save tests
* fix: float serializers asserts
… fixes numerical deserialization (capitalone#886)
* feat: add test and compiler serialization
* fix: move primitive tests to own class
* feat: add primitive col compiler save tests
* fix: float serializers asserts
* feat: add tests and allow primitive compiler to deserialize
* fix: bug in numeric stats deserial
* fix: missing `)` after conflict resolution
…refactors for order Typing (capitalone#887)
* fix: organize categorical and add get function
* refactor: reorganize tests and add stats test
* feat: order typing
* feat: add serial and deserial for stats compiler
* fix: bug when sample_size == 0
…n for datalabeler (capitalone#879)
* Added initial profiler decoding for datalabeler column (WIP)
* Initial implementation for deserialization of datalabelercolumn
* Fix LSP violations (capitalone#840)
* Make profiler superclasses generic
  Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and BaseCompiler generic, to avoid casting in subclass diff() methods and violating LSP in principle.
* Add needed cast import
---------
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
* encode testing
* encode dataLabeler testing
* encode structuredOptions testing
* cleaned up datalabeler test
* added text options
* update
* string in list
* formatting
* refactored options encode testing
* updated test name
* updated class names
* fixing test
* initial base option decode
* initial tests
* refactor: allow options to go through all
* fix: bug
* refactor: allow options to go through all
* fix: bug
* update
* update
* update
* updates
* update
* Fixes for taylors StructuredCol Issue
* update
* update
* remove try/except
---------
Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com>
Co-authored-by: ksneab7 <ksneab7@gmail.com>
* fix: bug and add tests
* fix: limit scipy requirements till problem understood and fixed
* refactor: allow options to go through all
* fix: bug in loading options
* update
* update
* Fixes for taylors StructuredCol Issue
* Created load and save code from structuredprofiler
* intermediate commit for fixing structured profile
---------
Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com>
Co-authored-by: taylorfturner <taylorfturner@gmail.com>
…e#907)
* refactor: allow options to go through all
* fix: bug in loading options
* update
* update
* Fixes for taylors StructuredCol Issue
* Created load and save code from structuredprofiler
* intermediate commit for fixing structured profile
* test fix
* mypy fixes for typing issues
* fix for none case of the datalabeler in options
* Added mock of datalabeler to structured profile test
* Added tests for encoding of the Structured profiler
* Update dataprofiler/profilers/json_decoder.py
  Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
* Update dataprofiler/profilers/profile_builder.py
  Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
* Update dataprofiler/profilers/profiler_options.py
  Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
* Pr fixes
* Fixed typo in test
* Update dataprofiler/profilers/json_decoder.py
  Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
* Update dataprofiler/profilers/profile_builder.py
  Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
* Update dataprofiler/tests/profilers/utils.py
  Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
* Update dataprofiler/profilers/profile_builder.py
  Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
* Fixes for unneeded callout for _profile check
* small change
---------
Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com>
Co-authored-by: taylorfturner <taylorfturner@gmail.com>
Co-authored-by: ksneab7 <ksneab7@gmail.com>
Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
* refactor: loading labeler for reuse and abstract loading
* refactor: use for DataLabelerColumn as well
* fix: don't error if doesn't exist
* refactor: allow for config dict to be passed entire way
* fix: compiler tests
* fix: structCol tests
* fix: test
* added save for top level and tests
* small refactor
* small fix
* refactor: use seed for sample for consistency
* fix: formatting and variables
| StructuredProfiler, | ||
| UnstructuredProfiler, | ||
| ) | ||
| from .profiler_options import ( |
updated for dev options that weren't in profiler serialization
| ColumnStatsProfileCompiler.__name__: ColumnStatsProfileCompiler, | ||
| } | ||
| json_decoder._options = { |
updated these from dev
| categories.pop(cat) | ||
| return cms3, categories, max_num_heavy_hitters | ||
| def _get_categories_full(self, df_series) -> dict: |
added description and renamed method
| self._cms_max_num_heavy_hitters, | ||
| ) | ||
| else: | ||
| category_count = df_series.value_counts(dropna=False).to_dict() |
fixed to use method
| from datetime import datetime | ||
| from multiprocessing.pool import Pool | ||
| from typing import Any, Generator, List, Optional, cast | ||
| from typing import Any, Generator, List, Optional, TypeVar, cast |
| "times": self.times, | ||
| } | ||
| self._save_helper(filepath, data_dict) | ||
| save_method = save_method.lower() |
fixed this for new method
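The hunk above lower-cases `save_method` before branching on `"pickle"`. A minimal sketch of that normalize-then-validate pattern follows. The helper name `normalize_save_method` and the accepted set `("pickle", "json")` are assumptions for illustration; only `"pickle"` is visible in the diff shown here.

```python
from typing import Tuple

# Hypothetical set of accepted values; only "pickle" appears in the diff above.
ALLOWED_SAVE_METHODS: Tuple[str, ...] = ("pickle", "json")


def normalize_save_method(save_method: str) -> str:
    """Lower-case the requested method and reject anything unsupported."""
    if not isinstance(save_method, str):
        raise TypeError("save_method must be a string")
    normalized = save_method.lower()
    if normalized not in ALLOWED_SAVE_METHODS:
        raise ValueError(
            f"save_method must be one of {ALLOWED_SAVE_METHODS}, got {save_method!r}"
        )
    return normalized
```

Normalizing once at the top means every later comparison can use the lower-case literal, which is what the `if save_method == "pickle":` branch in the diff relies on.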
| except Exception: | ||
| data["_col_name_to_idx"] = defaultdict(list, data["_col_name_to_idx"]) | ||
| data["hashed_row_object"] = { |
fixed all things with hashed_row_dict -> hashed_row_object
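The same hunk re-wraps `data["_col_name_to_idx"]` in `defaultdict(list, ...)` after loading. A small sketch of why that step is needed, using the standard `json` module and a made-up index (the column names are illustrative, not taken from the PR):

```python
import json
from collections import defaultdict

# Hypothetical column-name index, similar in shape to `_col_name_to_idx`.
original = defaultdict(list)
original["age"].append(0)
original["name"].append(1)

# json.dumps accepts a defaultdict (it is a dict subclass), but json.loads
# returns a plain dict, so the default factory is gone after a round trip.
loaded = json.loads(json.dumps(original))

# Re-wrap the plain dict so missing keys yield an empty list again.
restored = defaultdict(list, loaded)
```

Without the re-wrap, looking up a column name that is not in the index would raise `KeyError` on the loaded object instead of returning an empty list.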
| self._save_helper(filepath, data_dict) | ||
| save_method = save_method.lower() | ||
| if save_method == "pickle": |
| class HyperLogLogOptions(BaseOption): | ||
| class HyperLogLogOptions(BaseOption["HyperLogLogOptions"]): |
fixed naming for new options
| self.chi2_homogeneity = BooleanOption(is_enabled=True) | ||
| self.null_replication_metrics = BooleanOption(is_enabled=False) | ||
| self.row_statistics = RowStatisticsOptions() | ||
| self.multiprocess: BooleanOption = BooleanOption() |
fixed some of these types
| options2.is_enabled = False | ||
| self.assertEqual(options, options2) | ||
| def test_json_encode(self): |
fixed this to use get_options otherwise it wouldn't raise errors for new tests that weren't properly updated
| def test_eq(self): | ||
| super().test_eq() | ||
| def test_json_encode(self): |
| serialized = json.dumps(option, cls=ProfileEncoder) | ||
| expected = { |
fixed for new options
| options2.bin_count_or_method = "sturges" | ||
| self.assertEqual(options, options2) | ||
| def test_json_encode(self): |
| options2.register_count = 1 | ||
| self.assertEqual(options, options2) | ||
| def test_json_encode(self): |
| options2.is_enabled = False | ||
| self.assertEqual(options, options2) | ||
| def test_json_encode(self): |
| null_values={"str": 1}, column_null_values={2: {"other_str": 5}} | ||
| ) | ||
| serialized = json.dumps(option, cls=ProfileEncoder) |
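This test builds the option with an integer column key (`column_null_values={2: {...}}`), while the expected JSON in this PR shows that key as the string `"2"`. That is standard `json` behavior rather than anything specific to `ProfileEncoder`. A short sketch with the plain encoder follows; the helper `restore_int_keys` is hypothetical and only illustrates one way a loader could undo the coercion.

```python
import json
from typing import Any, Dict

column_null_values = {2: {"other_str": 5}}

# JSON object keys must be strings, so json.dumps coerces the int key 2 to "2".
serialized = json.dumps(column_null_values)
decoded = json.loads(serialized)


def restore_int_keys(mapping: Dict[str, Any]) -> Dict[int, Any]:
    """Hypothetical helper: convert top-level string keys back to ints."""
    return {int(key): value for key, value in mapping.items()}


restored = restore_int_keys(decoded)
```

A decoder that needs the original key type has to convert it back explicitly, since the type information is not stored in the JSON text.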
| test_root_path = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) | ||
| def setup_save_mock_open(mock_open): |
fixed lower tests to use this
| # hashed_row_object due to specificity of values | ||
| serialized_hashed_row_object = serialized_dict["data"].pop("hashed_row_object") |
fixed all cases in this file to use hashed_row_object
| :param serialized_json: JSON representation of column profiler that was | ||
| serialized using the custom encoder in profilers.json_encoder | ||
| # serialized using the custom encoder in profilers.json_encoder |
this doesn't look right....
the # ... not the indent
it does not, can we add to the hotfix after?
| AliasFloatType = Type[np.float64] | ||
| AliasStrType = Type[str] |
| piecewise = False | ||
| return order, cast(int, first_value), cast(int, last_value), piecewise | ||
| return order, first_value, last_value, piecewise, merged_data_store_type |
| "column_null_values": {"2": {"other_str": 5}}, | ||
| }, | ||
| } | ||
| self.maxDiff = None |
| option = ModeOption(is_enabled=False, max_k_modes=5) | ||
| serialized = json.dumps(option, cls=ProfileEncoder) | ||
| expected = { | ||
| "class": "ModeOption", | ||
| "data": {"is_enabled": False, "top_k_modes": 5}, |
bad naming... but it is what it is for now
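The expected payload in this test has the shape `{"class": <class name>, "data": <attributes>}`, and the decoder hunks earlier in this PR map `__name__` strings back to classes. A rough sketch of that envelope pattern follows, using only the standard library. `SketchOption`, `SketchEncoder`, and `decode_option` are hypothetical stand-ins and are not the DataProfiler `ProfileEncoder` or `json_decoder` implementations.

```python
import json
from typing import Any, Dict


class SketchOption:
    """Hypothetical stand-in for an option class; not DataProfiler code."""

    def __init__(self, is_enabled: bool = True, top_k_modes: int = 5) -> None:
        self.is_enabled = is_enabled
        self.top_k_modes = top_k_modes


class SketchEncoder(json.JSONEncoder):
    """Wrap known objects as {"class": <name>, "data": <attributes>}."""

    def default(self, o: Any) -> Any:
        if isinstance(o, SketchOption):
            return {"class": type(o).__name__, "data": vars(o)}
        # Fall back to the base class, which raises TypeError for unknown types.
        return super().default(o)


def decode_option(payload: Dict[str, Any]) -> SketchOption:
    """Look the class up by name and rebuild it from the stored attributes."""
    registry = {SketchOption.__name__: SketchOption}
    cls = registry[payload["class"]]
    return cls(**payload["data"])


serialized = json.dumps(SketchOption(is_enabled=False), cls=SketchEncoder)
restored = decode_option(json.loads(serialized))
```

Rebuilding with `cls(**payload["data"])` only works when the stored attribute names match the constructor parameters, which is the kind of naming mismatch the comment above is pointing at.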
| "column_null_values": {"2": {"other_str": 5}}, | ||
| }, | ||
| } | ||
| self.maxDiff = None |
does this need to be removed?
* feat: add dev to workfow for testing (#897) * Reservoir sampling (#826) * add code for reservoir sampling and insert sample_nrows options * pre commit fix * add tests for reservoir sampling * fixed mypy issues * fix import to relative path --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Richard Bann <richard@bann.com> * [WIP] staging/dev/options (#909) * New preset implementation and test (#867) * memory optimization preset ttrying again ttrying again 3 ttrying again 4 accidentally pushed my updated makefile * Wrote catch for invalid presets, wrote test for catch for invalid presets, debugged new optimization preset * Forgot to run pre-commit, fixed those issues * black doing weird things * made preset validation more maintainable by moving it to the constructor and getting rid of preset list * RowStatisticsOptions: Add option (#865) * RowStatisticsOptions: Add null row count Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics. * Unit test for RowStatisticOptions: * Black formatting * RowStatisticsOptions: Add null row count Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics. * Unit test for RowStatisticOptions: * Black formatting * added a unit test for RowStatisticsOptions * Deleted test cases that were written in the wrong file * updated testing for null_count toggle in _update_row_statistics * removed the RowStatisticsOptions from test_profiler_options imports * add line * Created toggle option for null_count * RowStatisticsOptions: Add implementation * Revert "RowStatisticsOptions: Add implementation" This reverts commit 2da6a93. 
* RowStatsticsOptions: Create option * fixed pre-commit error * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * fixed documentation --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Preset test updated w new names and different toggles (#880) * memory optimization preset ttrying again ttrying again 3 ttrying again 4 accidentally pushed my updated makefile * trying * trying * black doing weird things * trying * made preset validation more maintainable by moving it to the constructor and getting rid of preset list * Update to open-source in prep for wrapper changes for mem op preset * updated preset toggles and preset name (mem op -> large data) * updated tests to match * continued name and test and toggle updates * fix comments * RowStatisticsOptions: Implementing option (#871) * Implementing option * Implementing option * took out redundant if statement. added test case for when null_count is disabled. * attempt to check for conflicts between profile merges * added test to check if two profilers have null_count enabled before merging them together * fixed typo and added a trycatch to prevent failing test * No mocks needed. Fixed assertRaisesRegex error * Changed variables names and added a new test to check for check the null_count when null_count is disabled. * Changed name of test, moved tests to TestStructuredProfilerRowStatistics. Fixed position of if statement to prevent unnecessary code from running. 
* added null_count test cases * fixed indentation mistake * fixed typo * removed a useless commented a line * Updated test name * update --------- Co-authored-by: Liz Smith <liz.smith@richmond.edu> Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com> * Cms for categorical (#892) * WIP cms implementation * add heavy hitters implementation * add heavy hitters implementation * WIP: mypy issue * WIP: mypy issue * add cms bool and refactor options handler * WIP: testing for CMS * WIP: testing for CMS * use new heavy_hitters_threshold, add test for it * Reservoir sampling refactor (#910) * refactored all but tests * removed some superfluous tests * moved variables around * Staging/dev/profile serialization (#940) * initial changes to categoricalColumn decoder (#818) * Implemented decoding for numerical stats mixin and integer profiles (#844) * hot fixes for encode and decode of numeric stats mixin and intcol profiler (#852) * Float column profiler encode decode (#854) * hot fixes for encode and decode of numeric stats mixin and intcol profiler * cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes * Added docstring to the _load_stats_helper function * Update dataprofiler/profilers/numerical_column_stats.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/numerical_column_stats.py * fix for nan values issue in pytesting * Implementation of float profiler encode and decode process --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Json decode date time column (#861) * more verbose error log with types for easy debug * add load_from_dict to handle tiimestamps * add json decode tests * include DateTimeColumn class * Added decoding for encoding of ordered column profiles (#864) * Added ordered col test to ensure correct response to update when different ordering of values is introduced (#868) * added decode text_column_profiler functionality 
and tests (#870) * Created encoder for the datalabelercolumn (#869) * feat: add test and compiler serialization (#884) * [WIP] Adds tests validating serialization with Primitive type for compiler (#885) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (#886) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * feat: add tests and allow primitive compiler to deserialize * fix: bug in numeric stats deserial * fix: missing `)` after conflict resolution * Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (#887) * fix: organize categorical and add get function * refactor: reorganize tests and add stats test * feat: order typing * feat: add serial and deserial for stats compiler * fix: bug when sample_size == 0 * ready datalabeler for deserialization and improvement on serialization for datalabeler (#879) * Deserialization of datalabeler (#891) * Added initial profiler decoding for datalabeler column (WIP) * Intialial implementation for deserialization of datalabelercolumn * Fix LSP violations (#840) * Make profiler superclasses generic Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and BaseCompiler generic, to avoid casting in subclass diff() methods and violating LSP in principle. 
* Add needed cast import --------- Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> * Encode Options (#875) * encode testing * encode dataLabeler testing * encode structuredOptions testing * cleaned up datalabeler test * added text options * [WIP] ColumnDataLabelerCompiler: serialize / deserialize (#888) * formatting * update formatting * setting up full test suite for DataLabelerCompiler * update isort * updates to test -- still failing * update * Quick Test update (#893) * update * string in list * formatting * Decode options (#894) * refactored options encode testing * updated test name * updated class names * fixing test * initial base option decode * inital tests * refactor: allow options to go through all (#902) * refactor: allow options to go through all * fix: bug * StructuredColProfiler Encode / Decode (#901) * refactor: allow options to go through all * fix: bug * update * update * update * updates * update * Fixes for taylors StructuredCol Issue * update * update * remove try/except --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * fix: bug and add tests for structuredcolprofiler (#904) * fix: bug and add tests * fix: limit scipy requirements till problem understood and fixed * Stuctured profiler encode decode (#903) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> * [WIP] Added NoImplementationError for UnstructuredProfiler (#907) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured 
profile * test fix * mypy fixes for typing issues * fix for none case of the datalabler in options * Added mock of datalabeler to structured profile test * Added tests for encoding of the Structured profiler * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Pr fixes * Fixed typo in test * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/tests/profilers/utils.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Fixes for unneeeded callout for _profile check * small change --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> * Added testing for values for test_json_decode_after_update (#915) * Reuse passed labeler (#924) * refactor: loading labeler for reuse and abstract loading * refactor: use for DataLabelerColumn as well * fix: don't error if doesn't exist * refactor: allow for config dict to be passed entire way * fix: compiler tests * fix: structCol tests * fix: test * BaseProfiler save() for json (#923) * added save for top level and tests * small refactor * small fix * refactor: use seed for sample for consistency (#927) * refactor: use seed for sample for consistency * fix: formatting and variables * WIP top level load (#925) * 
quick hot fix for input validation on save() save_metho (#931) * BaseProfiler: `load_method` hotfix (#932) * added load_method * updated tests * fix: null_rep mat should calculate even if datetime (#933) * Notebook Example save/load Profile (#930) * update example data profiler demo save/load * update notebook cells * Update examples/data_profiler_demo.ipynb * Update examples/data_profiler_demo.ipynb * fix: order bug (#939) * fix: typo on rebase * fix: typing and bugs from rebase * fix: options tests due to merge and loading new options --------- Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * Hotfix: fix post feature serialization merge (#942) * fix: to use config instead of options * fix: comment * fix: maxdiff * version bump (#944) --------- Co-authored-by: JGSweets <JGSweets@users.noreply.github.com> Co-authored-by: Rushabh Vinchhi <rushabhuvinchhi@gmail.com> Co-authored-by: Richard Bann <richard@bann.com> Co-authored-by: Liz Smith <liz.smith@richmond.edu> Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com>