refactor: allow options to go through all #902

Merged
taylorfturner merged 3 commits into capitalone:feature/profile-serialization from JGSweets:fix-option-passing on Jun 23, 2023

@JGSweets (Contributor)

This PR adds options to load_from_dict for:

  • low-level column profilers
  • compilers
  • options

It also adds a test validating that data_labeler is passed through options from a higher level.
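The idea of letting options travel through every load_from_dict call can be sketched as follows. This is a minimal illustration, not the library's code: ToyColumnProfiler, ToyCompiler, and the "data_labeler" key in the options dict are hypothetical names chosen for the example.

```python
from __future__ import annotations

from typing import Any, Optional


class ToyColumnProfiler:
    """Hypothetical low-level profiler (not a DataProfiler class)."""

    def __init__(self, name: str, data_labeler: Any = None) -> None:
        self.name = name
        self.data_labeler = data_labeler

    @classmethod
    def load_from_dict(
        cls, data: dict, options: Optional[dict] = None
    ) -> ToyColumnProfiler:
        # Options are optional so existing callers keep working.
        options = options or {}
        return cls(name=data["name"], data_labeler=options.get("data_labeler"))


class ToyCompiler:
    """Hypothetical compiler that owns several column profilers."""

    def __init__(self, profilers: list[ToyColumnProfiler]) -> None:
        self.profilers = profilers

    @classmethod
    def load_from_dict(
        cls, data: dict, options: Optional[dict] = None
    ) -> ToyCompiler:
        # The higher level forwards the same options object to each child.
        profilers = [
            ToyColumnProfiler.load_from_dict(item, options)
            for item in data["profilers"]
        ]
        return cls(profilers)


# A labeler supplied once at the top reaches every low-level profiler.
shared_labeler = object()
compiler = ToyCompiler.load_from_dict(
    {"profilers": [{"name": "a"}, {"name": "b"}]},
    options={"data_labeler": shared_labeler},
)
```

Forwarding one options object, rather than rebuilding it at each level, is what would let a single data labeler be handed down from a higher level to every column profiler.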

@JGSweets added the labels Medium Priority (Significant improvement or bug / feature reducing overall performance) and Refactor (Code that is being modified to improve the library) on Jun 23, 2023
@taylorfturner enabled auto-merge (squash) on June 23, 2023 16:35
"data_label_representation", None
) == {"a": 0.6, "b": 0.4}

def test_json_decode_with_options(self, mock_instance):

This test is really hard for me to read. There is a lot of mock setup, and the variable naming between mock_instance and new_mock_data_labeler makes it difficult to discern their roles in this test. Can this be cleaned up? If it passes, I am fine approving, since this is about readability rather than functionality.

much improved now IMHO
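
As an illustration of the readability point raised in this thread, here is a minimal sketch of a mock-based test in which each mock's name states its role. It is not the PR's test: ToyProfile, _load_default_labeler, and check_passed_labeler_is_reused are hypothetical names, and only unittest.mock from the standard library is used.

```python
from unittest import mock


class ToyProfile:
    """Hypothetical profile that needs a data labeler (not the library's class)."""

    def __init__(self, data_labeler):
        self.data_labeler = data_labeler

    @staticmethod
    def _load_default_labeler():
        # Stands in for an expensive default load that the test must avoid.
        raise RuntimeError("default labeler should not be loaded in this test")

    @classmethod
    def load_from_dict(cls, data, options=None):
        options = options or {}
        labeler = options.get("data_labeler")
        if labeler is None:
            labeler = cls._load_default_labeler()
        return cls(labeler)


def check_passed_labeler_is_reused():
    # Role: the labeler the caller supplies through options.
    labeler_passed_via_options = mock.Mock(name="labeler_passed_via_options")

    # Role: the default loader, patched so the test can see whether it ran.
    with mock.patch.object(
        ToyProfile, "_load_default_labeler"
    ) as default_labeler_loader:
        profile = ToyProfile.load_from_dict(
            {}, options={"data_labeler": labeler_passed_via_options}
        )

    default_labeler_loader.assert_not_called()
    return profile.data_labeler is labeler_passed_via_options
```

Naming each mock after what it represents in the scenario keeps the roles distinct without extra comments.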

@taylorfturner merged commit ee1f602 into capitalone:feature/profile-serialization on Jun 23, 2023
JGSweets added a commit to JGSweets/data-profiler that referenced this pull request Jun 29, 2023
* refactor: allow options to go through all

* fix: bug
taylorfturner added a commit that referenced this pull request Jun 29, 2023
* initial changes to categoricalColumn decoder (#818)

* Implemented decoding for numerical stats mixin and integer profiles (#844)

* hot fixes for encode and decode of numeric stats mixin and intcol profiler (#852)

* Float column profiler encode decode (#854)

* hot fixes for encode and decode of numeric stats mixin and intcol profiler

* cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes

* Added docstring to the _load_stats_helper function

* Update dataprofiler/profilers/numerical_column_stats.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/numerical_column_stats.py

* fix for nan values issue in pytesting

* Implementation of float profiler encode and decode process

---------

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Json decode date time column (#861)

* more verbose error log with types for easy debug

* add load_from_dict to handle timestamps

* add json decode tests

* include DateTimeColumn class

* Added decoding for encoding of ordered column profiles (#864)

* Added ordered col test to ensure correct response to update when different ordering of values is introduced (#868)

* added decode text_column_profiler functionality and tests (#870)

* Created encoder for the datalabelercolumn (#869)

* feat: add test and compiler serialization (#884)

* [WIP] Adds tests validating serialization with Primitive type for compiler (#885)

* feat: add test and compiler serialization

* fix: move primitive tests to own class

* feat: add primitive col compiler save tests

* fix: float serializers asserts

* Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (#886)

* feat: add test and compiler serialization

* fix: move primitive tests to own class

* feat: add primitive col compiler save tests

* fix: float serializers asserts

* feat: add tests and allow primitive compiler to deserialize

* fix: bug in numeric stats deserial

* fix: missing `)` after conflict resolution

* Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (#887)

* fix: organize categorical and add get function

* refactor: reorganize tests and add stats test

* feat: order typing

* feat: add serial and deserial for stats compiler

* fix: bug when sample_size == 0

* ready datalabeler for deserialization and improvement on serialization for datalabeler (#879)

* Deserialization of datalabeler (#891)

* Added initial profiler decoding for datalabeler column (WIP)

* Initial implementation for deserialization of datalabelercolumn

* Fix LSP violations (#840)

* Make profiler superclasses generic

Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and
BaseCompiler generic, to avoid casting in subclass diff() methods and
violating LSP in principle.

* Add needed cast import

---------

Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>

* Encode Options (#875)

* encode testing

* encode dataLabeler testing

* encode structuredOptions testing

* cleaned up datalabeler test

* added text options

* [WIP] ColumnDataLabelerCompiler: serialize / deserialize (#888)

* formatting

* update formatting

* setting up full test suite for DataLabelerCompiler

* update isort

* updates to test -- still failing

* update

* Quick Test update (#893)

* update

* string in list

* formatting

* Decode options (#894)

* refactored options encode testing

* updated test name

* updated class names

* fixing test

* initial base option decode

* initial tests

* refactor: allow options to go through all (#902)

* refactor: allow options to go through all

* fix: bug

* StructuredColProfiler Encode / Decode  (#901)

* refactor: allow options to go through all

* fix: bug

* update

* update

* update

* updates

* update

* Fixes for taylors StructuredCol Issue

* update

* update

* remove try/except

---------

Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com>
Co-authored-by: ksneab7 <ksneab7@gmail.com>

* fix: bug and add tests for structuredcolprofiler (#904)

* fix: bug and add tests

* fix: limit scipy requirements till problem understood and fixed

* Structured profiler encode decode (#903)

* refactor: allow options to go through all

* fix: bug in loading options

* update

* update

* Fixes for taylors StructuredCol Issue

* Created load and save code from structuredprofiler

* intermediate commit for fixing structured profile

---------

Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com>
Co-authored-by: taylorfturner <taylorfturner@gmail.com>

* [WIP] Added NoImplementationError for UnstructuredProfiler (#907)

* refactor: allow options to go through all

* fix: bug in loading options

* update

* update

* Fixes for taylors StructuredCol Issue

* Created load and save code from structuredprofiler

* intermediate commit for fixing structured profile

* test fix

* mypy fixes for typing issues

* fix for none case of the datalabeler in options

* Added mock of datalabeler to structured profile test

* Added tests for encoding of the Structured profiler

* Update dataprofiler/profilers/json_decoder.py

Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>

* Pr fixes

* Fixed typo in test

* Update dataprofiler/profilers/json_decoder.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>

* Update dataprofiler/tests/profilers/utils.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>

* Fixes for unneeded callout for _profile check

* small change

---------

Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com>
Co-authored-by: taylorfturner <taylorfturner@gmail.com>
Co-authored-by: ksneab7 <ksneab7@gmail.com>
Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>

* Added testing for values for test_json_decode_after_update (#915)

* Reuse passed labeler (#924)

* refactor: loading labeler for reuse and abstract loading

* refactor: use for DataLabelerColumn as well

* fix: don't error if doesn't exist

* refactor: allow for config dict to be passed entire way

* fix: compiler tests

* fix: structCol tests

* fix: test

* BaseProfiler save() for json (#923)

* added save for top level and tests

* small refactor

* small fix

* refactor: use seed for sample for consistency (#927)

* refactor: use seed for sample for consistency

* fix: formatting and variables

* WIP top level load (#925)

* quick hot fix for input validation on save() save_metho (#931)

* BaseProfiler: `load_method` hotfix (#932)

* added load_method

* updated tests

* fix: null_rep mat should calculate even if datetime (#933)

* Notebook Example save/load Profile (#930)

* update example data profiler demo save/load

* update notebook cells

* Update examples/data_profiler_demo.ipynb

* Update examples/data_profiler_demo.ipynb

* fix: order bug (#939)

* fix: typo on rebase

* fix: typing and bugs from rebase

* fix: options tests due to merge and loading new options

---------

Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
Co-authored-by: Tyler <tfarnan@ucsd.edu>
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
Co-authored-by: ksneab7 <ksneab7@gmail.com>
micdavis added a commit that referenced this pull request Jun 29, 2023
* feat: add dev to workflow for testing (#897)

* Reservoir sampling (#826)

* add code for reservoir sampling and insert sample_nrows options

* pre commit fix

* add tests for reservoir sampling

* fixed mypy issues

* fix import to relative path

---------

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
Co-authored-by: Richard Bann <richard@bann.com>

* [WIP] staging/dev/options (#909)

* New preset implementation and test (#867)

* memory optimization preset

ttrying again

ttrying again 3

ttrying again 4

accidentally pushed my updated makefile

* Wrote catch for invalid presets, wrote test for catch for invalid presets, debugged new optimization preset

* Forgot to run pre-commit, fixed those issues

* black doing weird things

* made preset validation more maintainable by moving it to the constructor and getting rid of preset list

* RowStatisticsOptions: Add option (#865)

* RowStatisticsOptions: Add null row count

Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics.

* Unit test for RowStatisticOptions:

* Black formatting

* RowStatisticsOptions: Add null row count

Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics.

* Unit test for RowStatisticOptions:

* Black formatting

* added a unit test for RowStatisticsOptions

* Deleted test cases that were written in the wrong file

* updated testing for null_count toggle in _update_row_statistics

* removed the RowStatisticsOptions from test_profiler_options imports

* add line

* Created toggle option for null_count

* RowStatisticsOptions: Add implementation

* Revert "RowStatisticsOptions: Add implementation"

This reverts commit 2da6a93.

* RowStatisticsOptions: Create option

* fixed pre-commit error

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* fixed documentation

---------

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>

* Preset test updated w new names and different toggles (#880)

* memory optimization preset

ttrying again

ttrying again 3

ttrying again 4

accidentally pushed my updated makefile

* trying

* trying

* black doing weird things

* trying

* made preset validation more maintainable by moving it to the constructor and getting rid of preset list

* Update to open-source in prep for wrapper changes for mem op preset

* updated preset toggles and preset name (mem op -> large data)

* updated tests to match

* continued name and test and toggle updates

* fix comments

* RowStatisticsOptions: Implementing option (#871)

* Implementing option

* Implementing option

* took out redundant if statement. added test case for when null_count is disabled.

* attempt to check for conflicts between profile merges

* added test to check if two profilers have null_count enabled before merging them together

* fixed typo and added a trycatch to prevent failing test

* No mocks needed. Fixed assertRaisesRegex error

* Changed variable names and added a new test to check the null_count when null_count is disabled.

* Changed name of test, moved tests to TestStructuredProfilerRowStatistics. Fixed position of if statement to prevent unnecessary code from running.

* added null_count test cases

* fixed indentation mistake

* fixed typo

* removed a useless commented line

* Updated test name

* update

---------

Co-authored-by: Liz Smith <liz.smith@richmond.edu>
Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com>

* Cms for categorical (#892)

* WIP cms implementation

* add heavy hitters implementation

* add heavy hitters implementation

* WIP: mypy issue

* WIP: mypy issue

* add cms bool and refactor options handler

* WIP: testing for CMS

* WIP: testing for CMS

* use new heavy_hitters_threshold, add test for it

* Reservoir sampling refactor (#910)

* refactored all but tests

* removed some superfluous tests

* moved variables around

* Staging/dev/profile serialization (#940)

* Hotfix: fix post feature serialization merge (#942)

* fix: to use config instead of options

* fix: comment

* fix: maxdiff

* version bump (#944)

---------

Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>
Co-authored-by: Rushabh Vinchhi <rushabhuvinchhi@gmail.com>
Co-authored-by: Richard Bann <richard@bann.com>
Co-authored-by: Liz Smith <liz.smith@richmond.edu>
Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com>
Co-authored-by: Tyler <tfarnan@ucsd.edu>
Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com>
Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com>
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
Co-authored-by: ksneab7 <ksneab7@gmail.com>