Add variant distribution constraint #136

0xbe7a · 2023-05-11T07:44:05Z

This PR adds a VariantDistributionConstraint, which checks whether the relative frequency of each value in a column falls within the specified minimum and maximum bounds.
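
To illustrate the idea, here is a minimal, library-independent sketch of such a check. It is not the code from this PR: the function name `out_of_bounds_shares` is hypothetical, and only the `distribution` and `default_bounds` parameter shapes are taken from the signature quoted in the review below. The sketch assumes that variants without explicit bounds fall back to `default_bounds`.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Bounds = Tuple[float, float]


def out_of_bounds_shares(
    values: Iterable[Hashable],
    distribution: Dict[Hashable, Bounds],
    default_bounds: Bounds = (0, 0),
) -> Dict[Hashable, float]:
    """Return the observed share of every variant that lies outside its bounds."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # Assumption: an empty column has no shares to check.
        return {}
    violations = {}
    # Look at variants that were observed and at variants that were only specified.
    for variant in set(counts) | set(distribution):
        lower, upper = distribution.get(variant, default_bounds)
        share = counts.get(variant, 0) / total
        if not lower <= share <= upper:
            violations[variant] = share
    return violations


column = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
bounds = {"A": (0.5, 0.7), "B": (0.0, 0.2)}
violations = out_of_bounds_shares(column, bounds)
# "B" exceeds its upper bound; "C" has no entry and falls back to (0, 0).
print(sorted(violations))
```

With the default of `(0, 0)`, any variant that is observed but not listed counts as out of bounds, which is one way to catch unexpected values.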

codecov · 2023-05-11T07:50:29Z

Codecov Report

Merging #136 (c12b4c9) into main (c7f5a4b) will not change coverage.
The diff coverage is 60.00%.

@@           Coverage Diff           @@
##             main     #136   +/-   ##
=======================================
  Coverage   36.44%   36.44%           
=======================================
  Files          15       15           
  Lines        1723     1723           
=======================================
  Hits          628      628           
  Misses       1095     1095

| Impacted Files | Coverage Δ |
|---|---|
| src/datajudge/requirements.py | `50.51% <57.14%> (ø)` |
| src/datajudge/constraints/uniques.py | `33.05% <100.00%> (ø)` |

YYYasin19 · 2023-05-11T16:02:57Z

I like this a lot! 🚀

To give a concrete example that might highlight how this could be used:
In our project, we have a timestamp column representing a person's birthdate. When the data is read in, this value sometimes gets corrupted. This produces many outliers (e.g. people suddenly born in 1762) as well as many misinterpreted values that cluster around 1970.

With this test, we could quantize all timestamps to their respective year resulting in a categorical column with values in [1900, 1901, ..., 2023] and then check that each category only takes up at most 5% of the column -- preventing accumulations.
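
A rough sketch of that use case in plain Python, independent of datajudge: the helper name `overrepresented_years` and the sample data are made up for illustration, and the 5% limit is the one mentioned above.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def overrepresented_years(
    birthdates: Iterable[datetime], max_share: float = 0.05
) -> Dict[int, float]:
    """Quantize timestamps to years and return years whose share exceeds max_share."""
    years = [timestamp.year for timestamp in birthdates]
    counts = Counter(years)
    total = len(years)
    return {
        year: count / total
        for year, count in counts.items()
        if count / total > max_share
    }


# Made-up data: one birthdate per year from 1920 to 2009 (90 rows) ...
birthdates = [datetime(year, 6, 15) for year in range(1920, 2010)]
# ... plus 10 corrupted rows that all landed on 1970.
birthdates += [datetime(1970, 1, 1)] * 10

flagged = overrepresented_years(birthdates)
print(flagged)  # only 1970 is flagged, with 11 of 100 rows
```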

kklein

Really great work!

kklein · 2023-05-11T17:29:58Z

src/datajudge/constraints/uniques.py

+        self,
+        ref: DataReference,
+        distribution: Dict[T, Tuple[float, float]],
+        default_bounds: Tuple[float, float] = (0, 0),


What would you think of a relative violation tolerance parameter? E.g. it could say:

A test succeeds iff
#observations outside of the specified ranges / #observations <= tolerance_parameter

I don't consider it a must - we've simply fared well with tolerances historically.

If 'A' is expected to have a target share ranging from 5% to 15%, but its actual share is 16%, would you consider the 16% to be a violation of the target range or merely 1% above the upper limit?

I added this feature
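
The two readings raised in this thread can be compared with a small sketch. It is only an illustration under stated assumptions, not the implementation that was merged: the function name `violation_measures` and the sample data are hypothetical. The first measure counts the full share of every out-of-range variant (16% in the example above), the second counts only the amount by which a share leaves its range (1%).

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Bounds = Tuple[float, float]


def violation_measures(
    values: Iterable[Hashable],
    distribution: Dict[Hashable, Bounds],
    default_bounds: Bounds = (0.0, 0.0),
) -> Tuple[float, float]:
    """Return (share of observations in violating variants, total excess beyond bounds)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute shares of an empty column")
    violating_share = 0.0
    excess = 0.0
    for variant in set(counts) | set(distribution):
        lower, upper = distribution.get(variant, default_bounds)
        share = counts.get(variant, 0) / total
        if share > upper:
            violating_share += share
            excess += share - upper
        elif share < lower:
            violating_share += share
            excess += lower - share
    return violating_share, excess


# 'A' is expected at 5% to 15% but actually makes up 16% of the column.
column = ["A"] * 16 + ["B"] * 84
bounds = {"A": (0.05, 0.15), "B": (0.80, 0.90)}
violating_share, excess = violation_measures(column, bounds)

tolerance = 0.05  # hypothetical tolerance parameter
print(violating_share <= tolerance)  # first reading: 16% counts as violating
print(excess <= tolerance)           # second reading: only 1% counts
```

With a tolerance of 5%, the first reading fails the check while the second passes, so the choice of reading matters for how the tolerance should be set.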

kklein

Looks great - thanks a bunch! :)

Add variant distribution constraint

e69ef8a

add support to specify default bounds

e65557d

YYYasin19 reviewed May 11, 2023

src/datajudge/constraints/uniques.py (outdated)

kklein reviewed May 11, 2023

Apply feedback from review

8887ece

kklein reviewed May 12, 2023

tests/integration/test_integration.py (outdated)

add support to tolerate some degree of violation

c12b4c9

kklein approved these changes May 12, 2023

0xbe7a merged commit 5ba804f into Quantco:main May 12, 2023

0xbe7a added the snowflake label May 12, 2023

kklein mentioned this pull request May 22, 2023

Question/Feature Request: Check distribution of unique values in column #87

Closed
