Add variant distribution constraint #136

Merged
merged 4 commits into from
May 12, 2023

Conversation

@0xbe7a 0xbe7a commented May 11, 2023

This PR adds a VariantDistributionConstraint which checks if the distribution of values in a column falls within the specified minimum and maximum bounds.
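As a rough illustration of the check described above, here is a minimal plain-Python sketch. It does not use datajudge's API: the function name `share_violations` and its handling of edge cases are assumptions made for illustration, and only the `distribution` and `default_bounds` parameter names are borrowed from the signature in this PR.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Bounds = Tuple[float, float]


def share_violations(
    values: Iterable[Hashable],
    distribution: Dict[Hashable, Bounds],
    default_bounds: Bounds = (0.0, 0.0),
) -> Dict[Hashable, float]:
    """Hypothetical helper: map each out-of-bounds value to its observed share."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # Assumption: an empty column produces no violations.
        return {}
    violations = {}
    # Also visit expected values that never occur; their share is 0.
    for value in set(counts) | set(distribution):
        share = counts[value] / total
        # Values without explicit bounds fall back to default_bounds.
        lower, upper = distribution.get(value, default_bounds)
        if not lower <= share <= upper:
            violations[value] = share
    return violations


column = ["A"] * 16 + ["B"] * 84
print(share_violations(column, {"A": (0.05, 0.15), "B": (0.5, 1.0)}))
```

With `default_bounds=(0, 0)`, any value that is not listed in `distribution` counts as a violation as soon as it appears in the column.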

@codecov

codecov bot commented May 11, 2023

Codecov Report

Merging #136 (c12b4c9) into main (c7f5a4b) will not change coverage.
The diff coverage is 60.00%.

@@           Coverage Diff           @@
##             main     #136   +/-   ##
=======================================
  Coverage   36.44%   36.44%           
=======================================
  Files          15       15           
  Lines        1723     1723           
=======================================
  Hits          628      628           
  Misses       1095     1095           

| Impacted Files | Coverage Δ |
|---|---|
| src/datajudge/requirements.py | 50.51% <57.14%> (ø) |
| src/datajudge/constraints/uniques.py | 33.05% <100.00%> (ø) |

@YYYasin19
Contributor

I like this a lot! 🚀

To give a concrete example that might highlight how this could be used:
In our project, we have a timestamp column representing the birthdate of a person. When the data is read in, this value is sometimes corrupted. This produces many outliers (e.g. people suddenly born in 1762) as well as many misinterpreted values that cluster around 1970.

With this test, we could quantize all timestamps to their year, yielding a categorical column with values in [1900, 1901, ..., 2023], and then check that no category accounts for more than 5% of the column, which would flag such accumulations.
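A minimal sketch of that idea in plain Python, not tied to datajudge's API. The birthdates are made-up illustration data, and `max_share` is a hypothetical name for the 5% upper bound.

```python
from collections import Counter
from datetime import datetime

# Made-up data: one birthdate per year from 1950 to 1989,
# plus three misread values that collapsed onto 1970-01-01.
birthdates = [datetime(year, 6, 15) for year in range(1950, 1990)]
birthdates += [datetime(1970, 1, 1)] * 3

# Quantize every timestamp to its year to get a categorical column.
years = [birthdate.year for birthdate in birthdates]

max_share = 0.05  # hypothetical upper bound applied to every year
counts = Counter(years)
total = len(years)

# Collect the years whose share of the column exceeds the bound.
too_frequent = {
    year: count / total
    for year, count in counts.items()
    if count / total > max_share
}
print(sorted(too_frequent))
```

In this toy column only 1970 exceeds the bound, since it holds 4 of the 43 rows while every other year holds 1.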

Collaborator

@kklein kklein left a comment

Really great work!

self,
ref: DataReference,
distribution: Dict[T, Tuple[float, float]],
default_bounds: Tuple[float, float] = (0, 0),
Collaborator

What would you think of a relative violation tolerance parameter? E.g. it could say:

A test succeeds iff
#observations outside of the specified ranges / #observations <= tolerance_parameter

I don't consider it a must - we've simply fared well with tolerances historically.
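One way to read the proposed criterion, sketched in plain Python. This assumes that an observation counts as "outside" whenever the share of its value violates the bounds for that value. The names `relative_violations` and `passes` are hypothetical, and this is not necessarily how the merged code behaves.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Bounds = Tuple[float, float]


def relative_violations(
    values: Iterable[Hashable],
    distribution: Dict[Hashable, Bounds],
    default_bounds: Bounds = (0.0, 0.0),
) -> float:
    """Fraction of observations whose value has an out-of-bounds share."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    outside = 0
    # Expected values that never occur contribute no observations here.
    for value, count in counts.items():
        lower, upper = distribution.get(value, default_bounds)
        if not lower <= count / total <= upper:
            outside += count
    return outside / total


def passes(values, distribution, tolerance, default_bounds=(0.0, 0.0)) -> bool:
    """The test succeeds iff the violating fraction is within the tolerance."""
    return relative_violations(values, distribution, default_bounds) <= tolerance


column = ["A"] * 16 + ["B"] * 84
bounds = {"A": (0.05, 0.15), "B": (0.5, 1.0)}
print(passes(column, bounds, tolerance=0.2))
print(passes(column, bounds, tolerance=0.1))
```

Under this reading, "A" exceeds its upper bound, so all 16 of its observations count as violations: the check passes with a tolerance of 0.2 and fails with 0.1.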

Author

If 'A' is expected to have a target share ranging from 5% to 15%, but its actual share is 16%, would you consider the 16% to be a violation of the target range or merely 1% above the upper limit?
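The two readings in this question give quite different numbers. A small sketch with hypothetical variable names, using the figures from the example:

```python
lower, upper = 0.05, 0.15
observed_share = 0.16

# Reading 1: the whole category is in violation, so its full share counts.
in_bounds = lower <= observed_share <= upper
whole_share = 0.0 if in_bounds else observed_share

# Reading 2: only the distance to the nearest violated bound counts.
excess = max(lower - observed_share, observed_share - upper, 0.0)

print(whole_share)
print(round(excess, 2))
```

Compared against a tolerance of, say, 5%, the first reading (16%) would fail and the second (1%) would pass, so the choice matters.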

Author

@0xbe7a 0xbe7a May 12, 2023

I added this feature

Collaborator

@kklein kklein left a comment

Looks great - thanks a bunch! :)

3 participants