Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create Unique Constraint #532

Closed
amontanez24 opened this issue Jul 29, 2021 · 7 comments · Fixed by #540
Closed

Create Unique Constraint #532

amontanez24 opened this issue Jul 29, 2021 · 7 comments · Fixed by #540
Assignees
Labels
feature request Request for a new feature
Milestone

Comments

@amontanez24
Copy link
Contributor

Problem Description

Sometimes, even though a column isn't a key, the values in it need to be unique in order for the data to be valid. Consider the following 2 use cases:

  1. There is a sensitive field in my data (such as SSN) that cannot be repeated. Even though I have another primary key (user_id), I need to makes sure that SSN is also unique
  2. I have a table that lists all the shoe types I sell. I have models (eg. Nike, Reebok, etc.) and sizes (6, 6.5, 7, etc.). I to make sure that each size is unique within a given model. Eg. I can't have 2 rows that are size 7.5 Reebok

In order to support this, write a Unique constraint where the user can supply both a column (whose values must be unique) and an optional list of group_by columns that control the partition that determine uniqueness. In our cases

  1. column='SSN' with group_by=None -- which means it's unique throughout the table
  2. column='size' while group_by=['model'] -- which means that within a given model, size is unique
@amontanez24 amontanez24 added the feature request Request for a new feature label Jul 29, 2021
@csala
Copy link
Contributor

csala commented Jul 29, 2021

This sounds good @amontanez24 but I would suggest one change in the proposal: Instead of having a column argument, I would have a columns argument so that it is possible to ensure that the combination of multiple columns stays unique within the table.

@npatki
Copy link
Contributor

npatki commented Jul 30, 2021

It should also be possible for Unique and UniqueCombinations to work with each other that way.

Eg. a city, country pair is a UniqueCombination but maybe in my dataset, I don't want each pair to appear more than once -- as in, there should only be 1 row for Paris, France

@csala
Copy link
Contributor

csala commented Jul 30, 2021

If Unique is implemented using reject sampling, the two should be able to be combined without issues. The only thing that will matter is the order: If Unique comes after UniqueCombinations, and if UniqueCombinations is using the transform strategy, it will fail because Unique will not find the corresponding columns. If Unique is passed before UniqueCombinations, everything should work:

from sdv.constraints import Unique, UniqueCombinations

constraints = [
    Unique(columns=['country', 'city']),
    UniqueCombinations(columns=['country', 'city']),
]
my_model = MyTabularModel(constraints=constraints)

@npatki
Copy link
Contributor

npatki commented Jul 30, 2021

Could we consider having a extra_combinations kwarg so that nobody runs into this case?

from sdv.constraints import Unique


cons = Unique(
    columns=['country', 'city'], # a country, city pair can only appear once
    extra_combinations=False # don't make extra combinations outside of the original data
)

my_model = MyTabularModel(constraints=[cons])

@csala
Copy link
Contributor

csala commented Jul 30, 2021

I would not do it, for multiple reasons:

  1. Each Constraint should have only one purpose, otherwise it will become too complex and confusing.
  2. Since UniqueCombinations would still exist, someone who is familiar with it (and maybe not so much with Unique) could still define both Unique and UniqueCombinations in the wrong order and hit the error anyway.
  3. The UniqueCombinations implementation is already complex on its own, so having Unique cover the same functionality would introduce additional complexity that may end up in maintenance overhead and potential errors.

So, altogether, I would rather say that we should cover this in a separate issue about making the constraints robust enough (or have enough validations) so that incompatible ordering is not possible, either because using the wrong order raises a helpful error message from which the user can learn how to sort the constraints properly, or because SDV is able to internally figure out the right order and apply it.

@npatki
Copy link
Contributor

npatki commented Jul 30, 2021

+1 to a new issue about making sure SDV can robustly handle multiple constraints. This seems like a future project. For the current scope -- a descriptive error message would be nice.

Food for thought:

Each Constraint should have only one purpose, otherwise it will become too complex and confusing.

This isn't 100% true right now. For eg, the GreaterThan constraint can be used for two purposes (scalar comparison or column comparison) and the Between constraint is essentially a shortcut for two GreaterThan constraints.

I actually like this setup. To me, it's simple to have 1 constraint per problem being encountered. Together, the constraint need not necessarily be mutually exclusive primitives.

@csala
Copy link
Contributor

csala commented Aug 3, 2021

I see that @amontanez24 already opened issue #541 to sort out the problem of combining multiple constraints that affect the same columns, so here we can stick to the single Unique purpose of ensuring unique values.

Also, it seems like the group_by argument stops being necessary once multiple columns can be specified, so we can drop it from the specification.

@csala csala added the approved label Aug 10, 2021
@csala csala added this to the 0.12.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Request for a new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants