Create Unique Constraint #532

amontanez24 · 2021-07-29T03:02:00Z

Problem Description

Sometimes, even though a column isn't a key, the values in it need to be unique in order for the data to be valid. Consider the following 2 use cases:

There is a sensitive field in my data (such as SSN) that cannot be repeated. Even though I have another primary key (user_id), I need to makes sure that SSN is also unique
I have a table that lists all the shoe types I sell. I have models (eg. Nike, Reebok, etc.) and sizes (6, 6.5, 7, etc.). I to make sure that each size is unique within a given model. Eg. I can't have 2 rows that are size 7.5 Reebok

In order to support this, write a Unique constraint where the user can supply both a column (whose values must be unique) and an optional list of group_by columns that control the partition that determine uniqueness. In our cases

column='SSN' with group_by=None -- which means it's unique throughout the table
column='size' while group_by=['model'] -- which means that within a given model, size is unique

The text was updated successfully, but these errors were encountered:

csala · 2021-07-29T16:58:13Z

This sounds good @amontanez24 but I would suggest one change in the proposal: Instead of having a column argument, I would have a columns argument so that it is possible to ensure that the combination of multiple columns stays unique within the table.

npatki · 2021-07-30T14:53:14Z

It should also be possible for Unique and UniqueCombinations to work with each other that way.

Eg. a city, country pair is a UniqueCombination but maybe in my dataset, I don't want each pair to appear more than once -- as in, there should only be 1 row for Paris, France

csala · 2021-07-30T16:37:30Z

If Unique is implemented using reject sampling, the two should be able to be combined without issues. The only thing that will matter is the order: If Unique comes after UniqueCombinations, and if UniqueCombinations is using the transform strategy, it will fail because Unique will not find the corresponding columns. If Unique is passed before UniqueCombinations, everything should work:

from sdv.constraints import Unique, UniqueCombinations

constraints = [
    Unique(columns=['country', 'city']),
    UniqueCombinations(columns=['country', 'city']),
]
my_model = MyTabularModel(constraints=constraints)

npatki · 2021-07-30T16:44:11Z

Could we consider having a extra_combinations kwarg so that nobody runs into this case?

from sdv.constraints import Unique


cons = Unique(
    columns=['country', 'city'], # a country, city pair can only appear once
    extra_combinations=False # don't make extra combinations outside of the original data
)

my_model = MyTabularModel(constraints=[cons])

csala · 2021-07-30T16:55:33Z

I would not do it, for multiple reasons:

Each Constraint should have only one purpose, otherwise it will become too complex and confusing.
Since UniqueCombinations would still exist, someone who is familiar with it (and maybe not so much with Unique) could still define both Unique and UniqueCombinations in the wrong order and hit the error anyway.
The UniqueCombinations implementation is already complex on its own, so having Unique cover the same functionality would introduce additional complexity that may end up in maintenance overhead and potential errors.

So, altogether, I would rather say that we should cover this in a separate issue about making the constraints robust enough (or have enough validations) so that incompatible ordering is not possible, either because using the wrong order raises a helpful error message from which the user can learn how to sort the constraints properly, or because SDV is able to internally figure out the right order and apply it.

npatki · 2021-07-30T17:14:29Z

+1 to a new issue about making sure SDV can robustly handle multiple constraints. This seems like a future project. For the current scope -- a descriptive error message would be nice.

Food for thought:

Each Constraint should have only one purpose, otherwise it will become too complex and confusing.

This isn't 100% true right now. For eg, the GreaterThan constraint can be used for two purposes (scalar comparison or column comparison) and the Between constraint is essentially a shortcut for two GreaterThan constraints.

I actually like this setup. To me, it's simple to have 1 constraint per problem being encountered. Together, the constraint need not necessarily be mutually exclusive primitives.

csala · 2021-08-03T16:54:37Z

I see that @amontanez24 already opened issue #541 to sort out the problem of combining multiple constraints that affect the same columns, so here we can stick to the single Unique purpose of ensuring unique values.

Also, it seems like the group_by argument stops being necessary once multiple columns can be specified, so we can drop it from the specification.

amontanez24 added the feature request Request for a new feature label Jul 29, 2021

amontanez24 mentioned this issue Aug 2, 2021

Create Unique Constraint #540

Merged

amontanez24 closed this as completed in #540 Aug 5, 2021

csala added the approved label Aug 10, 2021

csala assigned amontanez24 Aug 12, 2021

csala added this to the 0.12.0 milestone Aug 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create Unique Constraint #532

Create Unique Constraint #532

amontanez24 commented Jul 29, 2021

csala commented Jul 29, 2021

npatki commented Jul 30, 2021

csala commented Jul 30, 2021

npatki commented Jul 30, 2021

csala commented Jul 30, 2021 •

edited

Loading

npatki commented Jul 30, 2021

csala commented Aug 3, 2021

Create Unique Constraint #532

Create Unique Constraint #532

Comments

amontanez24 commented Jul 29, 2021

Problem Description

csala commented Jul 29, 2021

npatki commented Jul 30, 2021

csala commented Jul 30, 2021

npatki commented Jul 30, 2021

csala commented Jul 30, 2021 • edited Loading

npatki commented Jul 30, 2021

csala commented Aug 3, 2021

csala commented Jul 30, 2021 •

edited

Loading