-
Notifications
You must be signed in to change notification settings - Fork 379
Description
Problem Description
In certain cases, it is possible to have a schema in which a column is both (a) a primary key of the table, and (b) a foreign key linking to a different table at the same time. This generally happens when there is a "0 or 1" linkage between two tables.
For example, consider a table Users that contains all users of a service:
User ID | Date Joined | Subscription Type |
---|---|---|
id_03234 | June 2, 2024 | Normal |
id_93813 | Apr 4, 2023 | Normal |
id_9192 | Dec 7, 2024 | Premium |
... | ... | ... |
Then consider another table Survey Responses that contains optional survey responses submitted by those users. Not all users filled out the survey, and the ones who did could only submit it 1 time.
User ID | Age | Satisfaction |
---|---|---|
id_03234 | 43 | Very Satisfied |
id_9192 | 23 | Slightly Satisfied |
In this sense, the Users the Survey Responses have a "0 or 1" mapping.
In this schema, the Survey Responses.User ID
column is special:
- It is the primary key of the Survey Responses table, as it is uniquely identifying each response
- It is also the foreign key into
Users.User ID
because it is referring to the same concept of a user
However, the metadata definition won't currently allow me to specify the column as both a primary and foreign key. The metadata validation fails:
InvalidMetadataError: The metadata is not valid
Relationships:
Invalid relationship between table 'Users' and table 'Survey Responses'. A relationship must connect a primary key with a non-primary key.
Expected behavior
I expect that I should be able to add such a relationship to my metadata and that the synthesizers would create valid synthetic data ensuring that:
- The values in the column are all unique (because it's a primary key) and
- The values in the column are referencing values in the parent table's primary key too (because it's a foreign key)
Workaround
The simple workaround for anyone running into this is: Only supply the foreign key designation in the metadata for now (and leave out the primary key designation).
In this example: Do not designate that Survey Responses.User ID
is a primary key, but do designate it as a foreign key into Users.User ID
.
This will create fully valid synthetic data that contains fully unique values for Survey Responses.User ID
. Even though it's not designated as a primary key, all SDV synthesizers respect the cardinality logic so the synthetic data will be valid.