Choose a different seed for each transformer #619

Closed
npatki opened this issue Mar 13, 2023 · 2 comments · Fixed by #622

npatki commented Mar 13, 2023

Problem Description

If I have two columns that are both assigned to AnonymizedFaker (e.g., for uuid4), then the reverse transform always produces the exact same values for both columns. This is unexpected, as I'd want each uuid column to have different values.

See SDV issue 1303.

Expected behavior

We should change the behavior for the AnonymizedFaker and PseudoAnonymizedFaker transformers.

  1. When the HyperTransformer is first initialized, it should set a different seed for each of these transformers. The seed values should be deterministic (e.g., the transformer for column1 is always set to seed=37). Each transformer should store its own seed separately.
  2. When the HyperTransformer is reset, each of these transformers should reset its seed to the value stored in step (1).
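
The two steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not RDT code: `SeededUuidGenerator` is a hypothetical stand-in for a transformer such as AnonymizedFaker, and deriving the seed from the column's position is just one assumed way to get distinct, deterministic seeds.

```python
import random
import uuid


class SeededUuidGenerator:
    """Hypothetical stand-in for a transformer that generates fake UUIDs."""

    def __init__(self, seed):
        self.seed = seed  # stored so that reset() can restore it later
        self._rng = random.Random(seed)

    def generate(self, n):
        # Build version-4 UUIDs from the generator's own random stream.
        return [
            str(uuid.UUID(int=self._rng.getrandbits(128), version=4))
            for _ in range(n)
        ]

    def reset(self):
        # Step 2: go back to the seed that was stored at initialization.
        self._rng = random.Random(self.seed)


# Step 1: give each column's generator a distinct, deterministic seed.
# (Assumption: the column order is fixed, so the index is stable.)
columns = ['user_id', 'session_id']
generators = {
    name: SeededUuidGenerator(seed=index)
    for index, name in enumerate(columns)
}

first = {name: gen.generate(2) for name, gen in generators.items()}

for gen in generators.values():
    gen.reset()

second = {name: gen.generate(2) for name, gen in generators.items()}
```

With this setup the two columns receive different values, and the values after a reset match the ones produced the first time.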
@npatki npatki added the feature request Request for a new feature label Mar 13, 2023
@npatki npatki changed the title from "Choose a different seed every time the a transformer is initialized" to "Choose a different seed for each transformer" Mar 13, 2023

npatki commented Mar 14, 2023

Implementation suggestion: Set the seed based on a hash of the column name.
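
One possible way to do this, as a sketch only: the helper name `seed_from_column_name` is hypothetical and not part of RDT. It uses `hashlib` rather than the built-in `hash()`, because string hashes from `hash()` are randomized per Python process by default and so would not give a deterministic seed.

```python
import hashlib


def seed_from_column_name(column_name):
    """Derive a deterministic 32-bit seed from a column name (hypothetical helper)."""
    digest = hashlib.sha256(column_name.encode('utf-8')).digest()
    # Keep the first 4 bytes so the seed fits in an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], 'big')


seed_a = seed_from_column_name('user_id')
seed_b = seed_from_column_name('session_id')
```

The same column name always maps to the same seed across runs, and different names will almost always map to different seeds.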


npatki commented Mar 21, 2023

@fealho Another suggestion: Instead of using just the column name, create the hash from a combination of the column name and some value(s) in the data itself.

Rationale: Different tables may have the same column name. E.g., there can be an address column in both the vendors table and the customers table. The seed should not be the same for both columns, since that would lead to the unexpected result of vendors and customers having the same addresses!
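
A rough sketch of that idea, again with the standard library only: `seed_from_column_and_data` and its `sample_size` parameter are hypothetical names, and hashing the first few values is just one assumed way to pick "some value(s)" from the data.

```python
import hashlib


def seed_from_column_and_data(column_name, values, sample_size=5):
    """Derive a seed from the column name plus a few data values (hypothetical helper)."""
    hasher = hashlib.sha256()
    hasher.update(column_name.encode('utf-8'))
    for value in list(values)[:sample_size]:
        # A separator byte keeps ('ab', 'c') distinct from ('a', 'bc').
        hasher.update(b'\x00')
        hasher.update(str(value).encode('utf-8'))
    return int.from_bytes(hasher.digest()[:4], 'big')


# Same column name in two tables, but different underlying data.
vendor_addresses = ['1 Market St', '22 Dock Rd']
customer_addresses = ['9 Elm Ave', '41 Oak Ln']

vendor_seed = seed_from_column_and_data('address', vendor_addresses)
customer_seed = seed_from_column_and_data('address', customer_addresses)
```

Because the data values feed into the hash, the two address columns end up with different seeds even though their names are identical. The trade-off is that the seed now depends on the data the transformer was fit on.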
