Gaussian Copula is generating different data with metadata and without metadata. #576

Closed
kvrameshreddy opened this issue Aug 30, 2021 · 3 comments · Fixed by #599
Labels: data:single-table (Related to tabular datasets), question (General question about the software)

kvrameshreddy commented Aug 30, 2021

When we create a GaussianCopula model without explicitly defining metadata, the generated synthetic data preserves the properties of the sample data (such as the min and max values). When we create the same model and explicitly pass metadata, the generated data loses these properties.

import pandas as pd
from sdv.tabular import GaussianCopula

df1 = pd.read_csv("rounding_data_test.csv")

# With explicit metadata: every column is declared as a numerical float.
meta_default = {'fields': {'COL1': {'type': 'numerical', 'subtype': 'float'},
                           'COL2': {'type': 'numerical', 'subtype': 'float'},
                           'COL3': {'type': 'numerical', 'subtype': 'float'},
                           'COL4': {'type': 'numerical', 'subtype': 'float'}}}
model1 = GaussianCopula(table_metadata=meta_default)
model1.fit(df1)
new_data_with_meta = model1.sample(100)

# Without metadata: the model infers the column types from df1.
model2 = GaussianCopula()
model2.fit(df1)
new_data_without_meta = model2.sample(100)

rounding_data_test.csv

[Screenshot: gen_data_issue]
In the screenshot above, consider the COL2 column: in the original data (df1) its min value is 8.37 and its max is 196.8. The generated new_data_without_meta stays within the sample's min and max values, but new_data_with_meta contains negative values. I also compared the distribution of each column, and both models fit the same distributions.
[Screenshot: data_dist]
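For anyone who wants to reproduce the comparison without relying on screenshots, here is a minimal sketch of a range check. The helper name `bounds_report` and the toy frames are hypothetical and not part of SDV; only the COL2 bounds (8.37 and 196.8) come from the report above. In practice you would pass `df1` together with `new_data_with_meta` or `new_data_without_meta`.

```python
import pandas as pd


def bounds_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """For each shared numeric column, compare the synthetic range to the real range."""
    cols = [c for c in real.select_dtypes("number").columns if c in synthetic.columns]
    report = pd.DataFrame({
        "real_min": real[cols].min(),
        "real_max": real[cols].max(),
        "synth_min": synthetic[cols].min(),
        "synth_max": synthetic[cols].max(),
    })
    # Flag columns whose synthetic values fall outside the observed real range.
    report["out_of_bounds"] = (
        (report["synth_min"] < report["real_min"])
        | (report["synth_max"] > report["real_max"])
    )
    return report


# Toy stand-ins for df1 and the two samples; all values are made up
# except the COL2 min/max quoted in the issue.
real = pd.DataFrame({"COL2": [8.37, 50.0, 196.8]})
inside = pd.DataFrame({"COL2": [10.0, 120.5, 190.0]})
outside = pd.DataFrame({"COL2": [-3.2, 75.0, 150.0]})

print(bounds_report(real, inside))
print(bounds_report(real, outside))
```

A column flagged in `out_of_bounds` for the metadata model but not for the default model would confirm the behaviour described here.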

The same behaviour is seen with constraints: when they are passed directly to the model, the generated data is similar to the original, but when they are passed through explicit metadata, the model produces out-of-bounds values.
Can you please look into this issue?
Thank you.

npatki added the question and data:single-table labels on Sep 20, 2021
Contributor

katxiao commented Sep 21, 2021

Hi @kvrameshreddy, thanks for your question!

I agree that we should expect min and max values to be automatically detected the same way in both cases (when metadata is provided and when it is not). We will look into this issue soon.
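Until the detection is consistent in both cases, one possible stopgap is to clip the sampled values to the range observed in the training data. The sketch below is only an assumption about what such post-processing could look like; `clip_to_observed_range` is a hypothetical helper, and this is not a description of how SDV detects or enforces bounds internally.

```python
import pandas as pd


def clip_to_observed_range(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `synthetic` with shared numeric columns clipped to the min/max of `real`."""
    clipped = synthetic.copy()
    for col in real.select_dtypes("number").columns:
        if col in clipped.columns:
            clipped[col] = clipped[col].clip(
                lower=real[col].min(), upper=real[col].max()
            )
    return clipped


# Toy data: the COL2 bounds come from the issue, the sampled values are made up.
real = pd.DataFrame({"COL2": [8.37, 196.8]})
sampled = pd.DataFrame({"COL2": [-3.2, 75.0, 250.0]})

print(clip_to_observed_range(real, sampled))
```

Note that clipping piles the out-of-range samples onto the boundary values, so it distorts the tails of the distribution; it is a workaround rather than a fix.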

Author

kvrameshreddy commented Oct 12, 2021

Hi @katxiao, thanks for looking into this issue. When can we expect the 0.12.1 release?

Thank you.

Contributor

katxiao commented Oct 12, 2021

Hi @kvrameshreddy, it will go out this week!
