Incorrectly enforced rounding on numerical/float data columns #1039

Closed
dionman opened this issue Sep 28, 2022 · 3 comments
Labels
bug Something isn't working data:single-table Related to tabular datasets resolution:resolved The issue was fixed, the question was answered, etc.

Comments

dionman commented Sep 28, 2022

Error Description

I’m fitting CT-GAN and TVAE on the attached halfmoon dataset (using the default parameters). When I sample from the model, the float variables come out discretised to the closest integer. Have you observed behaviour like this in any other settings? It seems to me it’s probably due to some transformation. I’m using the fit_sample() function as defined in the SDGym repo to fit the model and sample from it. On the other hand, I get sane output if I instead use fit() and sample() as implemented in ctgan.synthesizers.ctgan.CTGANSynthesizer.
halfmoon.zip
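
One way to confirm the symptom described above is to check, column by column, whether the real data has fractional values while the sampled data does not. The sketch below assumes both tables are pandas DataFrames. The helper name `integer_valued_float_columns` and the toy values are hypothetical and are not part of SDV or SDGym.

```python
import pandas as pd


def integer_valued_float_columns(df: pd.DataFrame) -> list[str]:
    """Return float columns whose non-null values are all whole numbers."""
    flagged = []
    for name in df.select_dtypes(include="float").columns:
        values = df[name].dropna()
        if len(values) > 0 and (values % 1 == 0).all():
            flagged.append(name)
    return flagged


# Toy stand-ins for the real and sampled tables (hypothetical values).
real = pd.DataFrame({"x": [0.12, 1.57, -0.33], "y": [0.5, 0.25, 0.75]})
sampled = pd.DataFrame({"x": [0.0, 2.0, -1.0], "y": [1.0, 0.0, 1.0]})

# Columns that are fractional in the real data but whole in the samples.
suspicious = sorted(
    set(integer_valued_float_columns(sampled))
    - set(integer_valued_float_columns(real))
)
print(suspicious)
```

Any column listed in `suspicious` was continuous in the training data but came back integer-valued, which is the behaviour reported here.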

@dionman dionman added bug Something isn't working new Automatic label applied to new issues labels Sep 28, 2022
Contributor

npatki commented Sep 28, 2022

Hi @dionman, thanks so much for filing this and providing the data/metadata. I can confirm that I'm able to replicate it.

The issue appears to be in how we're learning & enforcing rounding. Only the wrappers in sdv have this feature, so the underlying ML model in the ctgan library is unaffected.

From some of my own experiments, I've found the following:

  1. There is a learn_rounding_scheme parameter when creating the model, but setting it to False does not help.
  2. The rounding is currently done by the RDT library. There seems to be an issue when the number of digits is greater than 14 (see issue).
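
To make the second point more concrete, here is a minimal sketch of what "learning a rounding scheme" could look like: find the smallest number of decimal places that reproduces every value exactly, and give up past a cutoff. This is not RDT's actual implementation. The function name `learn_rounding_digits` and the cutoff constant are assumptions made for illustration, with the cutoff taken from the 14-digit observation above.

```python
from typing import Optional

import pandas as pd

MAX_DECIMALS = 14  # assumed cutoff, based on the 14-digit observation above


def learn_rounding_digits(
    column: pd.Series, max_decimals: int = MAX_DECIMALS
) -> Optional[int]:
    """Return the fewest decimal places that reproduce every value exactly.

    Returns None when no count up to max_decimals works (or the column has
    no values), meaning the column should be left unrounded.
    """
    values = column.dropna()
    if values.empty:
        return None
    for decimals in range(max_decimals + 1):
        if (values.round(decimals) == values).all():
            return decimals
    return None


# Hypothetical values with more digits than the cutoff can represent.
high_precision = pd.Series([0.123456789012345678, 1.987654321098765432])

print(learn_rounding_digits(pd.Series([0.5, 1.25, 3.0])))
print(learn_rounding_digits(high_precision))
```

The point of the sketch is the final `return None`: values with more digits than the cutoff need an explicit "do not round" outcome. If that case were instead treated as zero decimal places, every sample would be rounded to an integer. Whether that is the cause in RDT is not confirmed in this thread.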

Workaround

If it doesn't affect quality too much, I'd suggest just rounding your data to 14 decimal places before training the model.

# Round the numeric columns to 14 decimal places, then fit on the rounded copy
rounded_data = data.round(14)
model.fit(rounded_data)
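
Before relying on the workaround, it may be worth checking how much the rounding actually changes the data. The sketch below uses only pandas. The helper name `round_float_columns` and the toy DataFrame are hypothetical, and fitting the model is left out because it depends on the SDV version in use.

```python
import pandas as pd


def round_float_columns(data: pd.DataFrame, decimals: int = 14) -> pd.DataFrame:
    """Return a copy of data with only its float columns rounded."""
    rounded = data.copy()
    float_cols = rounded.select_dtypes(include="float").columns
    rounded[float_cols] = rounded[float_cols].round(decimals)
    return rounded


# Toy data standing in for the halfmoon table (hypothetical values).
data = pd.DataFrame({
    "x": [0.123456789012345678, 1.987654321098765432],
    "label": ["a", "b"],
})
rounded_data = round_float_columns(data)

# Largest absolute change introduced by rounding.
max_change = (data["x"] - rounded_data["x"]).abs().max()
print(max_change)
```

`DataFrame.round` already leaves non-numeric columns alone, so selecting the float columns explicitly only makes the intent visible. If `max_change` is negligible for your use case, `rounded_data` can be passed to `model.fit` as in the workaround.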

Fixes

The RDT library should be fixed, although we actually plan to stop using RDT for rounding in future SDV releases. We should keep this issue open until we verify a fix.

Separately, I'm not sure why learn_rounding_scheme cannot be turned off right now. I'll file another issue to track this.

@npatki npatki added under discussion Issue is currently being discussed data:single-table Related to tabular datasets and removed new Automatic label applied to new issues labels Sep 28, 2022
@dionman dionman closed this as completed Sep 28, 2022
Contributor

npatki commented Sep 29, 2022

Let's keep this one open until we fix & verify the underlying bug throughout the SDV.

Contributor

npatki commented Mar 10, 2023

Hi everyone, great news! This issue has now been resolved in the new SDV 1.0 (Beta!) release.

For more information and to get started, see the SDV 1.0 demos.

@npatki npatki closed this as completed Mar 10, 2023
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. SDV 1.0 (Beta!) labels Mar 10, 2023
garrgravarr pushed a commit to dieterich-lab/ASyH that referenced this issue May 8, 2023