TypeError when datetime column is all null (Cannot cast DatetimeArray to dtype float64) #1466

Closed
npatki opened this issue Jun 12, 2023 · 2 comments
Labels
bug Something isn't working resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@npatki
Contributor

npatki commented Jun 12, 2023

Environment Details

  • SDV version: 1.2.0
  • Python version: Any
  • Operating System: Any

Error Description

Sometimes I may have a datetime column that contains all NULL values (see RDT #367 for why this may happen).

If I try to input this column into the SDV, then I get an error when sampling.

Steps to reproduce

import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

metadata_dict = {
    'columns': {
        'A': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
        'B': { 'sdtype': 'numerical' }
    }
}

metadata = SingleTableMetadata.load_from_dict(metadata_dict)
data_nulls = pd.DataFrame(data={
    'A': [np.nan, np.nan, np.nan, np.nan, np.nan],
    'B': [23.12, 59.12, 10.00, 12.01, 10.11]
})

synth = GaussianCopulaSynthesizer(metadata)
synth.fit(data_nulls)
synth.sample(5)

Output:

TypeError: Cannot cast DatetimeArray to dtype float64

Additional Context

  • The sampling works if the column contains all numerical values
  • The _sample method returns a column full of 0.0 values (as intended)
  • The RDT HyperTransformer works fine for forward and reverse transform (see below). My guess is that since the values are reversed to pd.NaT, the SDV is erroring because it wants to convert them into floats somehow?

image

@npatki npatki added the bug Something isn't working label Jun 12, 2023
@npatki
Contributor Author

npatki commented Jun 22, 2023

Root Cause: Even though the input data has NaN values, the RDT is providing NaT values during sampling. This will be fixed by RDT #657.
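The NaN/NaT mismatch can be seen with pandas alone. The snippet below is a minimal sketch, not SDV or RDT code: it only shows that an all-NaN column is stored as float64, that its NaT counterpart has a datetime dtype, and that pandas refuses to cast the latter to float64. Whether SDV performs exactly this cast internally is an assumption based on the error message reported above.

```python
import numpy as np
import pandas as pd

# An all-NaN column is stored as float64 and carries no datetime information.
nan_column = pd.Series([np.nan] * 5)
print(nan_column.dtype)

# After conversion, the same column holds NaT values with a datetime dtype.
nat_column = pd.to_datetime(nan_column)
print(nat_column.dtype)

# Casting the datetime column to float64 is rejected by pandas with a TypeError.
try:
    nat_column.astype('float64')
    cast_error = None
except TypeError as error:
    cast_error = str(error)

print(cast_error)
```

If the reversed data comes back as NaT while the original column was float64, restoring the original dtype requires this kind of cast, which would be consistent with the reported error.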

Workaround

For now, you can work around this issue by converting the input data to NaT values instead of leaving them as NaN.

import pandas as pd

# this converts the NaN values to NaT values
data_nulls['A'] = pd.to_datetime(data_nulls['A'])

# now you can model and sample
synth.fit(data_nulls)
synth.sample(5)
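If several columns are affected, the same conversion can be driven by the metadata dictionary. This is a rough sketch under stated assumptions: `coerce_null_datetime_columns` is a hypothetical helper, not part of SDV, and it only touches columns that the metadata marks as `datetime` and that are entirely null, leaving everything else as it was.

```python
import numpy as np
import pandas as pd


def coerce_null_datetime_columns(data, metadata_dict):
    """Hypothetical helper: return a copy of `data` in which every all-null
    column marked as 'datetime' in the metadata holds NaT instead of NaN."""
    converted = data.copy()
    for name, column_meta in metadata_dict['columns'].items():
        is_datetime = column_meta.get('sdtype') == 'datetime'
        if is_datetime and name in converted.columns and converted[name].isna().all():
            # Same conversion as the workaround: NaN values become NaT.
            converted[name] = pd.to_datetime(converted[name])
    return converted


metadata_dict = {
    'columns': {
        'A': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
        'B': {'sdtype': 'numerical'},
    }
}
data_nulls = pd.DataFrame(data={
    'A': [np.nan, np.nan, np.nan, np.nan, np.nan],
    'B': [23.12, 59.12, 10.00, 12.01, 10.11],
})

fixed = coerce_null_datetime_columns(data_nulls, metadata_dict)
print(fixed.dtypes)
```

The returned frame can then be passed to `synth.fit` in place of the original data. The helper works on a copy, so the input DataFrame is left unchanged.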

@npatki
Contributor Author

npatki commented Mar 5, 2024

This has been fixed.

@npatki npatki closed this as completed Mar 5, 2024
@npatki npatki added the resolution:resolved The issue was fixed, the question was answered, etc. label Mar 5, 2024