Metadata Detection Fails with new Data Type #2182

Closed
pvk-developer opened this issue Aug 12, 2024 · 0 comments · Fixed by #2184
Labels
bug Something isn't working
Error Description

Hardcoded logic in our software causes metadata.detect_from_dataframe to fail on data types beyond the primitive ones. The detection code handles only a fixed set of primitive data types and, for anything else, raises an error stating that only those types are supported.

Steps to reproduce

import pandas as pd
from sdv.metadata import SingleTableMetadata

data = pd.DataFrame({
    'UInt8': pd.Series([1, 2, None, 3], dtype='UInt8'),
    'UInt16': pd.Series([1, 2, None, 3], dtype='UInt16'),
    'UInt32': pd.Series([1, 2, None, 3], dtype='UInt32'),
    'UInt64': pd.Series([1, 2, None, 3], dtype='UInt64'),
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

sdv.metadata.errors.InvalidMetadataError: Unsupported data type for column 'UInt8' (kind: u).The valid data types are: 'object', 'int', 'float', 'datetime', 'bool'.

Expected behavior

We should be able to detect the following data types:

  • UInt(8, 16, 32, 64) -- pandas
  • uint(8, 16, 32, 64) -- NumPy
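
The "kind: u" in the error message is NumPy's one-character kind code for unsigned integers. As a quick check, assuming only that pandas is installed, the snippet below inspects the kind reported by each of the dtypes listed above. The variable names are illustrative and not part of SDV.

```python
import pandas as pd

# pandas nullable unsigned integers (capital "U"); these accept missing values.
pandas_kinds = {
    name: pd.Series([1, 2, None, 3], dtype=name).dtype.kind
    for name in ['UInt8', 'UInt16', 'UInt32', 'UInt64']
}

# NumPy-backed unsigned integers (lowercase "u"); no missing values allowed.
numpy_kinds = {
    name: pd.Series([1, 2, 3], dtype=name).dtype.kind
    for name in ['uint8', 'uint16', 'uint32', 'uint64']
}

print(pandas_kinds)
print(numpy_kinds)
```

Both families report the same kind code, so a single additional branch on that code would cover all eight dtypes.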

Additional Context

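This is the current code that we use to detect data types; it has to be extended to support more of them. Note that the dtype variable here holds the one-character dtype kind (the value shown as "kind" in the error message), not the full dtype name.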

if dtype in self._DTYPES_TO_SDTYPES:
    sdtype = self._DTYPES_TO_SDTYPES[dtype]
elif dtype in ['i', 'f']:
    sdtype = self._determine_sdtype_for_numbers(column_data)

elif dtype == 'O':
    sdtype = self._determine_sdtype_for_objects(column_data)

if sdtype is None:
    raise InvalidMetadataError(
        f"Unsupported data type for column '{field}' (kind: {dtype})."
        "The valid data types are: 'object', 'int', 'float', 'datetime', 'bool'."
    )

To support a wider range of dtypes, we have to extend this logic with more dtype mappings. With the current configuration, we do not support pandas nullable ints or any other data types outside the primitive ones loaded with a NumPy backend.
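
One possible direction is sketched below as a standalone function, not as the actual SDV change. The names KIND_TO_SDTYPE, NUMERIC_KINDS and guess_sdtype are hypothetical, and the sdtype strings are assumed to match the ones SDV uses. The real helpers (_determine_sdtype_for_numbers and _determine_sdtype_for_objects) do more work than this, so the sketch only illustrates routing the 'u' kind through the numeric branch.

```python
import pandas as pd

# Hypothetical stand-in for a kind-to-sdtype mapping (not SDV's real table).
KIND_TO_SDTYPE = {
    'b': 'boolean',
    'M': 'datetime',
}

# Kinds treated as numbers; 'u' is the addition this issue asks for.
NUMERIC_KINDS = {'i', 'u', 'f'}


def guess_sdtype(column_data: pd.Series) -> str:
    """Return a coarse sdtype for a column based on its dtype kind."""
    kind = column_data.infer_objects().dtype.kind
    if kind in KIND_TO_SDTYPE:
        return KIND_TO_SDTYPE[kind]
    if kind in NUMERIC_KINDS:
        # Simplification: the real numeric helper inspects the values too.
        return 'numerical'
    if kind == 'O':
        # Simplification: the real object helper inspects the values too.
        return 'categorical'
    raise ValueError(
        f"Unsupported data type for column '{column_data.name}' (kind: {kind})."
    )


data = pd.DataFrame({
    'UInt8': pd.Series([1, 2, None, 3], dtype='UInt8'),
    'uint16': pd.Series([1, 2, 3, 4], dtype='uint16'),
    'label': pd.Series(['a', 'b', 'c', 'd'], dtype=object),
})
detected = {name: guess_sdtype(data[name]) for name in data.columns}
print(detected)
```

Branching on the kind code rather than on full dtype names keeps the pandas nullable and NumPy variants on the same path, which is why one extra entry is enough for the unsigned integers.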
