[SPIKE] Investigate whether Woodwork can be expanded to handle incoming string dtypes #1617

Open
@ParthivNaresh

Description

Currently, when a user creates a pandas DataFrame and passes it to Woodwork, pandas has already inferred the dtypes of some columns, which makes Woodwork's type inference significantly easier. However, there may be cases where all incoming data is text and every column has a dtype of string.

For a dataframe initialized like this:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df["ints"] = list(range(100))
df["floats"] = [i * 1.1 for i in range(100)]
df["bools"] = [True, False, False, True, False] * 20
df["bools_nan"] = [True, False, False, True, pd.NA] * 20
df["strings"] = [f"{i}" for i in range(100)]
df["categoricals"] = np.random.choice(["Yellow", "Blue", "Red"], 100)

Subsequent Woodwork initialization yields the expected result:
[Screenshot: Woodwork output for the original dataframe]

But converting every column to the string dtype prior to Woodwork initialization:

for col in df.columns:
    df[col] = df[col].astype("string")

yields this:
[Screenshot: Woodwork output after converting all columns to string]
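
The loss of type information can also be seen in pandas alone, before Woodwork is involved. The following is a minimal sketch on a smaller frame than the one above (the five-row data and the variable names are illustrative assumptions, not part of the original report):

```python
import pandas as pd

# Small illustrative frame; pandas infers a specific dtype for each column.
df_small = pd.DataFrame(
    {
        "ints": list(range(5)),
        "floats": [i * 1.1 for i in range(5)],
        "bools": [True, False, False, True, False],
    }
)
print(df_small.dtypes)

# After the conversion, every column reports the same "string" dtype,
# so the dtype no longer says anything about the underlying values.
as_strings = df_small.astype("string")
print(as_strings.dtypes)
print(as_strings["bools"].tolist())
```

Once the dtype is the same everywhere, any inference has to look at the values themselves.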

This spike covers investigating what solution(s) exist for this, and how and in what order they should be tackled (one logical type at a time, or with a single approach that handles all types at once).
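
As one possible starting point for the "by logical type" option, here is a rough sketch of a pre-inference pass that tries to recover a narrower pandas dtype from an all-text column. It uses only pandas and is not Woodwork's implementation. The helper name `coerce_string_series`, the `BOOL_STRINGS` list, and the order of the checks (booleans, then numbers) are all assumptions made for illustration.

```python
import pandas as pd

# Assumption: booleans appear only as Python's str(bool) spellings.
BOOL_STRINGS = ["True", "False"]


def coerce_string_series(series: pd.Series) -> pd.Series:
    """Hypothetical helper: recover a narrower dtype from an all-text column."""
    s = series.astype("string")
    non_null = s.dropna()
    if non_null.empty:
        return s

    # Booleans: every non-null value is "True" or "False".
    if non_null.isin(BOOL_STRINGS).all():
        # Comparing a "string" column keeps missing values as <NA>.
        return (s == "True").astype("boolean")

    # Numbers: parse only the non-null values; unparseable text becomes NaN.
    parsed = pd.to_numeric(non_null.astype(object), errors="coerce")
    if parsed.notna().all():
        nullable = "Int64" if pd.api.types.is_integer_dtype(parsed) else "Float64"
        # Reindex so rows that were missing come back as <NA>
        # (assumes a unique index).
        return parsed.astype(nullable).reindex(s.index)

    # Anything else stays as text.
    return s


text_df = pd.DataFrame(
    {
        "ints": ["0", "1", "2"],
        "floats": ["0.0", "1.1", "2.2"],
        "bools_nan": ["True", pd.NA, "False"],
        "categoricals": ["Yellow", "Blue", "Red"],
    }
).astype("string")

recovered = pd.DataFrame(
    {col: coerce_string_series(text_df[col]) for col in text_df.columns}
)
print(recovered.dtypes)
```

Under this sketch, the `strings` column from the example above (digits stored as text) would come back as integers, since nothing distinguishes it from `ints` once everything is text. That ambiguity is one of the things the spike would need to settle.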
