eval does not share encoding transformers #250

Closed
@bvanbreugel

Description

In metrics/eval.py, each dataset (e.g. X_gt, X_syn) is encoded separately, so a separate sklearn.preprocessing.LabelEncoder is fitted for each one. This leads to unexpected behaviour whenever the unique values in a column are not identical between X_gt and X_syn: the same integer code then refers to different categories in the two datasets.

How to Reproduce

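import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Each call fits a fresh encoder, so the integer codes are not comparable.
df_real = LabelEncoder().fit_transform(pd.DataFrame(["0", "1", "2"])[0])
# array([0, 1, 2])
df_syn = LabelEncoder().fit_transform(pd.DataFrame(["1", "2", "2"])[0])
# array([0, 1, 1])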

Expected Behavior

Both datasets should share one encoding, so in the example above the processed df_syn should be [1, 2, 2].

Fix

It seems we can obtain the fitted encoders when calling X_gt.encode() and pass them to all other encode calls.
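
The idea can be sketched in plain Python. This is only an illustration, not the library's implementation: `fit_encoder` and `encode` are hypothetical helpers that stand in for a fitted LabelEncoder (which also assigns codes in sorted order of the distinct values), and raising on unseen categories is an assumption about the desired behaviour.

```python
# Hypothetical helpers for illustration; not part of the library's API.
def fit_encoder(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, encoder):
    """Encode values with an already-fitted mapping; fail on unseen categories."""
    unseen = set(values) - set(encoder)
    if unseen:
        raise ValueError(f"values not seen when fitting: {sorted(unseen)}")
    return [encoder[value] for value in values]


real = ["0", "1", "2"]
syn = ["1", "2", "2"]

encoder = fit_encoder(real)           # fit once, on the ground truth only
real_encoded = encode(real, encoder)
syn_encoded = encode(syn, encoder)    # reuse the same mapping

print(real_encoded)  # [0, 1, 2]
print(syn_encoded)   # [1, 2, 2]
```

With sklearn itself, the equivalent is to call fit (or fit_transform) once on the X_gt column and only transform on the others. One open design question is what to do when X_syn contains a category absent from X_gt: LabelEncoder.transform raises a ValueError in that case, which the sketch mirrors.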
