Closed
Description
In metrics/eval.py, each dataset (e.g. X_gt, X_syn) is encoded separately, so a separate sklearn.preprocessing.LabelEncoder is fitted for each one. This is a problem whenever the unique values of a column are not identical in X_gt and X_syn: the same integer code then refers to different categories in the two datasets, so the encoded X_gt and X_syn are no longer comparable.
How to Reproduce
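import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Each call fits a fresh encoder, so the label-to-integer mapping differs.
df_real = LabelEncoder().fit_transform(pd.DataFrame(["0", "1", "2"])[0])
# array([0, 1, 2])
df_syn = LabelEncoder().fit_transform(pd.DataFrame(["1", "2", "2"])[0])
# array([0, 1, 1])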
Expected Behavior
The encoded df_syn should be [1, 2, 2], so that each integer code refers to the same category as in df_real.
Fix
It seems we can just get the fitted encoders when calling X_gt.encode(), and pass them to all other encode calls.
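A minimal sketch of that idea, using plain pandas DataFrames rather than the library's own data loaders. The helpers fit_encoders and encode_with are hypothetical names for illustration only and are not the actual encode() API:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def fit_encoders(df):
    """Fit one LabelEncoder per column of the reference frame (hypothetical helper)."""
    return {col: LabelEncoder().fit(df[col]) for col in df.columns}


def encode_with(df, encoders):
    """Encode each column of df with the already-fitted encoders (hypothetical helper)."""
    out = df.copy()
    for col, enc in encoders.items():
        out[col] = enc.transform(df[col])
    return out


X_gt = pd.DataFrame({"a": ["0", "1", "2"]})
X_syn = pd.DataFrame({"a": ["1", "2", "2"]})

# Fit once on the ground truth, then reuse the same mapping everywhere.
encoders = fit_encoders(X_gt)
gt_encoded = encode_with(X_gt, encoders)
syn_encoded = encode_with(X_syn, encoders)

print(gt_encoded["a"].tolist())   # [0, 1, 2]
print(syn_encoded["a"].tolist())  # [1, 2, 2]
```

One caveat with this approach: LabelEncoder.transform raises a ValueError for labels it did not see during fit, so if X_syn can contain categories absent from X_gt, the encoders would need to be fitted on the union of both datasets instead.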