Cast column to datetime if specified in field transformers #321

Closed
amontanez24 opened this issue Nov 1, 2021 · 0 comments · Fixed by #331

Labels: feature request

Problem Description

If a user specifies that a column should use the DatetimeTransformer, we should attempt to cast the column to datetime if it isn't one already. When data is loaded with pandas, datetime columns sometimes get a dtype of object, and the DatetimeTransformer doesn't work on them.
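
The requested cast could look roughly like the sketch below. This is only an illustration of the idea, not RDT code: `cast_fields_to_datetime` is a hypothetical helper name, and the small DataFrame stands in for the real dataset.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def cast_fields_to_datetime(data, fields):
    """Return a copy of ``data`` with each named column cast to datetime.

    Hypothetical helper for illustration only; it is not part of RDT.
    """
    data = data.copy()
    for field in fields:
        # Only touch columns that are not already a datetime dtype.
        if not is_datetime64_any_dtype(data[field]):
            # pd.to_datetime raises if a value cannot be parsed,
            # so unparseable input surfaces here instead of becoming null.
            data[field] = pd.to_datetime(data[field])
    return data


# Stand-in data: the date column is stored as strings with dtype object.
students = pd.DataFrame({
    'start_date': pd.Series(['2020-07-23', '2020-01-11', None], dtype=object),
    'salary': [27000.0, 20000.0, None],
})
converted = cast_fields_to_datetime(students, ['start_date'])
print(converted.dtypes)
```

Missing entries become NaT after the cast, so a null indicator can still be derived from the converted column.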

Expected behavior

Here is an example using the student_placements dataset.

import pandas as pd
from rdt import HyperTransformer
from rdt.transformers import DatetimeTransformer

ht = HyperTransformer()
students = pd.read_csv('student_placements/student_placements.csv')
students
    degree_type  work_experience  experience_years  employability_perc  \
0      Sci&Tech            False                 0                55.0   
1      Sci&Tech             True                 1                86.5   
2     Comm&Mgmt            False                 0                75.0   
3      Sci&Tech            False                 0                66.0   
4     Comm&Mgmt            False                 0                96.8   
..          ...              ...               ...                 ...   
210   Comm&Mgmt            False                 0                91.0   
211    Sci&Tech            False                 0                74.0   
212   Comm&Mgmt             True                 1                59.0   
213   Comm&Mgmt            False                 0                70.0   
214   Comm&Mgmt            False                 0                89.0   

    mba_spec  mba_perc   salary  placed start_date   end_date  duration  
0     Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0  
1    Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0  
2    Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0  
3     Mkt&HR     59.43      NaN   False        NaT        NaT       NaN  
4    Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0  
..       ...       ...      ...     ...        ...        ...       ...  
210  Mkt&Fin     74.49  40000.0    True 2020-07-27 2020-10-20       3.0  
211  Mkt&Fin     53.62  27500.0    True 2020-01-23 2020-08-04       6.0  
212  Mkt&Fin     69.72  29500.0    True 2020-01-25 2020-08-05       6.0  
213   Mkt&HR     60.23  20400.0    True 2020-01-19 2020-04-20       3.0  
214   Mkt&HR     60.22      NaN   False        NaT        NaT       NaN  
ht.set_first_transformers_for_fields({'start_date': DatetimeTransformer, 'end_date': DatetimeTransformer})
transformed = ht.transform(students)
transformed
     start_date.value  start_date.is_null  end_date.value  end_date.is_null  \
0                 0.0                 1.0             0.0               1.0   
1                 0.0                 1.0             0.0               1.0   
2                 0.0                 1.0             0.0               1.0   
3                 0.0                 1.0             0.0               1.0   
4                 0.0                 1.0             0.0               1.0   
..                ...                 ...             ...               ...   
210               0.0                 1.0             0.0               1.0   
211               0.0                 1.0             0.0               1.0   
212               0.0                 1.0             0.0               1.0   
213               0.0                 1.0             0.0               1.0   
214               0.0                 1.0             0.0               1.0   

As you can see, all the date values are currently read as null and the transformer doesn't work, even though the values in the data could be cast to datetime.

The desired result would be as follows:

     start_date.value  start_date.is_null  end_date.value  end_date.is_null  \
0        1.595462e+18                 0.0    1.602461e+18               0.0   
1        1.578701e+18                 0.0    1.586390e+18               0.0   
2        1.579997e+18                 0.0    1.594598e+18               0.0   
3        1.583637e+18                 1.0    1.597485e+18               1.0   
4        1.593821e+18                 0.0    1.601165e+18               0.0   
..                ...                 ...             ...               ...   
210      1.595808e+18                 0.0    1.603152e+18               0.0   
211      1.579738e+18                 0.0    1.596499e+18               0.0   
212      1.579910e+18                 0.0    1.596586e+18               0.0   
213      1.579392e+18                 0.0    1.587341e+18               0.0   
214      1.583637e+18                 1.0    1.597485e+18               1.0   
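
The magnitudes in the desired output are consistent with nanoseconds since the Unix epoch: 2020-07-23 corresponds to 1595462400000000000 ns, which displays as 1.595462e+18. The sketch below shows that numeric form for a few stand-in values. It is not RDT's implementation, and it leaves null entries as NaN rather than guessing at how the transformer fills them.

```python
import pandas as pd

# Stand-in values; the None entry plays the role of an unplaced student.
dates = pd.to_datetime(
    pd.Series(['2020-07-23', None, '2020-07-04'], dtype=object)
)

# Timestamp.value is the integer count of nanoseconds since the Unix epoch.
value = dates.map(
    lambda ts: float(ts.value) if pd.notna(ts) else float('nan')
)
# 1.0 marks a missing date, 0.0 a present one.
is_null = dates.isna().astype(float)

result = pd.DataFrame({
    'start_date.value': value,
    'start_date.is_null': is_null,
})
print(result)
```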

Additional context

This happens because the dtype of the start_date and end_date columns is object by default. You can check this by running:

In [1]: students['start_date'].dtype
Out[1]: dtype('O')
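
Until the cast happens automatically, one user-side workaround is to ask pandas to parse the date columns at load time with the `parse_dates` argument of `pd.read_csv`. The sketch below uses a small in-memory CSV as a stand-in for the real file, so the column values are assumptions for illustration.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Stand-in CSV; the second row has no salary or dates.
csv_text = (
    "salary,start_date,end_date\n"
    "27000.0,2020-07-23,2020-10-12\n"
    ",,\n"
)

# Default load: the date columns are read as strings, not datetimes.
raw = pd.read_csv(io.StringIO(csv_text))

# Load with explicit date parsing for the two date columns.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['start_date', 'end_date'],
)

print(raw['start_date'].dtype)
print(parsed['start_date'].dtype)
print(is_datetime64_any_dtype(parsed['start_date']))
```

With the columns parsed up front, empty cells load as NaT instead of NaN strings, which is the form a datetime transformer would expect.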