Description
I really like the pandas fluent / method chaining interface but it is not always convenient to use. I often end up writing code like so:
import pandas as pd
import datetime as dt
df = pd.DataFrame([["05SEP2014", "a"]], columns=["date", "other_col"])
def fix_date_a(df):
    df['date'] = df["date"].apply(lambda x: dt.datetime.strptime(x, "%d%b%Y"))
    return df
# of course this example could have been vectorized much better like so
def fix_date_b(df):
    df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
    return df
This is ok, but I would much rather write this in a fluent style. As far as I am aware this is only possible with .assign and .pipe, and would give something like so:
from functools import partial
def element_wise_date_parser(x):
    return dt.datetime.strptime(x, "%d%b%Y")

vectorized_date_parser = partial(pd.to_datetime, format='%d%b%Y')

def fix_date_c(df):
    return df.assign(date=lambda x: x["date"].apply(element_wise_date_parser))

def fix_date_d(df):
    return df.assign(date=lambda x: vectorized_date_parser(x["date"]))
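For comparison, the .pipe route would look roughly like the sketch below. The helper name parse_date_column is made up for illustration; it works on a copy so the caller's dataframe is left untouched.

```python
import pandas as pd

df = pd.DataFrame([["05SEP2014", "a"]], columns=["date", "other_col"])

def parse_date_column(frame, column, fmt):
    # Hypothetical helper: return a copy with one column parsed as datetimes.
    out = frame.copy()
    out[column] = pd.to_datetime(out[column], format=fmt)
    return out

# .pipe forwards the extra positional and keyword arguments to the helper.
result = df.pipe(parse_date_column, "date", fmt="%d%b%Y")
```

This keeps the chain intact, but it still needs a dedicated helper for what is really a one-column operation.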
I find that syntax not so easy to read and write. Since I quite often need to perform an operation on a single column of a dataframe, I would like a better way to do that. I propose adding an apply_to method for the element-wise case, for which I made an example monkeypatch implementation:
def apply_to(self, column_name, function):
    return self.assign(**{column_name: lambda x: x[column_name].apply(function)})

pd.DataFrame.apply_to = apply_to

def fix_date_e(df):
    return df.apply_to("date", element_wise_date_parser)

# not sure what a column-wise function could be named, let's say operate_on
def operate_on(self, column_name, function):
    return self.assign(**{column_name: lambda x: function(x[column_name])})

pd.DataFrame.operate_on = operate_on

def fix_date_f(df):
    return df.operate_on("date", vectorized_date_parser)
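To show how this would read in a longer chain, here is a small self-contained sketch. It assumes the two monkeypatched methods proposed above (they are not part of pandas) and repeats their definitions so the snippet runs on its own.

```python
from functools import partial

import pandas as pd

# Proposed methods, repeated here so the example is self-contained.
def apply_to(self, column_name, function):
    return self.assign(**{column_name: lambda x: x[column_name].apply(function)})

def operate_on(self, column_name, function):
    return self.assign(**{column_name: lambda x: function(x[column_name])})

pd.DataFrame.apply_to = apply_to
pd.DataFrame.operate_on = operate_on

df = pd.DataFrame([["05SEP2014", "a"]], columns=["date", "other_col"])

# Each step returns a new dataframe, so the original df is not modified.
result = (
    df
    .operate_on("date", partial(pd.to_datetime, format="%d%b%Y"))
    .apply_to("other_col", str.upper)
)
```

One limitation of this sketch: because it goes through assign with keyword expansion, column_name has to be a string.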
With these methods I find it much easier to modify a single column without breaking the fluent interface. Of course, adding even more methods to the already broad dataframe API is not free, so I am not 100% sure this is a good idea. But I wanted to put it up here anyway, as I often see myself and others cluttering code with unnecessary intermediate dataframes and/or repeatedly reassigning df due to not knowing how to keep the fluent interface. And sprinkling "assign with lambdas" everywhere is also not that appealing.