ENH: Apply function on a column easily, maintaining fluent interface #38229

Closed
@tdamsma

I really like the pandas fluent (method-chaining) interface, but it is not always convenient to use. I often end up writing code like this:

import pandas as pd
import datetime as dt

df = pd.DataFrame([["05SEP2014", "a"]], columns=["date", "other_col"])

def fix_date_a(df):
    df['date'] = df["date"].apply(lambda x: dt.datetime.strptime(x, "%d%b%Y"))
    return df

# Of course, this example is better written in vectorized form:

def fix_date_b(df):
    df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
    return df

This is OK, but I would much rather write it in a fluent style. As far as I am aware, this is only possible with .assign and .pipe, which gives something like this:

from functools import partial

def element_wise_date_parser(x):
    return dt.datetime.strptime(x, "%d%b%Y")

vectorized_date_parser = partial(pd.to_datetime, format='%d%b%Y')


def fix_date_c(df):
    return df.assign(date=lambda x: x["date"].apply(element_wise_date_parser))

def fix_date_d(df):
    return df.assign(date=lambda x: vectorized_date_parser(x["date"]))
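For comparison, the .pipe route mentioned above might look roughly like the sketch below. The helper name `parse_date_column` and the follow-up `rename` step are made up for illustration and are not part of pandas or of this proposal; the point is only that a plain function taking and returning a dataframe can sit inside a chain.

```python
import pandas as pd

df = pd.DataFrame([["05SEP2014", "a"]], columns=["date", "other_col"])

def parse_date_column(frame, column, fmt):
    # Hypothetical helper: returns a new frame via assign, so the input is not mutated
    return frame.assign(**{column: pd.to_datetime(frame[column], format=fmt)})

result = (
    df
    .pipe(parse_date_column, "date", "%d%b%Y")  # extra args are forwarded by pipe
    .rename(columns={"other_col": "label"})     # the chain continues as usual
)
```

This keeps the chain intact, but every such operation needs its own named helper (or a lambda), which is the verbosity this issue is about.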

I find that syntax hard to read and write. Since I often need to perform an operation on a single column of a dataframe, I would like a better way to do that. I propose adding an apply_to method for the element-wise case, for which I made an example monkeypatch implementation:

def apply_to(self, column_name, function):
    return self.assign(**{column_name: lambda x: x[column_name].apply(function)})

pd.DataFrame.apply_to = apply_to

def fix_date_e(df):
    return df.apply_to("date", element_wise_date_parser)

# not sure what a column-wise function could be named; let's say operate_on
def operate_on(self, column_name, function):
    return self.assign(**{column_name: lambda x: function(x[column_name])})

pd.DataFrame.operate_on = operate_on

def fix_date_f(df):
    return df.operate_on("date", vectorized_date_parser)
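To show how the two proposed methods would read inside a longer chain, here is a self-contained sketch. It repeats the monkeypatches from above so it can run on its own; the `str.upper` and `rename` steps and the variable names `raw` and `result` are only illustrative assumptions. Because both methods are built on assign, the original dataframe should be left untouched.

```python
import datetime as dt
import pandas as pd

# Same monkeypatches as in the proposal (not part of pandas itself)
def apply_to(self, column_name, function):
    # element-wise: function receives one value at a time
    return self.assign(**{column_name: lambda x: x[column_name].apply(function)})

def operate_on(self, column_name, function):
    # column-wise: function receives the whole Series
    return self.assign(**{column_name: lambda x: function(x[column_name])})

pd.DataFrame.apply_to = apply_to
pd.DataFrame.operate_on = operate_on

raw = pd.DataFrame([["05SEP2014", "a"]], columns=["date", "other_col"])

result = (
    raw
    .apply_to("date", lambda s: dt.datetime.strptime(s, "%d%b%Y"))
    .operate_on("other_col", lambda col: col.str.upper())
    .rename(columns={"other_col": "label"})
)
```

Unlike fix_date_a, which writes into the dataframe it receives, nothing here modifies raw.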

With these methods I find it much easier to modify a column in place without breaking the fluent interface. Of course, adding even more methods to the already broad dataframe API is not free, so I am not 100% sure this is a good idea. But I wanted to put it up here anyway, as I often see myself and others cluttering code with unnecessary intermediate dataframes and/or repeatedly reassigning df because we don't know how to keep the fluent interface. And sprinkling "assign with lambdas" everywhere is not that appealing either.

Labels: Enhancement, Needs Triage
