DEPR: Deprecate returning a DataFrame in Series.apply #52116

Closed
2 of 3 tasks
topper-123 opened this issue Mar 22, 2023 · 23 comments · Fixed by #52123
Labels
Apply Apply, Aggregate, Transform, Map Deprecate Functionality to remove in pandas Series Series data structure

@topper-123
Contributor

topper-123 commented Mar 22, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Series.apply (more precisely SeriesApply.apply_standard) normally returns a Series when supplied a callable, but if the callable returns a Series, it converts the returned array of Series into a DataFrame and returns that.

>>> small_ser = pd.Series(range(3))
>>> small_ser.apply(lambda x: pd.Series({"x": x, "x**2": x**2}))
   x  x**2
0  0     0
1  1     1
2  2     4

This approach is exceptionally slow and should be generally avoided. For example:

>>> import pandas as pd
>>> ser = pd.Series(range(10_000))
>>> %timeit ser.apply(lambda x: pd.Series({"x": x, "x**2": x ** 2}))
658 ms ± 623 µs per loop

In almost all cases, the same result can be obtained faster with other methods. For example, the example above can be written much faster as:

>>> %timeit ser.pipe(lambda x: pd.DataFrame({"x": x, "x**2": x**2}))
80.6 µs ± 682 ns per loop

In the rare case where a fast approach isn't possible, users should just construct the DataFrame manually from the list of Series instead of relying on the Series.apply method.
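As a minimal sketch of that manual construction (the helper name to_row is hypothetical and stands in for any function that returns a Series per element), the rows can be collected in a list and passed to the DataFrame constructor:

```python
import pandas as pd

ser = pd.Series(range(3))


def to_row(x):
    # Hypothetical per-element function that returns a small Series.
    return pd.Series({"x": x, "x**2": x**2})


# Build the list of Series explicitly instead of relying on Series.apply
# to expand the results into a DataFrame.
rows = [to_row(x) for x in ser]

# Each Series in the list becomes one row of the DataFrame.
result = pd.DataFrame(rows)
# Restore the original index explicitly.
result.index = ser.index
print(result)
```

This is still a slow, row-by-row approach; it only makes the DataFrame construction explicit.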

Also, allowing this construct complicates the return type of SeriesApply.apply_standard and therefore of the apply method. By removing the option to return a DataFrame, the return type of SeriesApply.apply_standard will be simplified to just Series.

All in all, this behavior is slow and makes things complicated without any benefit, so I propose to deprecate returning a DataFrame from SeriesApply.apply_standard.

@topper-123 topper-123 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member Deprecate Functionality to remove in pandas Apply Apply, Aggregate, Transform, Map Series Series data structure and removed Enhancement labels Mar 22, 2023
@rhshadrach rhshadrach removed the Needs Triage Issue that has not been reviewed by a pandas team member label Apr 6, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Apr 6, 2023
@HarrisonWilde

Could I ask, in the following situation, what should I do instead of .apply(pd.Series):

    train_configs = experiments["train_config"].apply(pd.Series)
    model_configs = experiments["model_config"].apply(pd.Series)
    experiments = experiments.drop(columns=["train_config", "model_config"]).join(train_configs).join(model_configs)

@topper-123
Contributor Author

You don't show the contents of those columns, nor the intended output, so it is difficult to help. However, from the looks of it, you should be using experiments["train_config"].map(pd.Series) etc.

If not, you need to give more information about the data and intended output, so I can better assess the need.

@HarrisonWilde

Apologies, I meant to attach the message below too:

When I run the code above, I see these warnings:

...: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  train_configs = experiments["train_config"].apply(pd.Series)
...: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  model_configs = experiments["model_config"].apply(pd.Series)

This can be reproduced with version 2.1.0 of pandas and a dataframe of the following form:

experiments = pd.DataFrame([{"train_config": {"num_epochs": 30, "patience": 10}, "model_config": {"num_layers": 3, "layer_dim": 256}}])

A lot of the discourse online implies this is the "standard" approach to unpacking dictionaries in columns. I am now using this method, but wondered if there was a recommended workflow in light of the deprecation:

    experiments = experiments.join(
        pd.DataFrame(experiments.pop("train_config").values.tolist(), index=experiments.index)
    )
    experiments = experiments.join(
        pd.DataFrame(experiments.pop("model_config").values.tolist(), index=experiments.index)
    )

@topper-123
Contributor Author

topper-123 commented Sep 5, 2023

The way I'd do it:

>>> import pandas as pd
>>> experiments = pd.DataFrame([{
    "train_config": {"num_epochs": 30, "patience": 10}, 
    "model_config": {"num_layers": 3, "layer_dim": 256},
}])
>>> train_config = pd.concat(experiments["train_config"].map(pd.Series).values)

and something similar for model_config. Also, you may want to do pd.concat(..., axis=1) if that's the output shape you're after.
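To illustrate the difference between the two output shapes, here is a small sketch. The two-row data and the run_a/run_b index labels are made up for illustration, and the final transpose is an added step, not part of the suggestion above:

```python
import pandas as pd

experiments = pd.DataFrame(
    {
        "train_config": [
            {"num_epochs": 30, "patience": 10},
            {"num_epochs": 50, "patience": 5},
        ]
    },
    index=["run_a", "run_b"],  # illustrative labels
)

# Series.map applies pd.Series element-wise and yields a Series whose
# elements are themselves Series; nothing is expanded into columns.
per_row = experiments["train_config"].map(pd.Series)

# Default axis=0: the small Series are stacked end to end into one long Series.
stacked = pd.concat(list(per_row))

# axis=1: each row's Series becomes a column labelled by the original index
# (via keys); transposing turns those columns back into rows.
train_config = pd.concat(list(per_row), axis=1, keys=per_row.index).T
print(train_config)
```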

@rhshadrach
Member

@HarrisonWilde - can your input only be a single row? I get odd results if I try your method on multiple rows.

@HarrisonWilde

@rhshadrach if by my method you mean:

    experiments = experiments.join(
        pd.DataFrame(experiments.pop("train_config").values.tolist(), index=experiments.index)
    )
    experiments = experiments.join(
        pd.DataFrame(experiments.pop("model_config").values.tolist(), index=experiments.index)
    )

then yes I have multiple rows and this seems to work fine. What odd results do you see?

I suspect there might be a neater way. The method @topper-123 mentions works for me too in that it avoids the call to apply.

@rhshadrach
Member

I meant the original method: #52116 (comment)

@HarrisonWilde

Interesting, here is some print out from it working for me:

In [7]: experiments
Out[7]: 
                         train_config                                 model_config
0  {'num_epochs': 30, 'patience': 10}  {'num_layers': 3, 'layer_dim': 256, 'i': 0}
1  {'num_epochs': 30, 'patience': 10}  {'num_layers': 3, 'layer_dim': 256, 'i': 1}
2  {'num_epochs': 30, 'patience': 10}  {'num_layers': 3, 'layer_dim': 256, 'i': 2}
3  {'num_epochs': 30, 'patience': 10}  {'num_layers': 3, 'layer_dim': 256, 'i': 3}
4  {'num_epochs': 30, 'patience': 10}  {'num_layers': 3, 'layer_dim': 256, 'i': 4}

In [8]: train_configs = experiments["train_config"].apply(pd.Series)
   ...: model_configs = experiments["model_config"].apply(pd.Series)
   ...: experiments = experiments.drop(columns=["train_config", "model_config"]).join(train_configs).join(model_configs)
<ipython-input-8-e8695b120768>:1: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  train_configs = experiments["train_config"].apply(pd.Series)
<ipython-input-8-e8695b120768>:2: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  model_configs = experiments["model_config"].apply(pd.Series)

In [9]: experiments
Out[9]: 
   num_epochs  patience  num_layers  layer_dim  i
0          30        10           3        256  0
1          30        10           3        256  1
2          30        10           3        256  2
3          30        10           3        256  3
4          30        10           3        256  4

(Including the aforementioned warnings), it does seem to work fine on multiple rows here?

@rhshadrach
Member

I see now; I was thinking you wanted to take the non-duplicate keys and make each of them a column regardless of the row they occurred on. Now it makes sense what you're going for.

@phofl
Member

phofl commented Sep 8, 2023

Unpacking dictionaries is common enough that there should be an easy way of doing this. Not necessarily apply, but also not the concat solution above; that's ugly, especially since pandas doesn't work very well with nested data.

@HarrisonWilde

@phofl agreed, I know there is a json_normalize function, perhaps there could also be a dict_normalize function or something for this specific purpose?

@rhshadrach
Member

Unpacking dictionaries is common enough that there should be an easy way of doing this.

What are users doing that running into this is common? Assuming it is common, I'm in favor of reverting the deprecation temporarily in 2.1.1 and adding this functionality - then reintroducing the deprecation.

cc @topper-123 for any thoughts.

@phofl
Member

phofl commented Sep 9, 2023

What are users doing that running into this is common? Assuming it is common, I'm in favor of reverting the deprecation temporarily in 2.1.1 and adding this functionality - then reintroducing the deprecation.

Yep strong +1

I think this holds for nested data generally (I've also run into lists in the past, but way less common than dicts I think).

@topper-123
Contributor Author

The docstring for json_normalize says that the input can be a dict or a list of dicts, so we can do:

>>> experiments = pd.DataFrame([{"train_config": {"num_epochs": 30, "patience": 10}, "model_config": {"num_layers": 3, "layer_dim": 256}}])
>>> pd.json_normalize(experiments.to_dict())
   train_config.0.num_epochs  train_config.0.patience  model_config.0.num_layers  model_config.0.layer_dim
0                         30                       10                          3                       256
>>> pd.json_normalize(experiments["train_config"])  # single series
   num_epochs  patience
0          30        10

Does json_normalize fit your use case, @HarrisonWilde?

@phofl
Member

phofl commented Sep 11, 2023

This does not work for other nested data

@topper-123
Contributor Author

But json_normalize is the function we currently use to extract nested data. Do you want to implement a new function for nested data in parallel to json_normalize?

@phofl
Member

phofl commented Sep 11, 2023

I don't have a good answer for this yet.

My reasoning is: I don't want to kill valid use cases with this deprecation without providing an alternative. The main reason for this deprecation was that there are more efficient ways of doing things. This seems to be incorrect for a bunch of cases that have valid uses.

@topper-123
Contributor Author

I propose we wait until @HarrisonWilde comments on whether json_normalize works for his use cases.

Just a reminder: In #54747 the current consensus (in my reading of the discussion) is to deprecate the apply method. In that case, this will be deprecated anyway together with the apply method.

@phofl
Member

phofl commented Sep 11, 2023

My concerns are about nested data in general, not only dicts

FWIW I am not on board with deprecating apply. That's a longer time horizon even if that happens anyway.

@HarrisonWilde

This does work for my use case provided there is a way to remove the dotting / concatenation in the automatic column naming. Presumably there is; I haven't had a chance to properly look at this yet.
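One way to avoid the dotted column names, sketched under the assumption that the nested dicts are flat (one level deep): normalize each dict column separately, so json_normalize never sees the outer column name. The helper unpack_dict_columns and the run column are hypothetical, not part of pandas or of the data in this thread:

```python
import pandas as pd


def unpack_dict_columns(df, columns):
    """Replace each dict-valued column with one column per dict key."""
    parts = [df.drop(columns=columns)]
    for col in columns:
        # Passing a plain list of dicts keeps the keys as column names,
        # with no "train_config." style prefix.
        flat = pd.json_normalize(df[col].tolist())
        # json_normalize on a list returns a default RangeIndex,
        # so restore the original index before joining.
        flat.index = df.index
        parts.append(flat)
    return pd.concat(parts, axis=1)


experiments = pd.DataFrame(
    [
        {
            "run": f"run{i}",  # illustrative extra column
            "train_config": {"num_epochs": 30, "patience": 10},
            "model_config": {"num_layers": 3, "layer_dim": 256, "i": i},
        }
        for i in range(3)
    ],
    index=[10, 11, 12],
)

flat = unpack_dict_columns(experiments, ["train_config", "model_config"])
print(flat)
```

If two dict columns share a key, the result will contain duplicate column names, so a per-column prefix may still be needed in that case.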

@mrgransky

The docstring for json_normalize says that the input can be a dict or a list of dicts, so we can do:

>>> experiments = pd.DataFrame([{"train_config": {"num_epochs": 30, "patience": 10}, "model_config": {"num_layers": 3, "layer_dim": 256}}])
>>> pd.json_normalize(experiments.to_dict())
   train_config.0.num_epochs  train_config.0.patience  model_config.0.num_layers  model_config.0.layer_dim
0                         30                       10                          3                       256
>>> pd.json_normalize(experiments["train_config"])  # single series
   num_epochs  patience
0          30        10

Does json_normalize fit your use case, @HarrisonWilde?

Thanks, this does not return the following warning:

FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.

I tried 3 approaches as follows:

# Approach 1 -> FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
# new_df = df.set_index("col")["nested_dict"].apply(pd.Series).astype("float32")
# Approach 2 -> FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
# new_df = df.set_index("col")["nested_dict"].apply(lambda x: pd.Series(x, dtype="object")).astype("float32")
# Approach 3 -> works without any warning using pandas 2.1.0
new_df = pd.json_normalize(df["nested_dict"]).set_index(df["col"]).astype("float32")

@hansalemaos

hansalemaos commented Oct 12, 2023

What a pity, it's a great functionality.
This worked for me; however, it is anything but beautiful.

s = pd.Series([(1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3, 4)])

s.to_frame().apply(lambda x: x[s.name if s.name else 0], axis=1, result_type="expand")
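For a Series of equal-length tuples, a possibly simpler alternative (a sketch, not something proposed in this thread for tuples specifically) is to hand the underlying list to the DataFrame constructor, in the same way as the values.tolist() approach used for dicts earlier:

```python
import pandas as pd

# Illustrative data: two tuples of equal length with a custom index.
s = pd.Series([(1, 2, 3, 4), (5, 6, 7, 8)], index=["a", "b"])

# Each tuple becomes a row; columns get default integer labels 0..3.
expanded = pd.DataFrame(s.tolist(), index=s.index)
print(expanded)
```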

 

@phofl
Member

phofl commented Oct 12, 2023

The deprecation was reverted in 2.1.1

6 participants