DEPR: Deprecate returning a DataFrame in Series.apply #52116
Comments
Could I ask, in the following situation, what should I do instead of:
train_configs = experiments["train_config"].apply(pd.Series)
model_configs = experiments["model_config"].apply(pd.Series)
experiments = experiments.drop(columns=["train_config", "model_config"]).join(train_configs).join(model_configs) |
You don’t show the contents of those columns, nor the intended output, so it is difficult to help. However, from the looks of it, you should be using ‘experiments["train_config"].map(pd.Series)’ etc. If not, you need to give more information about the data and intended output, so I can better assess the need. |
Apologies, I meant to attach the message below too: When I run the code above, I see these warnings:
This can be reproduced with version 2.1.0 of pandas:
experiments = pd.DataFrame([{"train_config": {"num_epochs": 30, "patience": 10}, "model_config": {"num_layers": 3, "layer_dim": 256}}])
A lot of the discourse online implies this is the "standard" approach to unpacking dictionaries in columns. I am now using this method, but wondered if there was a recommended workflow in light of the deprecation:
experiments = experiments.join(
pd.DataFrame(experiments.pop("train_config").values.tolist(), index=experiments.index)
)
experiments = experiments.join(
pd.DataFrame(experiments.pop("model_config").values.tolist(), index=experiments.index)
) |
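The workaround above can be wrapped in a small helper so it applies to any number of dict-valued columns. This is a minimal sketch, not a pandas API: the function name `unpack_dict_columns`, the second row of data, and the `run_a`/`run_b` index labels are assumptions made up for illustration.

```python
import pandas as pd


def unpack_dict_columns(df, columns):
    """Hypothetical helper: expand each dict-valued column into one column per key."""
    out = df.drop(columns=columns)
    for col in columns:
        # Build the expanded frame from a plain list of dicts and reuse the
        # original index so the join lines up row by row.
        expanded = pd.DataFrame(df[col].tolist(), index=df.index)
        out = out.join(expanded)
    return out


# Example data: the first row is from the thread, the second is invented.
experiments = pd.DataFrame(
    [
        {
            "train_config": {"num_epochs": 30, "patience": 10},
            "model_config": {"num_layers": 3, "layer_dim": 256},
        },
        {
            "train_config": {"num_epochs": 50, "patience": 5},
            "model_config": {"num_layers": 2, "layer_dim": 128},
        },
    ],
    index=["run_a", "run_b"],
)

flat = unpack_dict_columns(experiments, ["train_config", "model_config"])
print(flat)
```

Note that `join` raises if two dict columns share a key, so overlapping key names would need a suffix or a rename first.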
The way I'd do it:
>>> import pandas as pd
>>> experiments = pd.DataFrame([{
...     "train_config": {"num_epochs": 30, "patience": 10},
...     "model_config": {"num_layers": 3, "layer_dim": 256},
... }])
>>> train_config = pd.concat(experiments["train_config"].map(pd.Series).values)
and something similar to |
@HarrisonWilde - can your input only be a single row? I get odd results if I try your method on multiple rows. |
@rhshadrach if by my method you mean:
experiments = experiments.join(
pd.DataFrame(experiments.pop("train_config").values.tolist(), index=experiments.index)
)
experiments = experiments.join(
pd.DataFrame(experiments.pop("model_config").values.tolist(), index=experiments.index)
)
then yes, I have multiple rows and this seems to work fine. What odd results do you see? I suspect there might be a neater way. The method @topper-123 mentions works for me too in that it avoids the call to |
I meant the original method: #52116 (comment) |
Interesting, here is some print out from it working for me:
(Including the aforementioned warnings), it does seem to work fine on multiple rows here? |
I see now; I was thinking you wanted to take non-duplicate keys and make each of them a column regardless of the row they occurred on. Now it makes sense what you're going for. |
Unpacking dictionaries is common enough that there should be an easy way of doing this. Not necessarily apply, but also not the concat solution above, that's ugly, especially since pandas doesn't work very well with nested data |
@phofl agreed, I know there is a |
What are users doing that makes running into this common? Assuming it is common, I'm in favor of reverting the deprecation temporarily in 2.1.1 and adding this functionality - then reintroducing the deprecation. cc @topper-123 for any thoughts. |
Yep strong +1 I think this holds for nested data generally (I've also run into lists in the past, but way less common than dicts I think). |
The doc string for >>> experiments = pd.DataFrame([{"train_config": {"num_epochs": 30, "patience": 10}, "model_config": {"num_layers": 3, "layer_dim": 256}}])
>>> pd.json_normalize(experiments.to_dict())
train_config.0.num_epochs train_config.0.patience model_config.0.num_layers model_config.0.layer_dim
0 30 10 3 256
>>> pd.json_normalize(experiments["train_config"])  # single series
num_epochs patience
0 30 10 Does |
This does not work for other nested data |
But json_normalize is the function we currently use to extract nested data. Do you want to implement a new function for nested data in parallel to json_normalize? |
I don't have a good answer for this yet. My reasoning is: I don't want to kill valid use-cases with this deprecation without providing an alternative. The main reason for this deprecation was that there are more efficient ways of doing things. This seems to be incorrect for a bunch of cases that have valid uses. |
I propose we wait until @HarrisonWilde comments on whether json_normalize works for his use cases. Just a reminder: In #54747 the current consensus (in my reading of the discussion) is to deprecate the apply method. In that case, this will be deprecated anyway together with the apply method. |
My concerns are about nested data in general, not only dicts. FWIW, I am not on board with deprecating apply. That's a longer time horizon even if that happens anyway. |
This does work for my use case provided there is a way to remove the dotting / concatenation in the auto column-naming. Presumably there is, I haven't had a chance to properly look at this yet. |
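One possible way to avoid the dotted column names is to normalize each dict column separately and concatenate the results. This is only a sketch under assumptions: the second row of data is invented, and the approach relies on each column holding flat (non-nested) dicts, since deeper nesting would still produce dotted names.

```python
import pandas as pd

# Example data: the first row is from the thread, the second is invented.
experiments = pd.DataFrame(
    [
        {
            "train_config": {"num_epochs": 30, "patience": 10},
            "model_config": {"num_layers": 3, "layer_dim": 256},
        },
        {
            "train_config": {"num_epochs": 50, "patience": 5},
            "model_config": {"num_layers": 2, "layer_dim": 128},
        },
    ]
)

parts = []
for col in ["train_config", "model_config"]:
    # Normalizing one column at a time keeps the plain key names,
    # because there is no outer key to prefix them with.
    part = pd.json_normalize(experiments[col].tolist())
    # json_normalize returns a fresh default index; restore the original one.
    part.index = experiments.index
    parts.append(part)

flat = pd.concat(parts, axis=1)
print(flat)
```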
Thanks, this does not return the following warning:
I tried 3 approaches as follows:
|
What a pity, it's a great functionality.
|
The deprecation was reverted in 2.1.1 |
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Series.apply (more precisely SeriesApply.apply_standard) normally returns a Series, if supplied a callable, but if the callable returns a Series, it converts the returned array of Series into a DataFrame and returns that.

This approach is exceptionally slow and should be generally avoided. For example:

The result from the above method can in almost all cases be gotten faster by using other methods. For example, the above example can be written much faster as:

In the rare case where a fast approach isn't possible, users should just construct the DataFrame manually from the list of Series instead of relying on the Series.apply method.

Also, allowing this construct complicates the return type of SeriesApply.apply_standard and therefore the apply method. By removing the option to return a DataFrame, the signature of SeriesApply.apply_standard will be simplified into simply Series.

All in all, this behavior is slow and makes things complicated without any benefit, so I propose to deprecate returning a DataFrame from SeriesApply.apply_standard.
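To illustrate the two alternatives the proposal describes, here is a minimal sketch. The callable `to_row` and the column names are assumptions chosen for illustration; they are not the example from the original issue text, which is not preserved here.

```python
import pandas as pd

ser = pd.Series(range(1_000))


def to_row(x):
    # Illustrative callable (an assumption): returns one Series per element.
    return pd.Series({"x": x, "x_squared": x**2})


# The pattern under discussion would be ser.apply(to_row), which builds one
# Series per element and then assembles them into a DataFrame.

# Vectorized alternative: build the columns directly from the Series.
fast = pd.DataFrame({"x": ser, "x_squared": ser**2})

# Manual fallback when no vectorized form exists: collect the per-element
# Series in a list and construct the DataFrame explicitly.
rows = [to_row(x) for x in ser]
manual = pd.DataFrame(rows)
manual.index = ser.index
```

Both constructions give the same table; the vectorized form avoids creating a Series object for every element, which is where most of the cost of the per-element pattern comes from.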