-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add option to use absolute value in pct_change #56618
Comments
Thanks for the report. Taking the absolute value is common but not universal. It also gives counter-intuitive results - e.g. going from -20 to 20 is a smaller percent change than going from -10 to 20 (200% vs 300%). In my opinion, one should not think of "percent change" as a measure how much something is getting more positive, but rather how much it's moving away from zero. Positive percent change means it's moving more away from zero, negative percent change means it's moving toward (and possibly crossing) 0. With this interpretation, not taking absolute value is correct. Since it's fairly common to take the absolute value however, I wouldn't be opposed to adding an argument. |
I've reworked this issue as an enhancement request. |
can I work on this |
cc @pandas-dev/pandas-core - any thoughts on adding an argument here? |
isn't there an absolute function that can be applied to the result to return exactly the intent without the need for an argument? I dont believe percent change should be absolute by default and it feels more pythonic, to me, to manipulate the outcome with existing functions than adding a keyword |
@attack68 - we need to apply |
Adding an argument allows keeping the current behavior, which can also be non-intuitive: >>> import pandas as pd
>>> s=pd.Series([-40, -30, -20])
>>> s.pct_change()
0 NaN
1 -0.250000
2 -0.333333
dtype: float64 It seems odd to report -0.25 when the value increased by 25% One could also argue that computing percent change relative to a negative number is not well defined, and we should raise an exception. So maybe the extra argument should have 3 options:
Advantage of the third option is that it might help identify an error in your dataset or data pipeline. |
I don't think so. Ref: #56618 (comment) |
It probably depends on what you are measuring in the original series, and how you interpret the result. Your example was flipping the sign between 2 adjacent entries. My example is keeping the sign negative between two entries. It seems to me that a positive percent change means things increased, and a negative percent change means things decreased. In my example, seeing a -0.25 when the value increased seems counter-intuitive. And if the sign flipped, I'm not sure how to interpret the result (hence my suggestion to raise an exception) |
I don't think this is significant. In your example, we are going from -40 to -30. We agree going from 40 to 30 or 40 to 50 is a -25% and 25% change respectively. My claim here is that you should interpret the negative sign not as "getting more positive" or "getting more negative", but rather, moving toward 0. In this interpretation, going from -40 to -30 being a -25% change is not counter-intuitive.
It does - as long as you consider the "size of a number" to be its absolute value. For example, -100000 is a large number when compared to 0.1. |
Good point. No opinion either way then. |
Do you have a suggestion on what the new argument should be? Personally, between
I think I'd find the second one clearer |
Agree I would do the second by default and probably not even assume that a |
I think the fundamental issue here is that pct change doesn't make much sense when the sign of the a series is not consistent (and in almost all applications, strictly positive). I can see some real downsides of letting errors in code or data go unnoticed if I agree that the existence of |
I read through the discussions in the 3 links that @WillAyd provided. While I'd like us to consider either removing |
could just document |
I also thought removing it might be the way to go - trim down the library with a structured deprecation notice advising how best to replicate the behaviour. |
I don't think that helps the users discover when they should use the workaround, so that's why I also suggested raising a warning or exception if there are mixed signs in the series. |
The current behavior of |
Yes, but from the multiple people that have raised issues indicating they were confused with the result, that indicates the values are subject to interpretation, and there is even a debate on what is "mathematically correct". IMHO, if you have mixed signs in your series, it is a possible indication that you have an error in your data, and that the results may come out in an unexpected way. We could also add an argument such as |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
correct result for row 1 should be 76.647059 and not -76.647059. It seems the current implementation of pct_change() lacks the abs() function and thus returns the wrong sign when the base is negative
Expected Behavior
percentage_change = (new_value - old_value) / abs(old_value)
(5144 - (-68)) / abs(-68) = 76.647059
Installed Versions
INSTALLED VERSIONS
commit : a671b5a
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.133.1-microsoft-standard-WSL2
Version : #1 SMP Thu Oct 5 21:02:42 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.4
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.17.2
pandas_datareader : 0.10.0
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.1
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: