Description
I'm seeing a very strange bug with one specific dataset when I use pandas' rolling.std function on it.
The setup is as follows:
- I have a dataframe with 1 column stored in float32 format
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high 2186 non-null float32
dtypes: float32(1)
memory usage: 25.6 KB
ddf.head()
Here is a plot of the full timeseries (left column) along with a tail of 200 rows (right column)
Note the scale of the numbers here: the values range from 1e8 down to 1e1.
Next I compute the rolling standard deviations using rolling means, as per:
rs = np.sqrt((ddf.high ** 2).rolling(10).mean() - (ddf.high.rolling(10).mean() ) ** 2)
This is what the rolling std computed this way looks like (it matches what I would expect):
But if I use
rs = ddf.rolling(10).high.std()
this is what I get:
Something has gotten corrupted, as we can see in the tail of 200 rows on the right.
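When comparing the two computations above, it may help to keep in mind that they do not use the same definition: the rolling-means formula gives the population standard deviation (ddof=0), while rolling.std() defaults to the sample standard deviation (ddof=1). For a window of 10 the two differ by a constant factor of sqrt(10/9), so this alone would not explain the corrupted tail. A minimal sketch on stand-in data (the series `s` below is hypothetical, not the real `ddf.high`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the original `ddf.high` is not available here.
s = pd.Series(np.arange(1.0, 31.0) ** 2)
window = 10

# The rolling-means formula from above: sqrt(E[x^2] - E[x]^2), i.e. population std.
via_means = np.sqrt((s ** 2).rolling(window).mean() - s.rolling(window).mean() ** 2)

# rolling.std() defaults to ddof=1 (sample std); ddof=0 is the matching definition.
std_ddof0 = s.rolling(window).std(ddof=0)
std_ddof1 = s.rolling(window).std()

# For a fixed window the two definitions differ only by a constant factor.
factor = np.sqrt(window / (window - 1))
print(np.allclose(via_means.dropna(), std_ddof0.dropna()))
print(np.allclose(std_ddof1.dropna(), std_ddof0.dropna() * factor))
```

On well-behaved data like this, both checks should agree; any larger disagreement on the real data would point at something other than the ddof convention.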
However, if I rescale the data so the numbers sit in a smaller range, compute the rolling std, and scale the result back up:
ddf = ddf.assign( high_rescaled=ddf.high / 1e8 )
rs = ddf.rolling(10).high_rescaled.std() * 1e8
This is what I get:
which matches the output computed using rolling means!
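One way to get an independent reference for these comparisons is to recompute the standard deviation from scratch for every window, so that values which have already left the window cannot influence the result. The sketch below does this with rolling.apply and np.std. The helper name `reference_rolling_std` and the synthetic series are assumptions made for illustration; the synthetic data only loosely imitates the 1e8 to 1e1 range described above, and whether it shows any discrepancy will depend on the pandas version.

```python
import numpy as np
import pandas as pd

def reference_rolling_std(series, window):
    """Sample std (ddof=1) recomputed independently for every window."""
    # raw=True passes each window as a NumPy array; np.std is evaluated on
    # that window alone, so earlier values cannot affect the result.
    return series.astype(np.float64).rolling(window).apply(
        np.std, raw=True, kwargs={"ddof": 1}
    )

# Hypothetical data: large values followed by small ones (not the real dataset).
rng = np.random.RandomState(0)
values = np.concatenate([1e8 + 1e6 * rng.rand(50), 10.0 + rng.rand(50)])
s = pd.Series(values)

builtin = s.rolling(10).std()
reference = reference_rolling_std(s, 10)

# Largest relative disagreement over the tail, where every window
# contains only the small values.
tail_rel_diff = ((builtin - reference).abs() / reference).tail(30).max()
print(tail_rel_diff)
```

This is slower than the built-in method, so it is meant as a cross-check rather than a replacement.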
Note that the original data was in np.float32 format, so I thought this bug might be caused by overflow (which really should not happen, since the window is only 10 long).
So I converted the data to float64 to test this:
ddf.high = ddf.high.astype(np.float64)
ddf.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 2010-11-30 16:00:00 to 2019-08-16 16:00:00
Data columns (total 1 columns):
high 2186 non-null float64
dtypes: float64(1)
memory usage: 34.2 KB
In this case, even applying rolling.std() to the rescaled version of the data (which worked for float32) gives broken results!
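On the overflow question raised above, a quick check using only NumPy's own float32 limits suggests that overflow is unlikely at this scale, while limited precision is a more plausible concern. This is only a sketch of the number format's limits and does not by itself explain the rolling.std output:

```python
import numpy as np

f32 = np.finfo(np.float32)
print(f32.max)  # largest finite float32, about 3.4e38
print(f32.eps)  # spacing of float32 values near 1.0, about 1.19e-07

# Squaring a value of order 1e8 gives about 1e16, far below the overflow limit.
x = np.float32(1e8)
squared = x * x
print(np.isfinite(squared))

# Precision is the tighter constraint: near 1e8, adjacent float32 values
# are 8 apart, so adding 1 is lost entirely.
print(np.spacing(x))
print(x + np.float32(1.0) == x)
```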
Minimum Reproducible Example
Generating an MRE is the tricky part.
The bug seems to depend on the numerics specific to this dataset, and I cannot reproduce it using random data (since I don't know which feature of the numerics is causing it).
If I save the data as a pickle file (using df.to_pickle) and load it back in, I can reproduce these results exactly.
However, if I save it as a csv file (for sharing here) and load it back in, things get even worse: after this round trip the results look really bad in all cases. This seems to indicate that something about the exact numbers in the dataset is triggering numerical problems in rolling.std.
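One possible contributor to the pickle versus csv difference, offered as an assumption rather than a diagnosis: a pickle preserves the float32 dtype and the DatetimeIndex, whereas read_csv without extra arguments returns float64 values and a plain integer index. The sketch below uses a small hypothetical frame (not the real data) and an in-memory buffer to show the difference and one way to restore the original dtype and index after loading:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real data
# (values chosen to be exactly representable in float32).
original = pd.DataFrame(
    {"high": np.array([1.5e8, 2.5e7, 12.25, 10.5], dtype=np.float32)},
    index=pd.date_range("2010-11-30 16:00", periods=4, freq="D"),
)

buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# Without hints, read_csv yields float64 and a default integer index.
naive = pd.read_csv(buffer)
print(naive.dtypes)

buffer.seek(0)
# Restore the index and dtype explicitly so the frame matches what was pickled.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
restored["high"] = restored["high"].astype(np.float32)
print(restored.dtypes)
```

If the csv-loaded data behaves differently from the pickled data, comparing `dtypes` and the index type of the two frames is a cheap first check.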
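from pylab import *
import numpy as np
import pandas as pd
ddf = pd.read_csv( '/path/to/rs.csv')
# note: read_csv infers float64 by default, so the column is not float32 here
# the next line is a no-op as written
ddf.high = ddf.high
# info() prints its summary directly and returns None
ddf.info()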
figure()
subplot(121)
plot( ddf.high, '-r' )
subplot(122)
plot( ddf.high.tail(200), '-r' )
gcf().suptitle( 'original data' )
figure()
subplot(121)
rs = np.sqrt( (ddf.high ** 2).rolling(10).mean() - (ddf.high.rolling(10).mean() ) ** 2 )
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed using rolling means on original data' )
figure()
subplot(121)
rs = ddf.rolling(10).high.std()
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed on original data' )
figure()
subplot(121)
ddf = ddf.assign( high_rescaled=ddf.high / 1e8 )
rs = ddf.rolling(10).high_rescaled.std() * 1e8
plot( rs, '-b' )
subplot(122)
plot( rs.tail(200), '-b' )
gcf().suptitle( 'rolling std computed on rescaled data' )
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-9-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: None
pip: 19.2.3
setuptools: 41.2.0
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: 0.13.0
xarray: 0.13.0+6.g4617e68b
IPython: 7.8.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.4.1
bs4: 4.8.0
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.3.0