BUG: df.astype alters df to let df.resample(...).aggregate(...) fail with KeyError #5572
Comments
This minimal example even shows this in
Hmm, I don't see that error on modin master...
@Garra1980 I still get the error even with the latest commit: 21ab814e2f9fd9e4874a036e2fa8e53208638614 from yesterday. Do you have different dependencies installed?
I do see the issue for 0.18, but the test works when I switch to env with
@Garra1980 Thank you for your packages. Please adjust the last line of the example code to: print(resampler.aggregate(aggregation)). I can still reproduce the error with this adjustment. Please check whether that is also the case for you. I think it may have something to do with lazy evaluation?
Thanks, now I see it. Yes, lazy execution does the trick here. I tried printing at the very beginning of error checking and thought print(resampler) would be enough, but apparently not.
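To illustrate why the error may only surface once the result is printed, here is a minimal pure-Python sketch of deferred evaluation. The `Deferred` class and the `aggregate` helper are hypothetical names invented for this illustration; they are not Modin's API and do not claim to mirror its internals.

```python
class Deferred:
    """Hypothetical stand-in for a lazily evaluated result (not a Modin class)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable, run only on demand

    def materialize(self):
        return self._compute()

    def __repr__(self):
        # Printing forces the computation, much like
        # print(resampler.aggregate(aggregation)) in the thread.
        return repr(self.materialize())


def aggregate(columns, column_name):
    """Return a Deferred; the column lookup happens only at materialization."""
    return Deferred(lambda: sum(columns[column_name]))


columns = {"a": [1.0, 1.0, 1.0]}  # note: there is no column "b"
result = aggregate(columns, "b")   # no error yet, nothing has been computed

try:
    print(result)                  # forces evaluation, so the KeyError appears here
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError raised on materialization:", exc)
```

Under these assumptions, building the result succeeds and the missing-column lookup only fails when something forces the value, which matches the observation that the error appears once the aggregate itself is printed.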
I dug into this a bit and I think this might be a partitioning issue. When do the
In [42]: df._query_compiler._modin_frame._partitions
Out[42]:
array([[<modin.core.execution.ray.implementations.pandas_on_ray.partitioning.partition.PandasOnRayDataframePartition object at 0x1c29b1fa0>,
<modin.core.execution.ray.implementations.pandas_on_ray.partitioning.partition.PandasOnRayDataframePartition object at 0x1c29b1e80>]],
dtype=object)
In [43]: df._query_compiler._modin_frame._partitions[0][0].get()
Out[43]:
a
time
2001-01-02 04:56:00 1.0
2001-01-02 04:57:00 1.0
2001-01-02 04:58:00 1.0
2001-01-02 04:59:00 1.0
2001-01-02 05:00:00 1.0
In [44]: df._query_compiler._modin_frame._partitions[0][1].get()
Out[44]:
b
time
2001-01-02 04:56:00 1
2001-01-02 04:57:00 1
2001-01-02 04:58:00 1
2001-01-02 04:59:00 1
2001-01-02 05:00:00 1
The resample aggregate seems to be applying the function to each partition and is not able to find column 'b' in the first partition, hence the KeyError. Without the
cc: @modin-project/modin-core
@pyrito thanks for posting that explanation. So resample aggregate parallelizes over column-wise partitions, but that doesn't work for an aggregation that is column-dependent. I think there's a fairly simple fix of 1) adding a special case here for
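The partitioning explanation above can be sketched in plain Python, without Modin or pandas. Each partition is modeled as a dict of column name to values, and all function names here are hypothetical. The column-aware variant is only one illustrative way to avoid the error and is not necessarily the fix proposed in the thread.

```python
def aggregate_partition(partition, aggregation):
    """Apply {column: function} to one partition (a dict of column -> values)."""
    return {col: func(partition[col]) for col, func in aggregation.items()}


def naive_aggregate(partitions, aggregation):
    """Send the full aggregation dict to every column-wise partition."""
    return [aggregate_partition(p, aggregation) for p in partitions]


def column_aware_aggregate(partitions, aggregation):
    """Send each partition only the aggregation entries for columns it holds."""
    result = {}
    for p in partitions:
        local = {col: f for col, f in aggregation.items() if col in p}
        result.update(aggregate_partition(p, local))
    return result


# Two column-wise partitions, mirroring the layout shown in the thread:
# the first holds only "a" (float), the second only "b" (int).
partitions = [{"a": [1.0] * 5}, {"b": [1] * 5}]
aggregation = {"b": sum}

try:
    naive_aggregate(partitions, aggregation)
    naive_error = None
except KeyError as exc:
    naive_error = exc.args[0]  # name of the column that was not found

fixed = column_aware_aggregate(partitions, aggregation)
print(naive_error, fixed)
```

The naive version fails on the first partition because it looks up "b" where only "a" exists, while the column-aware version skips partitions that do not hold the requested column.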
Modin version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
TL;DR
Using astype to change a column's dtype makes the aggregation fail with a KeyError.
After changing the dtype of a df's column (which can be very important for huge dfs; in this example to float64, but it could be any dtype), aggregation over time raises a KeyError saying that column 'b', the column being aggregated, is not found.
This is, of course, a minimal example, but it shows the issue very well.
WORKAROUND
Converting the df to pandas and back to modin.pandas solves the issue, but the round-tripped df compares equal to the original df, which makes it hard for me to find any difference.
Expected Behavior
The aggregation step should not raise a KeyError, just as it does not in vanilla pandas.
Error Logs
Installed Versions
On master's latest commit
Modin dependencies
modin : 0.18.0+50.g3c997914
ray : 2.1.0
dask : 2023.1.0
distributed : 2023.1.0
hdk : None
pandas dependencies
pandas : 1.5.3
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 21.1.3
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.5.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.3.24
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
With modin 0.17.1
Modin dependencies
modin : 0.17.1
ray : 1.12.1
dask : None
distributed : None
hdk : None
pandas dependencies
pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.5.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.3.24
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None