Modin on Ray cluster with object spilling: process was killed when loading parquet file #6639
Comments
Would any @modin-project/modin-core members have some insight here?
@zaichang thanks for the contribution. Did you see any warnings like …?
@anmyachev I checked my logs, and yes, there were these lines: …
It's confusing, though, because the only option I passed in is …
Besides, if S3 storage_options are not supported, I would expect it to fail before reading any data rather than with an out-of-memory error.
@zaichang it's strange, because …
@zaichang you are right. We will correct the documentation.
@zaichang "The Ray object store allocates 30% of host memory to the shared memory (/dev/shm, unless you specify --object-store-memory)." When Modin starts a Ray local cluster itself, it uses 60% of the total memory. I advise you to also increase the size of the shared memory via …
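A minimal sketch of setting that object-store knob when starting Ray yourself (the 32 GiB figure is illustrative, not a recommendation):

```python
import ray

# Ask Ray for a larger object store up front (value in bytes).
# On Linux the object store is backed by /dev/shm, so the shared-memory
# mount must be at least this large; 32 GiB here is an illustrative value.
ray.init(object_store_memory=32 * 1024**3)
```

On a cluster the equivalent CLI flag is `ray start --object-store-memory=<bytes>` on each node.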
@anmyachev Thank you very much for getting back to me. I'll try adding … The logs that mentioned … Is the situation any different with a Dask cluster?
@zaichang Most likely, Ray's spilling will not help when the size of the dataframe is close to the total size of RAM and there are operations in the workload that completely materialize the dataframe in one process. This mechanism works much better when there are many small objects whose total memory consumption exceeds the available memory, or when there are no operations in the workload that are performed entirely in pandas (i.e., that materialize the whole dataframe in one process).
I'm not sure, but it seems the situation there cannot be better than in Ray, because object spilling occurs when there is an attempt to write data to a storage that does not have enough memory (in the storage). Here the problem is insufficient available memory (RAM) at the time of object materialization in the process.
@zaichang I'm running into a similar issue with Ray, Modin, Dask, and Dask Distributed. I've found that these problems only occur with parquet files (a single 52 GB CSV file poses no problem), so I'm wondering if there's something weird with pyarrow. I'm looking into getting Modin to work with fastparquet, but so far it doesn't quite work out of the box. I'd love to chat more about potential solutions and potential sources of the problem; my email is in my GitHub profile.
2023-12-29: Also experiencing out-of-core not working for CSV and parquet on 32 GB RAM using modin[dask], modin[ray], and modin[unidist] (MPI).
@zaichang One thing that still bothers me is that when reading parquet it says that parquet options are not supported. Are you still seeing this message? Do you specify in the reproducer all the parameters you used for …?

@seydar It is very interesting that such a large CSV file works but the parquet file does not. If your setup is different from the one in this issue, could you create a separate one? We would investigate this issue.

@chriscalip Could you also create a separate issue with all the details that you are able to provide for us?
Just to note, I am not clear: …
@anmyachev Thanks for checking in. I recently re-checked my test and indeed I am not seeing the … I've corrected the above in my tests, but the main issue as described in this ticket is still reproducible using a 16GB machine trying to read either a 10GB parquet file or CSV.

@chriscalip You need to give Modin+Ray enough headroom when ingesting data; I think a rough heuristic is that you most likely need more than double the RAM of your dataset size.

Dependencies used (on 2024-01-30):
Log: …
@zaichang It looks like you and I are working on similar problems; I'd love to chat more if you're interested in trading notes. My email is ari [at] aribrown [period] com.
@zaichang thanks for the information! Sometimes objects in memory take up much more space than on disk; I'm wondering if this is the case here. Could you check this using pure pandas and the pandas.DataFrame.memory_usage method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html)? BTW, do you use the Ray dashboard? If so, could you look at the load distribution between the nodes? Of interest is whether all nodes are involved and how the load is distributed right before the memory error.
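For example, a minimal sketch of that check with a made-up frame (the object-dtype string column is exactly the kind of data that is small on disk but large in RAM):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: object-dtype strings
# compress well in parquet but are full Python objects in memory.
df = pd.DataFrame({
    "x": range(100_000),
    "label": ["some repeated text value"] * 100_000,
})

shallow = df.memory_usage().sum()        # counts only 8-byte pointers for "label"
deep = df.memory_usage(deep=True).sum()  # also counts the string payloads

print(f"shallow: {shallow / 1e6:.1f} MB, deep: {deep / 1e6:.1f} MB")
```

Comparing `deep` against the parquet file size on disk shows how much the dataset inflates once materialized.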
Hey guys, I am on the hunt for a solution to our OOM problems and stumbled onto this thread.
How does this not defeat the purpose? From the docs: …
We work with point clouds with hundreds of millions of rows, so a file might be 300GB. With your heuristic of needing twice the memory, we could load everything into memory and still have 300GB left. If we can fit everything in memory, we could just use pandas, right?
@Liquidmasl I concur.
While Modin on Ray is able to handle OOM datasets, that gets tricky when loading data from a file. The underlying execution engine (Ray in this case) is not aware of data that hasn't been put into the storage yet, so it doesn't know what data to spill and when. Once the data is in the storage, the engine can spill a portion of it to disk if there is not enough room for processing. You should take the size of the data into account when loading it from a file. If there are not enough resources in your system to load the data, you should consider loading it in portions, or moving to a machine that has enough resources to handle your data size.
This approach should work. |
Ingesting the data in 100 parts is now tough to do, because you'd have to use pyarrow to read in the file and then split it, and pyarrow suffers from the same bug IME. Maybe you could manually read in the row groups and then split it? (Not sure which library would support this.)
I suggest that while this bug is active, the claim "Out-of-memory data with Modin" (https://modin.readthedocs.io/en/stable/getting_started/why_modin/out_of_core.html) should be removed from the docs. Modin is currently unable to process data beyond the host computer's available memory.
So I tried that. I read in a PLY file 100 million rows at a time, storing the chunks in their own Modin dataframes, which I put into a list. But when I try to concatenate them using … I don't know what's going wrong; it's quite difficult to figure out as well because everything stops working. Anything I am doing wrong? I kinda really need to get this to work D:

Summary, and return to the original issue of the post (after several edits):

So there is some weird stuff happening here:
This works, remarkably fast even!

```python
import modin.pandas as pd
```

`large_df.to_parquet(path)` leads to a dead raylet. Unsure why, possibly a memory issue.

```python
import ray
import modin.pandas as pd
```

`large_df.to_parquet(path)` leads to... sometimes …

```python
import ray
ray.init()
import modin.pandas as pd
```

`large_df.to_parquet(path)` leads to two initialized Ray instances, but saving works (takes forever though)! (RAM is generally chill all the time.)

But now that I managed to save the parquet file...

Back to .parquet loading

I have saved the same data twice: …

Reading the single file with Modin's … This is obviously an issue. I will transfer this to its own issue as soon as I am sure my examples are correct; currently still testing (everything takes forever).
I did manage to load a dataset now. To make any of this work: loading the initial dataset in batches, making those batches into Modin dataframes, and concatenating them worked. After saving that to a bunch of parquet files, I did not manage to load them anymore: Modin's read_parquet() does not work, as it just fills up RAM and the pagefile until the computer dies or it crashes. What DID work was to just... iterate over the .parquet files in the .parquet folder, load them one by one into their own dataframes, and then again concat them together at the end, similar to how I initially loaded the dataset.

I just found that a partitioning column can be used with to_parquet(), and then Ray's read_parquet() can load it again.
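A sketch of that per-file workaround, assuming to_parquet() produced a directory of part files (the path and glob pattern are hypothetical, and this is untested at this data scale):

```python
from pathlib import Path

import modin.pandas as pd

# Hypothetical directory produced by to_parquet(). Each part file is
# loaded into its own Modin dataframe, then everything is concatenated
# once at the end, instead of pointing read_parquet at the directory.
parts_dir = Path("large_dataset.parquet")
parts = [pd.read_parquet(p) for p in sorted(parts_dir.glob("*.parquet"))]
large_df = pd.concat(parts, ignore_index=True)
```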
Modin version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
After enabling object spilling, we ran a Ray cluster of 4 nodes, each with varying instance RAM as detailed below, and attempted to read a 10GB parquet file, only to find that workers get killed and often the kernel also dies, without ever completing the read_parquet operation.

As a side note: given the out-of-the-box claim that "By default, Modin leverages out-of-core methods to handle datasets that don’t fit in memory for both Ray and Dask engines." (https://modin.readthedocs.io/en/latest/getting_started/why_modin/out_of_core.html), it was strange to then find a note on how to enable object spilling in multi-node clusters (https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html#cluster-mode) without any mention that spilling is actually disabled by default there.
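For reference, the Ray documentation linked above shows enabling filesystem spilling explicitly on the head node via `_system_config`; a minimal sketch (the spill directory path is illustrative):

```python
import json

import ray

# Spill objects to local disk when the object store fills up.
# _system_config must be passed on the head node before any tasks run;
# /tmp/ray_spill is an illustrative directory, not a recommendation.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
        )
    }
)
```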
Our test used Ray as the engine and tried a few configurations; only the 64GB cluster succeeded in loading the 10GB file.
Expected Behavior
With object spilling enabled, I expected the cluster to complete reading the large parquet file even if the dataset does not fit in an instance's memory, and was surprised to find that even 32GB per instance did not work.
Error Logs
Installed Versions
INSTALLED VERSIONS
commit : 4c01f64
python : 3.9.18.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1031-aws
Version : #35-Ubuntu SMP Fri Feb 10 02:07:18 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
Modin dependencies
modin : 0.24.1
ray : 2.7.0
dask : None
distributed : None
hdk : None
pandas dependencies
pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.4.0
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.0
pyreadstat : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.10.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None