Modin on Ray Cluster only running on head node for I/O Operation on TB Scale Parquet data from S3 #5652
Labels:
- Cloud ☁️ — Modin in the Cloud
- External — Pull requests and issues from people who do not regularly contribute to modin
- Memory 💾 — Issues related to memory
- question ❓ — Questions about Modin
Hi,
I have an S3 bucket that contains 500 parquet files (each around 120–140 MB in size). However, when I attempt to load this data into a Modin DataFrame using Ray, only three Ray "deploy" functions are created. As a result, the operation is not properly distributed across the Ray cluster, and only 3 CPUs are in use at any one time.
Additionally, because only three Ray functions are triggered and the cluster is not utilized properly, I have to scale up the head node to more than 1 TB of RAM to prevent out-of-memory errors, since the work only executes on that node. This isn't scalable: EC2 instances with more than 1 TB of RAM cost significantly more than a distributed cluster of multiple smaller instances. The instance I am forced to use (r6i.32xlarge) has 128 CPUs available, yet only three of them are used, so the data load takes upwards of 9 hours for a TB, even on a 50,000 Mbit connection.
I have attempted to run the following, and both only ever spawn 3 Ray functions:

```python
pd.concat([pd.read_parquet(key) for key in listed_files])
```

where `listed_files` is the list of file locations in S3, and:

```python
pd.read_parquet('file_location')
```
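For reference, here is a minimal sketch of the fan-out I was expecting across the 500 files. The bucket and key names are hypothetical, and a stub stands in for the actual `pd.read_parquet` call so the snippet runs without S3 access:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical S3 keys -- in practice these come from listing the bucket.
listed_files = [f"s3://my-bucket/data/part-{i:05d}.parquet" for i in range(500)]

def read_one(key):
    # Stub standing in for pd.read_parquet(key), so the sketch is runnable.
    return key

# The behavior I expected from Modin/Ray: overlap the reads of all 500
# objects across workers, rather than running three tasks at a time on
# the head node.
with ThreadPoolExecutor(max_workers=32) as pool:
    frames = list(pool.map(read_one, listed_files))

print(len(frames))  # 500
```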
I've attached screenshots of the dashboard as well as the query status.
<img width="457" alt="Screen Shot 2023-02-14 at 11 37 58 am" src="https://user-images.githubusercontent.com/55753338/218608795-3dbd0435-6aa8-4be3-8c18-000662e1751e.png">
<img width="1430" alt="Screen Shot 2023-02-14 at 11 38 36 am" src="https://user-images.githubusercontent.com/55753338/218608812-05dea67d-0ca5-480d-9652-6b7dfa90a314.png">