Modin on Ray Cluster only running on head node for I/O Operation on TB Scale Parquet data from S3 #5652
Labels:
- Cloud ☁️ — Modin in the Cloud
- External — Pull requests and issues from people who do not regularly contribute to modin
- Memory 💾 — Issues related to memory
- question ❓ — Questions about Modin
Hi,
I have an S3 bucket that contains 500 parquet files (each around 120–140 MB in size). However, when I attempt to load this data into a Modin DataFrame using Ray, only three Ray "deploy" functions are created. As a result, the operation is not properly distributed across the Ray cluster, and only 3 CPUs are in use at any one time.
Additionally, because only three Ray functions are triggered and the cluster is not utilized properly, I have to scale up the head node to more than 1 TB of RAM to prevent out-of-memory errors, since the work only executes on that node. This isn't scalable: EC2 instances with more than 1 TB of RAM cost significantly more than a distributed cluster of multiple smaller instances. The instance I am forced to use (r6i.32xlarge) has 128 CPUs available, yet only three of them are used, so the data load takes upwards of 9 hours for a TB, even on a 50,000 Mbit connection.
I have attempted to run the following, and both only ever spawn 3 Ray functions:

```python
pd.concat([pd.read_parquet(key) for key in listed_files])
```

where `listed_files` is the list of file locations in S3, and:

```python
pd.read_parquet('file_location')
```
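For reference, here is a minimal sketch of the fan-out I was expecting across the 500 files. The bucket and key names are hypothetical, and a stub stands in for the actual `pd.read_parquet` call so the snippet runs without S3 access:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical S3 keys -- in practice these come from listing the bucket.
listed_files = [f"s3://my-bucket/data/part-{i:05d}.parquet" for i in range(500)]

def read_one(key):
    # Stub standing in for pd.read_parquet(key), so the sketch is runnable.
    return key

# The behavior I expected from Modin/Ray: overlap the reads of all 500
# objects across workers, rather than running three tasks at a time on
# the head node.
with ThreadPoolExecutor(max_workers=32) as pool:
    frames = list(pool.map(read_one, listed_files))

print(len(frames))  # 500
```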
I've attached screenshots of the dashboard as well as the query status.
<img width="457" alt="Screen Shot 2023-02-14 at 11 37 58 am" src="https://user-images.githubusercontent.com/55753338/218608795-3dbd0435-6aa8-4be3-8c18-000662e1751e.png">
<img width="1430" alt="Screen Shot 2023-02-14 at 11 38 36 am" src="https://user-images.githubusercontent.com/55753338/218608812-05dea67d-0ca5-480d-9652-6b7dfa90a314.png">