Modin on Ray Cluster only running on head node for I/O Operation on TB Scale Parquet data from S3 #5652

Open
jwaller59 opened this issue Feb 14, 2023 · 3 comments
Labels
Cloud ☁️ Modin in the Cloud External Pull requests and issues from people who do not regularly contribute to modin Memory 💾 Issues related to memory question ❓ Questions about Modin

Comments

@jwaller59

Hi,

I have an S3 bucket that contains 500 Parquet files (each around 120-140 MB in size). However, when I attempt to load this data into a Modin DataFrame using Ray, only three Ray deploy_functions are created. As a result, the operation is not properly distributed across the Ray cluster, and only 3 CPUs are in use at any one time.

Additionally, because only three Ray functions are triggered and the cluster is not used properly, I have to scale up the head node to more than 1 TB of RAM to prevent out-of-memory errors, since the commands are executed only on that node. This isn't scalable: EC2 instances with more than 1 TB of RAM cost significantly more than a distributed cluster of multiple smaller EC2 instances. The EC2 instance I have to use, an r6i.32xlarge, has 128 CPUs available, yet I am only able to use three of them, so the data load takes a significant amount of time (upwards of 9 hours for a TB) on a 50,000 Megabit connection.
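For scale, the figures above can be turned into a rough effective throughput. This is a back-of-the-envelope sketch, not a measurement: it assumes "a TB" means 10^12 bytes moved over the network, takes the load time as exactly 9 hours, and treats 50,000 Megabit as a per-second link rate. The function and variable names are illustrative only.

```python
def effective_mbit_per_s(total_bytes: float, seconds: float) -> float:
    """Average transfer rate in megabits per second."""
    return total_bytes * 8 / seconds / 1e6


# Assumed inputs, taken from the description above.
TOTAL_BYTES = 1e12         # "a TB", assumed to be 10**12 bytes
LOAD_SECONDS = 9 * 3600    # "upwards of 9 hours", taken as exactly 9
LINK_MBIT_PER_S = 50_000   # "50000 Megabit connection", assumed per second

observed = effective_mbit_per_s(TOTAL_BYTES, LOAD_SECONDS)
utilisation = observed / LINK_MBIT_PER_S
ideal_seconds = TOTAL_BYTES * 8 / (LINK_MBIT_PER_S * 1e6)

print(f"observed rate: {observed:.0f} Mbit/s")
print(f"link utilisation: {utilisation:.2%}")
print(f"time at full link rate: {ideal_seconds:.0f} s")
```

Under those assumptions the load averages roughly 247 Mbit/s, about 0.5% of the link, which would suggest the network itself is not the limiting factor.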

I have tried both of the following, and each only ever spawns 3 Ray functions:

pd.concat([pd.read_parquet(key) for key in listed_files]), where listed_files holds the file locations in S3

pd.read_parquet('file_location')
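One way to check whether the work can spread beyond the head node is to confirm what Ray and Modin see before calling read_parquet. The sketch below is only a diagnostic aid and assumes the script runs on a machine already attached to the cluster. `summarize_nodes` and `report_cluster` are hypothetical helper names; `ray.init(address="auto")`, `ray.nodes()` and `modin.config.NPartitions` are existing Ray and Modin calls.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, float]:
    """Count live nodes and their total CPUs from a ray.nodes()-style list."""
    alive = [n for n in nodes if n.get("Alive")]
    return {
        "alive_nodes": len(alive),
        "total_cpus": sum(n.get("Resources", {}).get("CPU", 0) for n in alive),
    }


def report_cluster() -> None:
    """Print what Ray and Modin currently see (hypothetical helper)."""
    # Imported here so the summary helper stays usable without a cluster.
    import ray
    import modin.config as cfg

    ray.init(address="auto")  # attach to the already running cluster
    print(summarize_nodes(ray.nodes()))
    # Value of Modin's partition setting at the time of the call.
    print("Modin NPartitions:", cfg.NPartitions.get())
```

Calling `report_cluster()` on the head node before the read shows whether the worker nodes are alive and how many partitions Modin is configured to create. It does not by itself explain why only three tasks are spawned.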

I've attached screenshots of the dashboard as well as the query status
<img width="457" alt="Screen Shot 2023-02-14 at 11 37 58 am" src="https://user-images.githubusercontent.com/55753338/218608795-3dbd0435-6aa8-4be3-8c18-000662e1751e.png">
<img width="1430" alt="Screen Shot 2023-02-14 at 11 38 36 am" src="https://user-images.githubusercontent.com/55753338/218608812-05dea67d-0ca5-480d-9652-6b7dfa90a314.png">
Screen Shot 2023-02-14 at 11 38 47 am

@jwaller59 jwaller59 added question ❓ Questions about Modin Triage 🩹 Issues that need triage labels Feb 14, 2023
@jwaller59 jwaller59 changed the title from "Modin on Ray Cluster only running on head node for I/O Operation on TB Parquet data" to "Modin on Ray Cluster only running on head node for I/O Operation on TB Scale Parquet data from S3" Feb 14, 2023
@jwaller59
Author

Screen Shot 2023-02-14 at 11 37 58 am

@pyrito
Collaborator

pyrito commented Feb 15, 2023

Hi @jwaller59, thanks for opening the issue! @modin-project/modin-core, can folks who have used Modin on a cluster provide some insight here?

@pyrito pyrito added Memory 💾 Issues related to memory Cloud ☁️ Modin in the Cloud and removed Triage 🩹 Issues that need triage labels Feb 15, 2023
@jwaller59
Author

It seems to be an issue specific to these files. Modin works as expected when retrieving Parquet datasets from other S3 folders. The only key differences between the data retrieved in the session above, where only 3 CPUs are triggered, and the other sessions are, as described previously:

Data size - each file is 120 MB of Snappy-compressed Parquet.

Partitioning within S3 - due to the nature of the files, there is no "partitioning" within the S3 bucket for the Parquet files, as there is no reasonable partition key for the dataset.

Not sure if one of these could be what's driving this problem.
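To compare these files with the datasets that load as expected, it may help to look at the Parquet footers directly (row count, row groups, columns). The following is a sketch under assumptions: it uses pyarrow and fsspec (with s3fs installed for s3:// URLs), and `summarize_footer` and `inspect_file` are hypothetical helper names. Whether row-group or column layout affects how Modin splits the read is not confirmed anywhere in this thread.

```python
from typing import Any, Dict


def summarize_footer(meta: Any) -> Dict[str, float]:
    """Condense Parquet file metadata into a few comparable numbers."""
    groups = max(meta.num_row_groups, 1)  # avoid dividing by zero
    return {
        "rows": meta.num_rows,
        "row_groups": meta.num_row_groups,
        "columns": meta.num_columns,
        "rows_per_group": meta.num_rows / groups,
    }


def inspect_file(url: str) -> Dict[str, float]:
    """Read only the footer of one Parquet file, e.g. an s3:// URL."""
    # Imported here so summarize_footer works without these packages.
    import fsspec
    import pyarrow.parquet as pq

    with fsspec.open(url, "rb") as f:
        return summarize_footer(pq.ParquetFile(f).metadata)
```

Running `inspect_file` on one file from this bucket and one from a folder that loads in parallel gives two small dictionaries that can be compared side by side.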

@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023