
Speed up ExpireSnapshotAction Test by Reducing Shuffle Parallelism #1362

Merged
merged 1 commit into from
Aug 20, 2020

Conversation

RussellSpitzer
Member

Because we use LocalIterator in ExpireSnapshotAction, every partition
runs its own Spark job, almost all of which are completely empty. This
leads to a lot of overhead which we don't need in the test suite. Setting
shuffle parallelism to 1 (from 200) greatly reduces the test runtime.
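
To make the job-per-partition cost concrete, here is a toy model in plain Java. It does not use Spark; `partition` and `jobsToDrain` are hypothetical helpers, and placing every row in partition 0 is an assumption that stands in for a shuffle whose output is far smaller than its partition count. In a real test the equivalent knob is presumably Spark's `spark.sql.shuffle.partitions`, which defaults to 200.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LocalIteratorModel {

    // Toy shuffle: puts every row in partition 0 and leaves the rest empty
    // (an assumption made for illustration, not how Spark hashes rows).
    static List<List<String>> partition(List<String> rows, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        partitions.get(0).addAll(rows);
        return partitions;
    }

    // Drains partitions one at a time, as a local iterator does, and returns
    // how many per-partition fetches ("jobs") were needed.
    static int jobsToDrain(List<List<String>> partitions) {
        int jobs = 0;
        for (List<String> p : partitions) {
            jobs++; // one job per partition, even when the partition is empty
            Iterator<String> it = p.iterator();
            while (it.hasNext()) {
                it.next();
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.avro", "b.avro", "c.avro");
        System.out.println(jobsToDrain(partition(files, 200))); // 200
        System.out.println(jobsToDrain(partition(files, 1)));   // 1
    }
}
```

The same three rows cost 200 fetches with 200 partitions and a single fetch with one partition, which is the overhead the test suite was paying.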

@rdblue
Contributor

rdblue commented Aug 20, 2020

Thanks! That looks much better.

@rdblue rdblue merged commit 1af5a8f into apache:master Aug 20, 2020
@rdblue
Contributor

rdblue commented Aug 20, 2020

@RussellSpitzer, @aokolnychyi, if using the local iterator causes a job per task to be submitted to Spark, should we avoid using it?

If every file to delete takes up 500 bytes in memory, then the driver can hold 4 million files in 2 GB. That seems reasonable to me, so we may be over-optimizing by using the iterator instead of just collecting the data back to the driver.
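
The estimate above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the 500-byte figure is the assumption stated in the comment, and the class and method names are hypothetical.

```java
public class DriverMemoryEstimate {

    // Total bytes needed to hold fileCount entries of bytesPerFile each.
    // multiplyExact throws ArithmeticException on overflow instead of wrapping.
    static long bytesNeeded(long fileCount, long bytesPerFile) {
        return Math.multiplyExact(fileCount, bytesPerFile);
    }

    // How many entries fit in a given memory budget (integer division).
    static long maxFiles(long budgetBytes, long bytesPerFile) {
        return budgetBytes / bytesPerFile;
    }

    public static void main(String[] args) {
        // 4 million files at the assumed 500 bytes each.
        System.out.println(bytesNeeded(4_000_000L, 500L));    // 2000000000
        // Files that fit in a 2 GB (decimal) budget.
        System.out.println(maxFiles(2_000_000_000L, 500L));   // 4000000
        // Files that fit in a 2 GiB (binary) budget.
        System.out.println(maxFiles(2L << 30, 500L));         // 4294967
    }
}
```

Either way of counting 2 GB lands at roughly 4 million entries, so the figure holds as long as the per-file size assumption does.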

@RussellSpitzer
Member Author

RussellSpitzer commented Aug 20, 2020 via email

@rdblue
Contributor

rdblue commented Aug 20, 2020

That sounds good to me!

@RussellSpitzer
Member Author

RussellSpitzer commented Aug 21, 2020 via email

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Aug 24, 2020
…lelism (apache#1362)
