Core: Switch usage to DataFileSet / DeleteFileSet #11158

Open
nastra wants to merge 4 commits into main from content-file-comparator

@nastra nastra commented Sep 18, 2024

Usage of DataFileSet / DeleteFileSet
====================================

Benchmark                                       (numFiles)  Mode  Cnt  Score   Error  Units
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000    ss    5  0.892 ± 1.837   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000    ss    5  1.544 ± 0.139   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000    ss    5  2.487 ± 0.542   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000    ss    5  2.534 ± 1.214   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2500000    ss    5  6.490 ± 2.507   s/op


main
=====
Benchmark                                       (numFiles)  Mode  Cnt  Score   Error  Units
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000    ss    5  0.897 ± 1.884   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000    ss    5  1.518 ± 0.133   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000    ss    5  2.622 ± 0.862   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000    ss    5  2.540 ± 0.922   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2500000    ss    5  6.608 ± 3.326   s/op
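
As a rough way to read the two tables above side by side, the sketch below computes the relative change in mean score per numFiles between this branch and main. The scores are copied from the tables; the class and method names are made up for illustration, and the Error column is ignored, so this says nothing about statistical significance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BenchmarkDelta {
  // Relative change of the branch score versus main, in percent (negative means faster).
  static double relativeChangePercent(double branch, double main) {
    return (branch - main) / main * 100.0;
  }

  public static void main(String[] args) {
    // numFiles -> {branch score, main score} in s/op, copied from the tables above.
    Map<Integer, double[]> scores = new LinkedHashMap<>();
    scores.put(50_000, new double[] {0.892, 0.897});
    scores.put(100_000, new double[] {1.544, 1.518});
    scores.put(500_000, new double[] {2.487, 2.622});
    scores.put(1_000_000, new double[] {2.534, 2.540});
    scores.put(2_500_000, new double[] {6.490, 6.608});

    for (Map.Entry<Integer, double[]> entry : scores.entrySet()) {
      double[] s = entry.getValue();
      System.out.printf(
          "%8d files: %+.2f%%%n", entry.getKey(), relativeChangePercent(s[0], s[1]));
    }
  }
}
```

Every difference is within roughly 5% of main and smaller than the reported error for that row, which is consistent with no measurable regression.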

@github-actions github-actions bot added the spark label Sep 18, 2024
@nastra nastra force-pushed the content-file-comparator branch 3 times, most recently from 5c25dd3 to 10150bf Compare September 18, 2024 13:12
@nastra nastra marked this pull request as draft September 18, 2024 13:56
@nastra nastra force-pushed the content-file-comparator branch 2 times, most recently from 1e0b413 to 54cffb4 Compare September 19, 2024 06:42
@nastra nastra changed the title Core: Add ContentFile comparator Core: Add Data/Delete file comparators Sep 19, 2024
@nastra nastra force-pushed the content-file-comparator branch 3 times, most recently from 4dbf58c to 73638a6 Compare September 19, 2024 13:58
@nastra nastra force-pushed the content-file-comparator branch 5 times, most recently from 70dcd1a to 6fc4e35 Compare September 19, 2024 14:36
@nastra nastra force-pushed the content-file-comparator branch 4 times, most recently from 1e6dfd7 to 8002680 Compare September 26, 2024 13:33
@nastra nastra changed the title Core: Add Data/Delete file comparators Core: Switch all places to use DataFileSet / DeleteFileSet Sep 30, 2024
@nastra nastra closed this Sep 30, 2024
@nastra nastra reopened this Sep 30, 2024
@nastra nastra force-pushed the content-file-comparator branch 3 times, most recently from 95d12ef to 08ccda9 Compare October 1, 2024 06:17
@github-actions github-actions bot removed the API label Oct 1, 2024
@nastra nastra changed the title Core: Switch all places to use DataFileSet / DeleteFileSet Core: Switch usage to DataFileSet / DeleteFileSet Oct 1, 2024
@nastra nastra marked this pull request as ready for review October 1, 2024 06:52
core/src/main/java/org/apache/iceberg/FastAppend.java (outdated, resolved)
private final CharSequenceSet newDataFilePaths = CharSequenceSet.empty();
private final CharSequenceSet newDeleteFilePaths = CharSequenceSet.empty();
private final DataFileSet newDataFiles = DataFileSet.create();
private final DeleteFileSet newDeleteFiles = DeleteFileSet.create();
Contributor

Do we need these extra collections? Can't we use sets in newDataFilesBySpec and newDeleteFilesBySpec?

Contributor Author

I'm already handling this in 4d42f18. I just didn't want to introduce too many changes/refactorings here, as the PR is already quite large.
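
To make the suggestion above concrete, here is a minimal sketch of keeping one set per partition spec ID so that no separate flat collection is needed. It uses plain Java collections and a made-up class as stand-ins; it is not the Iceberg `DataFileSet` API and not the code in 4d42f18, and it assumes two files count as the same file when their paths match.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: tracks added file paths grouped by partition spec ID.
class PerSpecFilesSketch {
  // spec ID -> paths already added for that spec (insertion order is kept).
  private final Map<Integer, Set<String>> pathsBySpec = new LinkedHashMap<>();

  // Returns true if the path was new for this spec, false if it was a duplicate.
  boolean add(int specId, String path) {
    return pathsBySpec.computeIfAbsent(specId, id -> new LinkedHashSet<>()).add(path);
  }

  // Membership check that only needs the per-spec sets.
  boolean contains(int specId, String path) {
    Set<String> paths = pathsBySpec.get(specId);
    return paths != null && paths.contains(path);
  }

  // Total number of distinct files across all specs.
  int totalFiles() {
    return pathsBySpec.values().stream().mapToInt(Set::size).sum();
  }

  public static void main(String[] args) {
    PerSpecFilesSketch files = new PerSpecFilesSketch();
    files.add(0, "s3://bucket/data/a.parquet");
    files.add(0, "s3://bucket/data/a.parquet"); // duplicate, ignored
    files.add(1, "s3://bucket/data/b.parquet");
    System.out.println("distinct files: " + files.totalFiles());
  }
}
```

Because the per-spec sets already reject duplicates, a second flat set of paths would only repeat the same bookkeeping. The sketch relies on each file belonging to exactly one spec.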

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Discussed with @nastra offline. Overall I feel the change is good, but I think it's worth running some benchmarks as a sanity check. There really shouldn't be much of a change afterwards; it's more to make sure there isn't a regression we're not expecting.

@nastra nastra force-pushed the content-file-comparator branch 2 times, most recently from 83c6052 to 0df0066 Compare October 9, 2024 12:41
Contributor Author

nastra commented Oct 9, 2024

> Discussed with @nastra offline. Overall I feel the change is good, but I think it's worth running some benchmarks as a sanity check. There really shouldn't be much of a change afterwards; it's more to make sure there isn't a regression we're not expecting.

@amogh-jahagirdar I added benchmark results to the PR description

3 participants