[Iceberg] Hash Bucket Aware Table Joins #189

Open
pdames opened this issue Aug 16, 2023 · 2 comments
Assignees
Labels
iceberg This issue is related to Apache Iceberg catalog support

Comments

@pdames
Member

pdames commented Aug 16, 2023

When joining two hash-partitioned Iceberg tables on their hash-partitioned columns, we should do one of the following: (1) pass the hash bucket that each file belongs to (if any) as a hint to the join compute engine (e.g. Daft), so that it can automatically prune files whose records are known not to satisfy the join predicate, or (2) prune these files ourselves before handing them off to the compute engine.

The 2nd approach is easier to extend to more compute engines, since it doesn't require the underlying engine to support hint-based pruning, and may thus be preferred in the short term. The 1st approach offers a clearer separation of responsibilities between DeltaCAT and the compute engine, which should aid long-term maintainability.
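
To make the 2nd approach concrete, here is a rough sketch of the pruning step for an inner equi-join. It assumes both tables are bucketed on the join column with the same hash function and the same bucket count, so that equal keys always land in the same bucket index. `DataFile` and `prune_for_inner_join` are hypothetical names used only for illustration; they are not DeltaCAT, Iceberg, or Daft APIs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry (not a real DeltaCAT type)."""
    path: str
    bucket: Optional[int]  # None means the file's hash bucket is unknown


def prune_for_inner_join(
    left: List[DataFile],
    right: List[DataFile],
) -> Tuple[List[DataFile], List[DataFile]]:
    """Drop files that cannot contribute rows to an inner equi-join.

    Assumption: both sides use the same hash function and bucket count on
    the join column, so matching keys share a bucket index.
    """

    def survivors(files: List[DataFile], other: List[DataFile]) -> List[DataFile]:
        other_buckets = {f.bucket for f in other}
        # If the other side has a file with an unknown bucket, that file may
        # hold any key, so nothing on this side can be pruned safely.
        if None in other_buckets:
            return list(files)
        # Keep files with an unknown bucket (conservative) and files whose
        # bucket also appears on the other side.
        return [f for f in files if f.bucket is None or f.bucket in other_buckets]

    return survivors(left, right), survivors(right, left)


left = [DataFile("l0", 0), DataFile("l1", 1), DataFile("l2", 2)]
right = [DataFile("r1", 1), DataFile("r2", 2), DataFile("r3", 3)]
kept_left, kept_right = prune_for_inner_join(left, right)
```

In this example only the files in buckets 1 and 2 survive on each side. Files without a known bucket are always kept, which gives up some pruning in exchange for never dropping a file that might hold matching records. Outer joins would need different rules, since unmatched rows on the preserved side must still be read.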

@pdames
Member Author

pdames commented Aug 16, 2023

Initial distributed join integration testing will depend on completion of: #190

@pdames
Member Author

pdames commented Aug 16, 2023

There's also an opportunity to share code with the general cross-catalog support for hash-bucketed compaction tracked at #150, since both compute problems depend in part on efficiently detecting which files may contain records with one or more equal field values.

@pdames pdames added the iceberg This issue is related to Apache Iceberg catalog support label Aug 16, 2023
@pdames pdames self-assigned this Aug 16, 2023