Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for pushdown like filter (endsWith and contains) #9683

Closed
wants to merge 2 commits into from

Conversation

yabola
Copy link
Contributor

@yabola yabola commented Feb 8, 2024

This PR supports pushing down endsWith (like %x) and contains (like %x%) to Iceberg. The benefits are:

  1. Support for early filtering of partition columns in partitioned tables, before this PR iceberg needs to scan the entire table.
  2. Support for pushdown agg under certain scenarios (before endWiths or contains could not be pushed down, so it could not pushdown agg).
  3. Support for filtering files using Parquet dictionaries for regular columns in tables.

Before this PR, iceberg only support startWith.

@yabola yabola changed the title Support pushdown like filter (endsWith and contains) Support for pushdown like filter (endsWith and contains) Feb 8, 2024
@yabola
Copy link
Contributor Author

yabola commented Feb 8, 2024

I have an example of performance comparison.
Table : p_lineorder_ice has partition column LO_ORDERDATE.

  1. pushdown partition column
    test sql: select * from ice.ssb10.p_lineorder_ice where LO_ORDERDATE like '%01'
    before this pr:
image

after this pr:
image

  1. pushdown agg column
    test sql: select count(1) from ice.ssb10.p_lineorder_ice where LO_ORDERDATE like '%01';
    before this PR:
image

after this PR:
image

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yabola This is very cool, could we break up the PR though for easier review? Let me take a deeper look at the code before I propose a way to break it up into separate PRs. On the surface though it seems like we should be able to break apart the expression changes in API/Core and then have separate PRs for the file formats and Spark integration?

@amogh-jahagirdar
Copy link
Contributor

amogh-jahagirdar commented Feb 8, 2024

Also, I'll need to think more if we can actually support this for delete files. If not, this will need to only be applied for CoW tables. For example, for agg pushdown, we don't support MoR tables (well unless it's compacted)

@yabola
Copy link
Contributor Author

yabola commented Feb 9, 2024

@yabola This is very cool, could we break up the PR though for easier review? Let me take a deeper look at the code before I propose a way to break it up into separate PRs. On the surface though it seems like we should be able to break apart the expression changes in API/Core and then have separate PRs for the file formats and Spark integration?

I will break up my PR. Thanks for your advice.

@yabola
Copy link
Contributor Author

yabola commented Feb 11, 2024

Also, I'll need to think more if we can actually support this for delete files. If not, this will need to only be applied for CoW tables. For example, for agg pushdown, we don't support MoR tables (well unless it's compacted)

@amogh-jahagirdar I checked the code. When there are delete rows, it won't pushdown agg and I have tested it.

if (!task.deletes().isEmpty()) {
LOG.info("Skipping aggregate pushdown: detected row level deletes");
return false;
}

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 14, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 22, 2024
@yabola
Copy link
Contributor Author

yabola commented Feb 10, 2025

@rdblue @amogh-jahagirdar Hi~ could I reopen this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants